Message ID | 1513632128-13492-1-git-send-email-yaodong.li@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Jackie Li (2017-12-18 21:22:08) > From: Zhipeng Gong <zhipeng.gong@intel.com> > > SKL platforms requires a higher ring multiplier when there's massive > GPU load. Current driver doesn't provide a way to override the ring > multiplier. > > This patch adds a new module parameter to allow the overriding of > ring multiplier for Gen9 platforms. So the default ring-scaling is not good enough, the first thing we do is to try and ensure the defaults work for nearly all use cases. My impression is that you want a nonlinear scalefactor, low power workloads don't try and ramp up the ring frequencies as aggressively, high power workloads try hard for higher frequencies, and then get throttled back harder as well. How well can we autotune it? What events tells us if the ratio is too high or too low? Adding an untested unsafe modparam is a big no. If you want us to support such a parameter, we at least need it not to taint the kernel and come with some tests that prove it functions as intended. Better still might be a I915_SETPARAM (or hooking up the write func of the modparam) to support changing the value on the fly. -Chris
On 12/18/2017 01:47 PM, Chris Wilson wrote: > Quoting Jackie Li (2017-12-18 21:22:08) >> From: Zhipeng Gong <zhipeng.gong@intel.com> >> >> SKL platforms requires a higher ring multiplier when there's massive >> GPU load. Current driver doesn't provide a way to override the ring >> multiplier. >> >> This patch adds a new module parameter to allow the overriding of >> ring multiplier for Gen9 platforms. > So the default ring-scaling is not good enough, the first thing we do is > to try and ensure the defaults work for nearly all use cases. My > impression is that you want a nonlinear scalefactor, low power workloads > don't try and ramp up the ring frequencies as aggressively, high power > workloads try hard for higher frequencies, and then get throttled back > harder as well. How well can we autotune it? What events tells us if the > ratio is too high or too low? Thanks Chris! the background was that we found that the performance would drop on some SKL platforms with the default ring scaling factor. and use of a 3x ring scaler can solve this problem. One of the trigger of this issue is when there's massive GPU workloads while having little/negligible CPU load - currently, we have no way to monitor GPU/CPU workload in our driver. This is the reason why we use a modparam. Do you have any suggestion about this? Zhipeng and Dmitry may add more insights on this issue and related tuning. > Adding an untested unsafe modparam is a big no. If you want us to > support such a parameter, we at least need it not to taint the kernel > and come with some tests that prove it functions as intended. Better > still might be a I915_SETPARAM (or hooking up the write func of the > modparam) to support changing the value on the fly. the intention of this patch was to add a way to override the default 2x ring scaling factor so that we would be able choose a different ring scaler for SKL platforms which were known that would have massive GPU workload. And related changes (using a 3x scaler) had been tested. It would be great if we can do this on the fly. However, we're going to need a way to monitor the trigger events in order to change the value dynamically. By saying use I915_SETPARAM (or different write func), are you suggesting to request the new ring freq directly while setting the new ring scaling factor? Regards, -Jackie > -Chris
Quoting Chris Wilson (2017-12-18 21:47:25) > Quoting Jackie Li (2017-12-18 21:22:08) > > From: Zhipeng Gong <zhipeng.gong@intel.com> > > > > SKL platforms requires a higher ring multiplier when there's massive > > GPU load. Current driver doesn't provide a way to override the ring > > multiplier. > > > > This patch adds a new module parameter to allow the overriding of > > ring multiplier for Gen9 platforms. > > So the default ring-scaling is not good enough, the first thing we do is > to try and ensure the defaults work for nearly all use cases. My > impression is that you want a nonlinear scalefactor, low power workloads > don't try and ramp up the ring frequencies as aggressively, high power > workloads try hard for higher frequencies, and then get throttled back > harder as well. How well can we autotune it? What events tells us if the > ratio is too high or too low? One thing that came to mind is that we don't know the min/max ring frequencies and just program them blindly. Is it the case that at max gpu freq, there is still headroom on the ring freq, or do you require a steeper ramp so that you hit the max ringfreq earlier for your workload (which then presumably can run at less than max gpufreq, so pushing power elsewhere). -Chris
Clarification on the issue. Consider that you have a massive load on GT and just tiny one on IA. If GT will program the RING frequency to be lower than IA frequency, then you will fall into the situation when RING frequency constantly transits from GT to IA level and back. Each transition of a RING frequency is a full system stall. If you will have "good" transition rate with few transitions per few milliseconds you will lose ~10% of performance. That's the case for media workloads when you easily can step into this since 1) media utilizes few GPU engines and with few parallel workloads you can make sure that at least 1 engine is _always_ doing something, 2) media BB are relatively small, so you have regular wakeups of the IA to manage requests. This will affect Gen9 platforms due to HW design change (we've spot this in SKL). This will not happen in Gen8 (old HW design). This will be fixed in Gen10+ (CNL+). On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. Multipliers were x2 for GT, x1 for IA. So, effectively, what we need to do is to make sure that RING frequency request from GT is _not_ below the request from IA. If IA requests 2GHz, we can't request 1.4GHz, we need request at least 2GHz. Multiplier patch was intended to do exactly that, but manually. Can we somehow automate that managing IA frequency requests to the RING? Dmitry. -----Original Message----- From: Chris Wilson [mailto:chris@chris-wilson.co.uk] Sent: Tuesday, December 26, 2017 6:36 AM To: Li, Yaodong <yaodong.li@intel.com>; intel-gfx@lists.freedesktop.org Cc: Gong, Zhipeng <zhipeng.gong@intel.com>; Widawsky, Benjamin <benjamin.widawsky@intel.com>; Mateo Lozano, Oscar <oscar.mateo@intel.com>; Kamble, Sagar A <sagar.a.kamble@intel.com>; Rogozhkin, Dmitry V <dmitry.v.rogozhkin@intel.com>; Li, Yaodong <yaodong.li@intel.com> Subject: Re: [RFC] drm/i915: Add a new modparam for customized ring multiplier Quoting Chris Wilson (2017-12-18 21:47:25) > Quoting Jackie Li (2017-12-18 21:22:08) > > From: Zhipeng Gong <zhipeng.gong@intel.com> > > > > SKL platforms requires a higher ring multiplier when there's massive > > GPU load. Current driver doesn't provide a way to override the ring > > multiplier. > > > > This patch adds a new module parameter to allow the overriding of > > ring multiplier for Gen9 platforms. > > So the default ring-scaling is not good enough, the first thing we do > is to try and ensure the defaults work for nearly all use cases. My > impression is that you want a nonlinear scalefactor, low power > workloads don't try and ramp up the ring frequencies as aggressively, > high power workloads try hard for higher frequencies, and then get > throttled back harder as well. How well can we autotune it? What > events tells us if the ratio is too high or too low? One thing that came to mind is that we don't know the min/max ring frequencies and just program them blindly. Is it the case that at max gpu freq, there is still headroom on the ring freq, or do you require a steeper ramp so that you hit the max ringfreq earlier for your workload (which then presumably can run at less than max gpufreq, so pushing power elsewhere). -Chris
Quoting Rogozhkin, Dmitry V (2017-12-26 16:39:23) > Clarification on the issue. Consider that you have a massive load on GT and just tiny one on IA. If GT will program the RING frequency to be lower than IA frequency, then you will fall into the situation when RING frequency constantly transits from GT to IA level and back. Each transition of a RING frequency is a full system stall. If you will have "good" transition rate with few transitions per few milliseconds you will lose ~10% of performance. That's the case for media workloads when you easily can step into this since 1) media utilizes few GPU engines and with few parallel workloads you can make sure that at least 1 engine is _always_ doing something, 2) media BB are relatively small, so you have regular wakeups of the IA to manage requests. This will affect Gen9 platforms due to HW design change (we've spot this in SKL). This will not happen in Gen8 (old HW design). This will be fixed in Gen10+ (CNL+). To clarify, the HW will flip between the two GT/IA requests rather than stick to the highest? Iirc, the expectation was that we were setting a requested minimum frequency for the ring/ia based off the gpu freq. > On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. Multipliers were x2 for GT, x1 for IA. Basically, with the GPU clocked to mid frequency, memory throughput is insufficient to keep the fixed functions occupied, and you need to increase the ring frequency. Is there ever a case where we don't need max ring frequency? (Perhaps we still need to set low frequency for GT idle?) I guess media is more susceptible to this as that workload should be sustainable at reduced clocks, GL et al are much more likely to keep the clocks ramped all the way up. Do you know anything about the opposite position. I heard a suggestion that simply increasing the ringfreq universally caused thermal throttling in some other workloads. Do you have any knowledge of those? > So, effectively, what we need to do is to make sure that RING frequency request from GT is _not_ below the request from IA. If IA requests 2GHz, we can't request 1.4GHz, we need request at least 2GHz. Multiplier patch was intended to do exactly that, but manually. Can we somehow automate that managing IA frequency requests to the RING? You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? That seems possible, certainly. I'm not sure if the ring frequency is actually poked from anywhere else in the kernel, would be interesting to find out. -Chris
>> To clarify, the HW will flip between the two GT/IA requests rather than stick to the highest? Yes, it will flip on Gen9. On Gen8 there was some mechanism (HW) which flattened that. But it was removed/substituted in Gen9. In Gen10 it was tuned to close the mentioned issue. >> Do you know anything about the opposite position. I heard a suggestion that simply increasing the ringfreq universally caused thermal throttling in some other workloads. Do you have any knowledge of those? Initially we tried to just increase GT multiplier to x3 and stepped into the throttling. Thus, we introduced parameter to be able to mitigate all that depending on the SKU and user needs. I definitely asked what will be if GT request will be bigger than IA request. But that was a year ago and I don't remember the answer. Let me ask again. I will mail back in few days. >> You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? Yep. This seems a correct step to give some automatic support instead of parameter/hardcoded multiplier. Dmitry. -----Original Message----- From: Chris Wilson [mailto:chris@chris-wilson.co.uk] Sent: Tuesday, December 26, 2017 8:59 AM To: Rogozhkin, Dmitry V <dmitry.v.rogozhkin@intel.com>; Li, Yaodong <yaodong.li@intel.com>; intel-gfx@lists.freedesktop.org Cc: Gong, Zhipeng <zhipeng.gong@intel.com>; Widawsky, Benjamin <benjamin.widawsky@intel.com>; Mateo Lozano, Oscar <oscar.mateo@intel.com>; Kamble, Sagar A <sagar.a.kamble@intel.com>; Li, Yaodong <yaodong.li@intel.com> Subject: RE: [RFC] drm/i915: Add a new modparam for customized ring multiplier Quoting Rogozhkin, Dmitry V (2017-12-26 16:39:23) > Clarification on the issue. Consider that you have a massive load on GT and just tiny one on IA. If GT will program the RING frequency to be lower than IA frequency, then you will fall into the situation when RING frequency constantly transits from GT to IA level and back. Each transition of a RING frequency is a full system stall. If you will have "good" transition rate with few transitions per few milliseconds you will lose ~10% of performance. That's the case for media workloads when you easily can step into this since 1) media utilizes few GPU engines and with few parallel workloads you can make sure that at least 1 engine is _always_ doing something, 2) media BB are relatively small, so you have regular wakeups of the IA to manage requests. This will affect Gen9 platforms due to HW design change (we've spot this in SKL). This will not happen in Gen8 (old HW design). This will be fixed in Gen10+ (CNL+). To clarify, the HW will flip between the two GT/IA requests rather than stick to the highest? Iirc, the expectation was that we were setting a requested minimum frequency for the ring/ia based off the gpu freq. > On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. Multipliers were x2 for GT, x1 for IA. Basically, with the GPU clocked to mid frequency, memory throughput is insufficient to keep the fixed functions occupied, and you need to increase the ring frequency. Is there ever a case where we don't need max ring frequency? (Perhaps we still need to set low frequency for GT idle?) I guess media is more susceptible to this as that workload should be sustainable at reduced clocks, GL et al are much more likely to keep the clocks ramped all the way up. Do you know anything about the opposite position. I heard a suggestion that simply increasing the ringfreq universally caused thermal throttling in some other workloads. Do you have any knowledge of those? > So, effectively, what we need to do is to make sure that RING frequency request from GT is _not_ below the request from IA. If IA requests 2GHz, we can't request 1.4GHz, we need request at least 2GHz. Multiplier patch was intended to do exactly that, but manually. Can we somehow automate that managing IA frequency requests to the RING? You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? That seems possible, certainly. I'm not sure if the ring frequency is actually poked from anywhere else in the kernel, would be interesting to find out. -Chris
>> I definitely asked what will be if GT request will be bigger than IA request. But that was a year ago and I don't remember the answer. Let me ask again. I will mail back in few days. Hi Chris, here is the response. Question was: "Whether we can meet with the RING transition penalty (at least theoretically) if we will have GT request higher than IA request with the dominant IA load and tiny GT load, i.e. reverted situation of what we have actually faced? For example, if we will try to pin IA frequency to 800MHz (x1 multiplier) and GT frequency to 700MHz (x2 multiplier): in that case we will have requests for ring 800 vs. 1400." Answer is: "In this case, if the GT will toggle between RC0 and RC6, it will force ring frequency to toggle between 800 and 1400, which in the toggling time will stall IA execution. This will lead to performance loss." However, this is a case if we have really few toggle events within few milliseconds. It is quite probable that GT driver will not allow such behavior to happen if it simply doesn't often toggle between RC0 and RC6. Considering that GT driver probably handles much less interrupts than IA, this can be the case. So, I think Chris, that's now question to you: how often you toggle between RC0 and RC6 to see the reverted issue to happen? If you don't toggle much, then RING will simply remain on 1400 almost all the time and you will see no issue. Again, I remind that's the talk about Gen9 only. Dmitry. -----Original Message----- From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf Of Rogozhkin, Dmitry V Sent: Tuesday, December 26, 2017 9:39 AM To: Chris Wilson <chris@chris-wilson.co.uk>; Li, Yaodong <yaodong.li@intel.com>; intel-gfx@lists.freedesktop.org Cc: Widawsky, Benjamin <benjamin.widawsky@intel.com> Subject: Re: [Intel-gfx] [RFC] drm/i915: Add a new modparam for customized ring multiplier >> To clarify, the HW will flip between the two GT/IA requests rather than stick to the highest? Yes, it will flip on Gen9. On Gen8 there was some mechanism (HW) which flattened that. But it was removed/substituted in Gen9. In Gen10 it was tuned to close the mentioned issue. >> Do you know anything about the opposite position. I heard a suggestion that simply increasing the ringfreq universally caused thermal throttling in some other workloads. Do you have any knowledge of those? Initially we tried to just increase GT multiplier to x3 and stepped into the throttling. Thus, we introduced parameter to be able to mitigate all that depending on the SKU and user needs. I definitely asked what will be if GT request will be bigger than IA request. But that was a year ago and I don't remember the answer. Let me ask again. I will mail back in few days. >> You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? Yep. This seems a correct step to give some automatic support instead of parameter/hardcoded multiplier. Dmitry. -----Original Message----- From: Chris Wilson [mailto:chris@chris-wilson.co.uk] Sent: Tuesday, December 26, 2017 8:59 AM To: Rogozhkin, Dmitry V <dmitry.v.rogozhkin@intel.com>; Li, Yaodong <yaodong.li@intel.com>; intel-gfx@lists.freedesktop.org Cc: Gong, Zhipeng <zhipeng.gong@intel.com>; Widawsky, Benjamin <benjamin.widawsky@intel.com>; Mateo Lozano, Oscar <oscar.mateo@intel.com>; Kamble, Sagar A <sagar.a.kamble@intel.com>; Li, Yaodong <yaodong.li@intel.com> Subject: RE: [RFC] drm/i915: Add a new modparam for customized ring multiplier Quoting Rogozhkin, Dmitry V (2017-12-26 16:39:23) > Clarification on the issue. Consider that you have a massive load on GT and just tiny one on IA. If GT will program the RING frequency to be lower than IA frequency, then you will fall into the situation when RING frequency constantly transits from GT to IA level and back. Each transition of a RING frequency is a full system stall. If you will have "good" transition rate with few transitions per few milliseconds you will lose ~10% of performance. That's the case for media workloads when you easily can step into this since 1) media utilizes few GPU engines and with few parallel workloads you can make sure that at least 1 engine is _always_ doing something, 2) media BB are relatively small, so you have regular wakeups of the IA to manage requests. This will affect Gen9 platforms due to HW design change (we've spot this in SKL). This will not happen in Gen8 (old HW design). This will be fixed in Gen10+ (CNL+). To clarify, the HW will flip between the two GT/IA requests rather than stick to the highest? Iirc, the expectation was that we were setting a requested minimum frequency for the ring/ia based off the gpu freq. > On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. Multipliers were x2 for GT, x1 for IA. Basically, with the GPU clocked to mid frequency, memory throughput is insufficient to keep the fixed functions occupied, and you need to increase the ring frequency. Is there ever a case where we don't need max ring frequency? (Perhaps we still need to set low frequency for GT idle?) I guess media is more susceptible to this as that workload should be sustainable at reduced clocks, GL et al are much more likely to keep the clocks ramped all the way up. Do you know anything about the opposite position. I heard a suggestion that simply increasing the ringfreq universally caused thermal throttling in some other workloads. Do you have any knowledge of those? > So, effectively, what we need to do is to make sure that RING frequency request from GT is _not_ below the request from IA. If IA requests 2GHz, we can't request 1.4GHz, we need request at least 2GHz. Multiplier patch was intended to do exactly that, but manually. Can we somehow automate that managing IA frequency requests to the RING? You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? That seems possible, certainly. I'm not sure if the ring frequency is actually poked from anywhere else in the kernel, would be interesting to find out. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>> You are thinking of plugging into intel_pstate to make it smarter for ia freq transitions? > Yep. This seems a correct step to give some automatic support instead of parameter/hardcoded multiplier. > Does this mean we should use cpufreq/intel_pstate based approach instead of the current modparam solution for Gen9? Some concerns and questions about intel_pstate approach: a) Currently, we cannot get the accurate pstate/target freq value from cpufreq in intel_pstate active mode since these values won't be exported to cpufreq layer, so if we won't change intel_pstate code then we only can get the max cpu freq of a new policy. b) intel_pstate policy is attached to each logic cpu, which means we will receive policy/freq transition notification for each logic cpu freq change. One question is how we are going to decide the freq of the ring? just use the max cpu freq reported? c) With the intel_pstate approach we may still run into thermal throttling, in this case, can a certain cooling device be triggered to lower the cpu freq? Thanks and Regards, -Jackie
Since ring frequency programming needs consideration of both IA and GT frequency requests I think keeping the logic to program the ring frequency table in driver that monitors both IA/GT busyness and power budgets like intel_ips will be more appropriate. intel_ips is relying on global load derived from all CPUs. I understand that power awareness and busyness based policy might be trickier but having that as tunable will give better flexibility. On 1/3/2018 11:51 PM, Yaodong Li wrote: > >>>> You are thinking of plugging into intel_pstate to make it smarter >>>> for ia freq transitions? >> Yep. This seems a correct step to give some automatic support instead >> of parameter/hardcoded multiplier. >> > Does this mean we should use cpufreq/intel_pstate based approach > instead of the current modparam solution for Gen9? > > Some concerns and questions about intel_pstate approach: > a) Currently, we cannot get the accurate pstate/target freq value from > cpufreq in intel_pstate active mode since > these values won't be exported to cpufreq layer, so if we won't > change intel_pstate code then we only can get > the max cpu freq of a new policy. > b) intel_pstate policy is attached to each logic cpu, which means we > will receive policy/freq transition notification > for each logic cpu freq change. One question is how we are going > to decide the freq of the ring? just use the max > cpu freq reported? > c) With the intel_pstate approach we may still run into thermal > throttling, in this case, can a certain cooling device > be triggered to lower the cpu freq? > > Thanks and Regards, > -Jackie >
On 01/03/2018 10:10 PM, Sagar Arun Kamble wrote: > Since ring frequency programming needs consideration of both IA and GT > frequency requests I think keeping the logic > to program the ring frequency table in driver that monitors both IA/GT > busyness and power budgets like intel_ips > will be more appropriate. intel_ips is relying on global load derived > from all CPUs. > I understand that power awareness and busyness based policy might be > trickier but having that as tunable will give better flexibility. > By just looking into the current code, the way intel_ips checks gpu busyness cannot reflect the actual workload of GT (e.g. gpu busy is true even if there's only one pending request), in this case, we shall not increase the ring freq if we want to use a "workload monitoring" based solution. so we need a more accurate way to monitor the current GT workload (e.g. when the pending request count reaches a center tunable threshold??). > On 1/3/2018 11:51 PM, Yaodong Li wrote: >> >>>>> You are thinking of plugging into intel_pstate to make it smarter >>>>> for ia freq transitions? >>> Yep. This seems a correct step to give some automatic support >>> instead of parameter/hardcoded multiplier. >>> >> Does this mean we should use cpufreq/intel_pstate based approach >> instead of the current modparam solution for Gen9? >> >> Some concerns and questions about intel_pstate approach: >> a) Currently, we cannot get the accurate pstate/target freq value >> from cpufreq in intel_pstate active mode since >> these values won't be exported to cpufreq layer, so if we won't >> change intel_pstate code then we only can get >> the max cpu freq of a new policy. >> b) intel_pstate policy is attached to each logic cpu, which means we >> will receive policy/freq transition notification >> for each logic cpu freq change. One question is how we are going >> to decide the freq of the ring? just use the max >> cpu freq reported? >> c) With the intel_pstate approach we may still run into thermal >> throttling, in this case, can a certain cooling device >> be triggered to lower the cpu freq? >> >> Thanks and Regards, >> -Jackie >> >
On 1/5/2018 3:22 AM, Yaodong Li wrote: > > On 01/03/2018 10:10 PM, Sagar Arun Kamble wrote: >> Since ring frequency programming needs consideration of both IA and >> GT frequency requests I think keeping the logic >> to program the ring frequency table in driver that monitors both >> IA/GT busyness and power budgets like intel_ips >> will be more appropriate. intel_ips is relying on global load derived >> from all CPUs. >> I understand that power awareness and busyness based policy might be >> trickier but having that as tunable will give better flexibility. >> > By just looking into the current code, the way intel_ips checks gpu > busyness cannot reflect the actual workload of GT > (e.g. gpu busy is true even if there's only one pending request), in > this case, we shall not increase the ring freq if we > want to use a "workload monitoring" based solution. so we need a more > accurate way to monitor the current GT workload > (e.g. when the pending request count reaches a center tunable > threshold??). Yes. May be we can share the PMU data about engine busyness with intel_ips. > >> On 1/3/2018 11:51 PM, Yaodong Li wrote: >>> >>>>>> You are thinking of plugging into intel_pstate to make it smarter >>>>>> for ia freq transitions? >>>> Yep. This seems a correct step to give some automatic support >>>> instead of parameter/hardcoded multiplier. >>>> >>> Does this mean we should use cpufreq/intel_pstate based approach >>> instead of the current modparam solution for Gen9? >>> >>> Some concerns and questions about intel_pstate approach: >>> a) Currently, we cannot get the accurate pstate/target freq value >>> from cpufreq in intel_pstate active mode since >>> these values won't be exported to cpufreq layer, so if we won't >>> change intel_pstate code then we only can get >>> the max cpu freq of a new policy. >>> b) intel_pstate policy is attached to each logic cpu, which means we >>> will receive policy/freq transition notification >>> for each logic cpu freq change. One question is how we are going >>> to decide the freq of the ring? just use the max >>> cpu freq reported? >>> c) With the intel_pstate approach we may still run into thermal >>> throttling, in this case, can a certain cooling device >>> be triggered to lower the cpu freq? >>> >>> Thanks and Regards, >>> -Jackie >>> >> >
On 01/05/2018 02:15 AM, Sagar Arun Kamble wrote: > > > On 1/5/2018 3:22 AM, Yaodong Li wrote: >> >> On 01/03/2018 10:10 PM, Sagar Arun Kamble wrote: >>> Since ring frequency programming needs consideration of both IA and >>> GT frequency requests I think keeping the logic >>> to program the ring frequency table in driver that monitors both >>> IA/GT busyness and power budgets like intel_ips >>> will be more appropriate. intel_ips is relying on global load >>> derived from all CPUs. >>> I understand that power awareness and busyness based policy might be >>> trickier but having that as tunable will give better flexibility. >>> >> By just looking into the current code, the way intel_ips checks gpu >> busyness cannot reflect the actual workload of GT >> (e.g. gpu busy is true even if there's only one pending request), in >> this case, we shall not increase the ring freq if we >> want to use a "workload monitoring" based solution. so we need a more >> accurate way to monitor the current GT workload >> (e.g. when the pending request count reaches a center tunable >> threshold??). > Yes. May be we can share the PMU data about engine busyness with > intel_ips. Thank you Sagar! Can you tell more about how we can get the gpu busyness from PMU data? I think the solution would be to set the ring freq table to use a ring freq >= current ia freq (for all possible gpu freq) once we found gpu workload is high (need to tune the threshold), and we will decrease the ring freq (use a 2x multiplier?) once we found the GT workload is low. The benefit to use the intel_ips is it can tell both the cpu & gpu busyness. However, we do need an accurate way to check at least the busyness of gpu for this issue. >>> On 1/3/2018 11:51 PM, Yaodong Li wrote: >>>> >>>>>>> You are thinking of plugging into intel_pstate to make it >>>>>>> smarter for ia freq transitions? >>>>> Yep. This seems a correct step to give some automatic support >>>>> instead of parameter/hardcoded multiplier. >>>>> >>>> Does this mean we should use cpufreq/intel_pstate based approach >>>> instead of the current modparam solution for Gen9? >>>> >>>> Some concerns and questions about intel_pstate approach: >>>> a) Currently, we cannot get the accurate pstate/target freq value >>>> from cpufreq in intel_pstate active mode since >>>> these values won't be exported to cpufreq layer, so if we >>>> won't change intel_pstate code then we only can get >>>> the max cpu freq of a new policy. >>>> b) intel_pstate policy is attached to each logic cpu, which means >>>> we will receive policy/freq transition notification >>>> for each logic cpu freq change. One question is how we are >>>> going to decide the freq of the ring? just use the max >>>> cpu freq reported? >>>> c) With the intel_pstate approach we may still run into thermal >>>> throttling, in this case, can a certain cooling device >>>> be triggered to lower the cpu freq? >>>> >>>> Thanks and Regards, >>>> -Jackie >>>> >>> >> >
On 1/6/2018 5:53 AM, Yaodong Li wrote: > On 01/05/2018 02:15 AM, Sagar Arun Kamble wrote: >> >> >> On 1/5/2018 3:22 AM, Yaodong Li wrote: >>> >>> On 01/03/2018 10:10 PM, Sagar Arun Kamble wrote: >>>> Since ring frequency programming needs consideration of both IA and >>>> GT frequency requests I think keeping the logic >>>> to program the ring frequency table in driver that monitors both >>>> IA/GT busyness and power budgets like intel_ips >>>> will be more appropriate. intel_ips is relying on global load >>>> derived from all CPUs. >>>> I understand that power awareness and busyness based policy might >>>> be trickier but having that as tunable will give better flexibility. >>>> >>> By just looking into the current code, the way intel_ips checks gpu >>> busyness cannot reflect the actual workload of GT >>> (e.g. gpu busy is true even if there's only one pending request), in >>> this case, we shall not increase the ring freq if we >>> want to use a "workload monitoring" based solution. so we need a >>> more accurate way to monitor the current GT workload >>> (e.g. when the pending request count reaches a center tunable >>> threshold??). >> Yes. May be we can share the PMU data about engine busyness with >> intel_ips. > Thank you Sagar! Can you tell more about how we can get the gpu > busyness from PMU data? > It seems to be not possible as of now since pmu is reporting accumulated busy time per engine without info about time they became active/idle, unless I am missing something. Also having pmu ON in release environment needs to be checked. Chris, could you please clarify on pmu usage here. Also thinking about below two options to get the GPU busyness: 1. Sharing i915 rps power zone (LOW_POWER, BETWEEN, HIGH_POWER) to intel_ips (kind of indicators of GPU load) 2. Sampling GT C0 counter to derive full GPU busyness. This though may not have been tested much so far across platforms. > I think the solution would be to set the ring freq table to use a ring > freq >= current ia freq (for all possible gpu freq) once we found > gpu workload is high (need to tune the threshold), and we will > decrease the ring freq (use a 2x multiplier?) once we found the GT > workload is low. The benefit to use the intel_ips is it can tell both > the cpu & gpu busyness. However, we do need an accurate way > to check at least the busyness of gpu for this issue. Agree. Thanks Sagar > >>>> On 1/3/2018 11:51 PM, Yaodong Li wrote: >>>>> >>>>>>>> You are thinking of plugging into intel_pstate to make it >>>>>>>> smarter for ia freq transitions? >>>>>> Yep. This seems a correct step to give some automatic support >>>>>> instead of parameter/hardcoded multiplier. >>>>>> >>>>> Does this mean we should use cpufreq/intel_pstate based approach >>>>> instead of the current modparam solution for Gen9? >>>>> >>>>> Some concerns and questions about intel_pstate approach: >>>>> a) Currently, we cannot get the accurate pstate/target freq value >>>>> from cpufreq in intel_pstate active mode since >>>>> these values won't be exported to cpufreq layer, so if we >>>>> won't change intel_pstate code then we only can get >>>>> the max cpu freq of a new policy. >>>>> b) intel_pstate policy is attached to each logic cpu, which means >>>>> we will receive policy/freq transition notification >>>>> for each logic cpu freq change. One question is how we are >>>>> going to decide the freq of the ring? just use the max >>>>> cpu freq reported? >>>>> c) With the intel_pstate approach we may still run into thermal >>>>> throttling, in this case, can a certain cooling device >>>>> be triggered to lower the cpu freq? >>>>> >>>>> Thanks and Regards, >>>>> -Jackie >>>>> >>>> >>> >> >
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c index 8dfea03..f3dd179 100644 --- a/drivers/gpu/drm/i915/i915_params.c +++ b/drivers/gpu/drm/i915/i915_params.c @@ -101,6 +101,10 @@ i915_param_named_unsafe(disable_power_well, int, 0400, "Disable display power wells when possible " "(-1=auto [default], 0=power wells always on, 1=power wells disabled when possible)"); +i915_param_named_unsafe(ring_multiplier, uint, 0400, + "Override Ring/GT frequency multiplier." + "(2=2x ring multiplier [defaut], 3=3x ring multiplier)"); + i915_param_named_unsafe(enable_ips, int, 0600, "Enable IPS (default: true)"); i915_param_named(fastboot, bool, 0600, diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h index 792ce26..6073d1c 100644 --- a/drivers/gpu/drm/i915/i915_params.h +++ b/drivers/gpu/drm/i915/i915_params.h @@ -43,6 +43,7 @@ param(int, enable_ppgtt, -1) \ param(int, enable_psr, -1) \ param(int, disable_power_well, -1) \ + param(unsigned int, ring_multiplier, 2) \ param(int, enable_ips, 1) \ param(int, invert_brightness, 0) \ param(int, enable_guc, 0) \ diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c index a349c4f..080051b 100644 --- a/drivers/gpu/drm/i915/intel_pm.c +++ b/drivers/gpu/drm/i915/intel_pm.c @@ -6818,6 +6818,34 @@ static void gen6_enable_rps(struct drm_i915_private *dev_priv) intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); } +static inline void sanitize_ring_multiplier(struct drm_i915_private *i915) +{ + unsigned int m = i915_modparams.ring_multiplier; + + /* Currently, only support 2x or 3x multipliers */ + if (m != 2 && m != 3) + i915_modparams.ring_multiplier = 2; +} + +static inline void get_ring_multiplier(struct drm_i915_private *i915, + unsigned int *numer, + unsigned int *denom) +{ + sanitize_ring_multiplier(i915); + + if (IS_GEN9(i915)) { + *numer = i915_modparams.ring_multiplier; + *denom = 1; + } else if IS_HASWELL(i915) { + *numer = 5; + *denom = 2; + } else { + /* Use a 2x ring multiplier by default */ + *numer = 2; + *denom = 1; + } +} + static void gen6_update_ring_freq(struct drm_i915_private *dev_priv) { struct intel_rps *rps = &dev_priv->gt_pm.rps; @@ -6825,6 +6853,8 @@ static void gen6_update_ring_freq(struct drm_i915_private *dev_priv) unsigned int gpu_freq; unsigned int max_ia_freq, min_ring_freq; unsigned int max_gpu_freq, min_gpu_freq; + unsigned int ring_mul_numer, ring_mul_denom; + unsigned int ring_request_freq; int scaling_factor = 180; struct cpufreq_policy *policy; @@ -6858,6 +6888,9 @@ static void gen6_update_ring_freq(struct drm_i915_private *dev_priv) max_gpu_freq = rps->max_freq; } + + get_ring_multiplier(dev_priv, &ring_mul_numer, &ring_mul_denom); + /* * For each potential GPU frequency, load a ring frequency we'd like * to use for memory access. We do this by specifying the IA frequency @@ -6867,18 +6900,24 @@ static void gen6_update_ring_freq(struct drm_i915_private *dev_priv) int diff = max_gpu_freq - gpu_freq; unsigned int ia_freq = 0, ring_freq = 0; + /* + * ring_request_freq = ring_multiplier * GPU frequency. + * Ring freq is in 100MHz units while GPU frequency is in 50MHz + * units. + */ + ring_request_freq = mult_frac(gpu_freq, ring_mul_numer, + 2 * ring_mul_denom); + if (IS_GEN9_BC(dev_priv) || IS_CANNONLAKE(dev_priv)) { /* - * ring_freq = 2 * GT. ring_freq is in 100MHz units - * No floor required for ring frequency on SKL. + * ring_freq = ring request freq. No floor required for + * ring frequency on SKL. */ - ring_freq = gpu_freq; - } else if (INTEL_INFO(dev_priv)->gen >= 8) { - /* max(2 * GT, DDR). NB: GT is 50MHz units */ - ring_freq = max(min_ring_freq, gpu_freq); - } else if (IS_HASWELL(dev_priv)) { - ring_freq = mult_frac(gpu_freq, 5, 4); - ring_freq = max(min_ring_freq, ring_freq); + ring_freq = ring_request_freq; + } else if (INTEL_INFO(dev_priv)->gen >= 8 || + IS_HASWELL(dev_priv)) { + /* max(ring request freq, DDR). NB: GT is 50MHz units */ + ring_freq = max(min_ring_freq, ring_request_freq); /* leave ia_freq as the default, chosen by cpufreq */ } else { /* On older processors, there is no separate ring