Message ID | 20240327202740.3075378-1-swboyd@chromium.org (mailing list archive) |
---|---|
Headers | show |
Series | Fix a black screen on sc7180 Trogdor devices | expand |
Hi, On Wed, Mar 27, 2024 at 1:27 PM Stephen Boyd <swboyd@chromium.org> wrote: > > This patch series fixes a black screen seen at boot on Trogdor devices. > The details of that problem are in the second patch, but the TL;DR is > that shared RCGs report the wrong parent to the clk framework and shared > RCGs never get turned off if they're left force enabled out of boot, > wedging the display GDSC causing the display bridge to fail to probe > because it can't turn on DSI. > > The first patch is basically a hack. It avoids a problem where the mdss > driver probes, turns on and off the mdp clk, and hangs the rotator clk > because the rotator clk is using the same parent. We don't care about > this case on sc7180, so we simply disable the clk at driver probe so it > can't get stuck on. > > The second patch fixes the shared RCG implementation so that the parent > is properly reported and so that the force enable bit is cleared. Fixing > the parent causes the problem that the first patch is avoiding, which is > why that patch is first. Just applying this second patch will make it so > that disabling the mdp clk disables the display pll that the rotator clk > is also using, causing the rotator clk to be stuck on. > > This problem comes about because of a combination of issues. The clk > framework doesn't handle the case where two clks share the same parent > and are enabled at boot. The first clk to enable and disable will turn > off the parent. The mdss driver doesn't stay out of runtime suspend > while populating the child devices. In fact, of_platform_populate() > triggers runtime resume and suspend of the mdss device multiple times > while devices are being added. These patches side-step the larger issues > here with the goal of fixing Trogdor in the short-term. Long-term we > need to fix the clk framework and display driver so that shared parents > aren't disabled during boot and so that mdss can't runtime suspend until > the display pipeline/card is fully formed. > > Stephen Boyd (2): > clk: qcom: dispcc-sc7180: Force off rotator clk at probe > clk: qcom: Properly initialize shared RCGs upon registration > > drivers/clk/qcom/clk-rcg2.c | 19 +++++++++++++++++++ > drivers/clk/qcom/dispcc-sc7180.c | 14 ++++++++++++++ > 2 files changed, 33 insertions(+) I spent a bunch of time discussing this offline with Stephen and I'll try to summarize. Hopefully this isn't too much nonsense... 1. We'll likely land the patches downstream in ChromeOS for now while we're figuring things out since we're seeing actual breakages. Whether to land upstream is a question. The first patch is a bit of a hack but unlikely to cause any real problems. The second patch seems correct but it also feels like it's going to cause stuck clocks for a pile of other SoCs because we're not adding hacks similar to the sc7180 hack for all the other SoCs. I guess we could hope we get lucky or play whack-a-mole? ...or we try to find a more generic solution... Dunno what others think. 2. One thing we talked about would be adding something like `CLK_OPS_PARENT_ENABLE_FOR_DISABLE` for all the RCGs. That should get rid of the actual error we're seeing. Essentially what we're seeing today is: * RCG1 is parented on a PLL * RCG2 is parented on the same PLL * RCG1, RCG2, and the PLL are left enabled by the BIOS When we enable/disable RCG1 then the PLL turns off (since we propagate our disable to our parent and nothing else is keeping the PLL on). At this time RCG2 is implicitly off (it's not producing a clock) since its parent (the PLL) is off. ...but the hardware "force enable" bit for RCG2 is still set to on and the kernel still thinks the child of this clock is on. If, after this, we "disabled" a branch clock child of RCG2 (because it's unused and we get to the part of the kernel that disables unused clocks) then we were getting the error since you can't change the hardware "enable" bit for a branch clock if its parent is not clocking. ...so `CLK_OPS_PARENT_ENABLE_FOR_DISABLE` would fix this specific case because we'd turn the PLL on for the duration of the "disable" call. ...but there is still the concern here that there will be a period of time that the RCG2 child is "stuck" even if we're not paying attention to it. This would be the period of time between when we turned the PLL off and we got around to disabling the RCG2 child because it was unused. Stephen thought that perhaps something else in hardware (perhaps a GDSC) might notice that the RCG2 child's hardware "enable" bit was set and would assume that it was clocking. Then that other bit of hardware would be unhappy because no clock came out. This concern made us think that perhaps `CLK_OPS_PARENT_ENABLE_FOR_DISABLE` isn't the right way to go. 3. Another idea was essentially to implement the long talked about idea of "handoff". Essentially check the state of the clocks at boot and if they're enabled then propagate the enable to our parents. If we implement this then it would solve the problem because RCG1 and RCG2 would add an implicit request for the PLL to be on. If we enable/disable RCG1 at boot then the PLL will still stay on because there's a request from RCG2. If/when we disable children of RCG2 then the PLL will lose its request but that's fine. In no cases will we have a hardware "enable" bit set without the parent. This seems solid and probably the right solution. Stephen was a bit concerned, though, because you don't know when all your children have been registered. If a child shows up after "disable unused" runs then you can get back into the situation again. That probably isn't concern for the immediate problem at hand because all the clocks involved show up in early boot, but it means that the problem is half solved and that's not super satisfying. To solve this, we need to overall solve the "disable_unused" problem to be at the right time, after everything has had a chance to probe. To me this feels a bit like the "sync state" problem with all the baggage involved there. Specifically (like sync state) I think it would have to be heavily based on analysis of the device tree and links and that has the standard problem where you never enter sync state when DT has a node "enabled" but you didn't happen to enable the relevant driver in your kernel config. ...though I guess we've "sorta" solved that with the timeout. NOTE: if I understand correctly this would only be possible if drivers are using "struct clk_parent_data" and not if they're just using global names to match clocks. I'll also note that in general I believe Stephen isn't a fan of "disable_unused", but (my opinion) is that it's important to have something in the kernel that disables unused clocks when everything is done loading. Without this we add a lot of complexity to the firmware to turn off all the clocks that the SoC might have had on by default and that's non-ideal. -Doug
Quoting Doug Anderson (2024-03-28 09:39:54) > > I spent a bunch of time discussing this offline with Stephen and I'll > try to summarize. Hopefully this isn't too much nonsense... > > 1. We'll likely land the patches downstream in ChromeOS for now while > we're figuring things out since we're seeing actual breakages. Whether > to land upstream is a question. The first patch is a bit of a hack but > unlikely to cause any real problems. The second patch seems correct > but it also feels like it's going to cause stuck clocks for a pile of > other SoCs because we're not adding hacks similar to the sc7180 hack > for all the other SoCs. I guess we could hope we get lucky or play > whack-a-mole? ...or we try to find a more generic solution... Dunno > what others think. I think we should hope to get lucky or play whack-a-mole and merge something like this series. If we have to we can similarly turn off RCGs or branches during driver probe that are using shared parents like we have on sc7180. Put simply, the shared RCG implementation is broken because it reports the wrong parent for clk_ops::get_parent() and doesn't clear the force enable bit. With the current code we're switching the parent to XO when the clk is enabled the first time. That's an obvious bug that we should fix regardless of implementing proper clk handoff. We haven't implemented handoff in over a decade. Blocking this bug fix on implementing handoff isn't practical. Furthermore, we're relying on clk consumers to clear that force enable bit by enabling the clk once. That doesn't make any sense, although we could use that force enable bit to consider the RCG as enabled for clk_disable_unused. An alternative approach to this series would be to force all shared RCGs to be parented to XO at clk registration time, and continue to clear that RCG force enable bit. That's sort of what Dmitry was going for earlier. Doing this would break anything that's relying on the clks staying enabled at some frequency through boot, but that isn't supported anyway because clk handoff isn't implemented. It avoids the problem that the first patch is for too because XO doesn't turn off causing a clk to get stuck on. I can certainly craft this patch up if folks think that's better. To ease the transition we can make a new clk_ops for the RCG as well so that each SoC has to opt-in to use this behavior. Then we can be certain that other platforms aren't affected without being tested first. I'd prefer to not do that though because I fear we'll be leaving drivers in the broken state for some time.
Hi, On Tue, Apr 30, 2024 at 5:17 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2024-03-28 09:39:54) > > > > I spent a bunch of time discussing this offline with Stephen and I'll > > try to summarize. Hopefully this isn't too much nonsense... > > > > 1. We'll likely land the patches downstream in ChromeOS for now while > > we're figuring things out since we're seeing actual breakages. Whether > > to land upstream is a question. The first patch is a bit of a hack but > > unlikely to cause any real problems. The second patch seems correct > > but it also feels like it's going to cause stuck clocks for a pile of > > other SoCs because we're not adding hacks similar to the sc7180 hack > > for all the other SoCs. I guess we could hope we get lucky or play > > whack-a-mole? ...or we try to find a more generic solution... Dunno > > what others think. > > I think we should hope to get lucky or play whack-a-mole and merge > something like this series. If we have to we can similarly turn off RCGs > or branches during driver probe that are using shared parents like we > have on sc7180. This is OK w/ me, but of course I'm super biased since the only Qualcomm platform I'm involved in is sc7180 Chromebooks and that's handled by your series. If it helps, I suppose you could add: Reviewed-by: Douglas Anderson <dianders@chromium.org> IMO it would be good to get Bjorn or Dmitry to buy in and maybe post a PSA and/or request for testing to a few IRC channels where folks hang out (#linux-msm, #freedreno and #aarch64-laptops, maybe?) -Doug
On Wed, 1 May 2024 at 03:17, Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2024-03-28 09:39:54) > > > > I spent a bunch of time discussing this offline with Stephen and I'll > > try to summarize. Hopefully this isn't too much nonsense... > > > > 1. We'll likely land the patches downstream in ChromeOS for now while > > we're figuring things out since we're seeing actual breakages. Whether > > to land upstream is a question. The first patch is a bit of a hack but > > unlikely to cause any real problems. The second patch seems correct > > but it also feels like it's going to cause stuck clocks for a pile of > > other SoCs because we're not adding hacks similar to the sc7180 hack > > for all the other SoCs. I guess we could hope we get lucky or play > > whack-a-mole? ...or we try to find a more generic solution... Dunno > > what others think. > > I think we should hope to get lucky or play whack-a-mole and merge > something like this series. If we have to we can similarly turn off RCGs > or branches during driver probe that are using shared parents like we > have on sc7180. > > Put simply, the shared RCG implementation is broken because it reports > the wrong parent for clk_ops::get_parent() and doesn't clear the force > enable bit. With the current code we're switching the parent to XO when > the clk is enabled the first time. That's an obvious bug that we should > fix regardless of implementing proper clk handoff. We haven't > implemented handoff in over a decade. Blocking this bug fix on > implementing handoff isn't practical. Yes, that needs to be fixed. My approach was to drop the XO parent and consider the clock to be off if it is fed by the XO. > Furthermore, we're relying on clk > consumers to clear that force enable bit by enabling the clk once. That > doesn't make any sense, although we could use that force enable bit to > consider the RCG as enabled for clk_disable_unused. That patch seems fine to me. Will it work if we take the force-enable bit into account when considering the clock to be on or off? > > An alternative approach to this series would be to force all shared RCGs > to be parented to XO at clk registration time, and continue to clear > that RCG force enable bit. That's sort of what Dmitry was going for > earlier. Doing this would break anything that's relying on the clks > staying enabled at some frequency through boot, but that isn't supported > anyway because clk handoff isn't implemented. It avoids the problem that > the first patch is for too because XO doesn't turn off causing a clk to > get stuck on. I can certainly craft this patch up if folks think that's > better. I think this approach makes sense too (and might be preferable to boot-time hacks). On most of the platforms we are already resetting the MDSS as soon as the mdss (root device) is being probed. Then the display is going to be broken until DPU collects all the coonectors and outpus and finally creates the DRM device. But I think we should fix the get_parent() too irrespectively of this. > To ease the transition we can make a new clk_ops for the RCG as well so > that each SoC has to opt-in to use this behavior. Then we can be certain > that other platforms aren't affected without being tested first. I'd > prefer to not do that though because I fear we'll be leaving drivers in > the broken state for some time. SGTM -- With best wishes Dmitry
Quoting Dmitry Baryshkov (2024-05-01 11:11:14) > On Wed, 1 May 2024 at 03:17, Stephen Boyd <swboyd@chromium.org> wrote: > > > > Quoting Doug Anderson (2024-03-28 09:39:54) > > > > > > I spent a bunch of time discussing this offline with Stephen and I'll > > > try to summarize. Hopefully this isn't too much nonsense... > > > > > > 1. We'll likely land the patches downstream in ChromeOS for now while > > > we're figuring things out since we're seeing actual breakages. Whether > > > to land upstream is a question. The first patch is a bit of a hack but > > > unlikely to cause any real problems. The second patch seems correct > > > but it also feels like it's going to cause stuck clocks for a pile of > > > other SoCs because we're not adding hacks similar to the sc7180 hack > > > for all the other SoCs. I guess we could hope we get lucky or play > > > whack-a-mole? ...or we try to find a more generic solution... Dunno > > > what others think. > > > > I think we should hope to get lucky or play whack-a-mole and merge > > something like this series. If we have to we can similarly turn off RCGs > > or branches during driver probe that are using shared parents like we > > have on sc7180. > > > > Put simply, the shared RCG implementation is broken because it reports > > the wrong parent for clk_ops::get_parent() and doesn't clear the force > > enable bit. With the current code we're switching the parent to XO when > > the clk is enabled the first time. That's an obvious bug that we should > > fix regardless of implementing proper clk handoff. We haven't > > implemented handoff in over a decade. Blocking this bug fix on > > implementing handoff isn't practical. > > Yes, that needs to be fixed. My approach was to drop the XO parent and > consider the clock to be off if it is fed by the XO. > > > Furthermore, we're relying on clk > > consumers to clear that force enable bit by enabling the clk once. That > > doesn't make any sense, although we could use that force enable bit to > > consider the RCG as enabled for clk_disable_unused. > > That patch seems fine to me. Will it work if we take the force-enable > bit into account when considering the clock to be on or off? What is "that patch"? It would work to consider the rcg clk as on or off. During rcg clk registration if a branch child is found to be enabled we would go and write the force enable bit in the parent rcg. And similarly we would modify the rcg clk_ops to set that bit whenever the clk is enabled and clear it whenever it is disabled. This avoids the feedback mechanism from confusing us about the enable state of the rcg, at the cost of writing the register. It wouldn't fix the problem that we have on Trogdor though. That's because the display GDSC is turned off when the rotator clk is enabled and parented to the display PLL, which is also turned off. We have to either turn off the rotator clk, or switch the parent to something that is always on, XO, or keep the display GDSC on until the rotator clk can be turned off. On other SoCs we may need to turn off even more clks depending on which ones the display GDSC is controlling. > > > > > An alternative approach to this series would be to force all shared RCGs > > to be parented to XO at clk registration time, and continue to clear > > that RCG force enable bit. That's sort of what Dmitry was going for > > earlier. Doing this would break anything that's relying on the clks > > staying enabled at some frequency through boot, but that isn't supported > > anyway because clk handoff isn't implemented. It avoids the problem that > > the first patch is for too because XO doesn't turn off causing a clk to > > get stuck on. I can certainly craft this patch up if folks think that's > > better. > > I think this approach makes sense too (and might be preferable to > boot-time hacks). > On most of the platforms we are already resetting the MDSS as soon as > the mdss (root device) is being probed. Then the display is going to > be broken until DPU collects all the coonectors and outpus and finally > creates the DRM device. Ok. I'm leaning towards this approach now because it seems like keeping the clk at whatever it was set at boot isn't useful. If it becomes useful at some point we can implement a better handoff mechanism. I have some idea how to do that by using the 'clocks' DT property to find child clks that haven't probed yet. I'll test out this alternate approach to park shared clks at probe and send the patch. > > But I think we should fix the get_parent() too irrespectively of this. > Sure, but it becomes sorta moot if we force the parent to be XO.