drm/msm/dp: move link_ready out of HPD event thread

Message ID	20240308214532.1404038-1-quic_abhinavk@quicinc.com (mailing list archive)
State	Not Applicable
Headers	show Received: from mx0b-0031df01.pphosted.com (mx0b-0031df01.pphosted.com [205.220.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0766454BC8; Fri, 8 Mar 2024 21:45:58 +0000 (UTC) From: Abhinav Kumar <quic_abhinavk@quicinc.com> To: <freedreno@lists.freedesktop.org>, Rob Clark <robdclark@gmail.com>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Sean Paul <sean@poorly.run>, Marijn Suijten <marijn.suijten@somainline.org>, David Airlie <airlied@gmail.com>, "Daniel Vetter" <daniel@ffwll.ch>, Kuogee Hsieh <quic_khsieh@quicinc.com> CC: <dri-devel@lists.freedesktop.org>, <swboyd@chromium.org>, <quic_jesszhan@quicinc.com>, <quic_parellan@quicinc.com>, <johan@kernel.org>, <quic_bjorande@quicinc.com>, Rob Clark <robdclark@chromium.org>, <linux-arm-msm@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [PATCH] drm/msm/dp: move link_ready out of HPD event thread Date: Fri, 8 Mar 2024 13:45:32 -0800 Message-ID: <20240308214532.1404038-1-quic_abhinavk@quicinc.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	drm/msm/dp: move link_ready out of HPD event thread \| expand drm/msm/dp: move link_ready out of HPD event thread

Abhinav Kumar March 8, 2024, 9:45 p.m. UTC

There are cases where the userspace might still send another
frame after the HPD disconnect causing a modeset cycle after
a disconnect. This messes the internal state machine of MSM DP driver
and can lead to a crash as there can be an imbalance between
bridge_disable() and bridge_enable().

This was also previously reported on [1] for which [2] was posted
and helped resolve the issue by rejecting commits if the DP is not
in connected state.

The change resolved the bug but there can also be another race condition.
If hpd_event_thread does not pick up the EV_USER_NOTIFICATION and process it
link_ready will also not be set to false allowing the frame to sneak in.

Lets move setting link_ready outside of hpd_event_thread() processing to
eliminate a window of race condition.

[1] : https://gitlab.freedesktop.org/drm/msm/-/issues/17
[2] : https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khsieh@quicinc.com/

Fixes: 8a3b4c17f863 ("drm/msm/dp: employ bridge mechanism for display enable and disable")
Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
---
 drivers/gpu/drm/msm/dp/dp_display.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Kuogee Hsieh March 11, 2024, 5:57 p.m. UTC | #1

On 3/8/2024 1:45 PM, Abhinav Kumar wrote:
> There are cases where the userspace might still send another
> frame after the HPD disconnect causing a modeset cycle after
> a disconnect. This messes the internal state machine of MSM DP driver
> and can lead to a crash as there can be an imbalance between
> bridge_disable() and bridge_enable().
>
> This was also previously reported on [1] for which [2] was posted
> and helped resolve the issue by rejecting commits if the DP is not
> in connected state.
>
> The change resolved the bug but there can also be another race condition.
> If hpd_event_thread does not pick up the EV_USER_NOTIFICATION and process it
> link_ready will also not be set to false allowing the frame to sneak in.
>
> Lets move setting link_ready outside of hpd_event_thread() processing to
> eliminate a window of race condition.
>
> [1] : https://gitlab.freedesktop.org/drm/msm/-/issues/17
> [2] : https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khsieh@quicinc.com/
>
> Fixes: 8a3b4c17f863 ("drm/msm/dp: employ bridge mechanism for display enable and disable")
> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Tested-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
> Reviewed-by: Kuogee Hsieh<quic_khsieh@quicinc.com>
> ---
>   drivers/gpu/drm/msm/dp/dp_display.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/dp/dp_display.c b/drivers/gpu/drm/msm/dp/dp_display.c
> index 068d44eeaa07..e00092904ccc 100644
> --- a/drivers/gpu/drm/msm/dp/dp_display.c
> +++ b/drivers/gpu/drm/msm/dp/dp_display.c
> @@ -345,8 +345,6 @@ static int dp_display_send_hpd_notification(struct dp_display_private *dp,
>   							 dp->panel->downstream_ports);
>   	}
>   
> -	dp->dp_display.link_ready = hpd;
> -
>   	drm_dbg_dp(dp->drm_dev, "type=%d hpd=%d\n",
>   			dp->dp_display.connector_type, hpd);
>   	drm_bridge_hpd_notify(bridge, dp->dp_display.link_ready);
> @@ -399,6 +397,8 @@ static int dp_display_process_hpd_high(struct dp_display_private *dp)
>   		goto end;
>   	}
>   
> +	dp->dp_display.link_ready = true;
> +
>   	dp_add_event(dp, EV_USER_NOTIFICATION, true, 0);
>   
>   end:
> @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device *dev)
>   {
>   	struct dp_display_private *dp = dev_get_dp_display_private(dev);
>   
> +	dp->dp_display.link_ready = false;
> +
>   	dp_add_event(dp, EV_USER_NOTIFICATION, false, 0);
>   
>   	return 0;
> @@ -487,6 +489,7 @@ static int dp_display_handle_port_status_changed(struct dp_display_private *dp)
>   		drm_dbg_dp(dp->drm_dev, "sink count is zero, nothing to do\n");
>   		if (dp->hpd_state != ST_DISCONNECTED) {
>   			dp->hpd_state = ST_DISCONNECT_PENDING;
> +			dp->dp_display.link_ready = false;
>   			dp_add_event(dp, EV_USER_NOTIFICATION, false, 0);
>   		}
>   	} else {

Johan Hovold March 12, 2024, 10:09 a.m. UTC | #2

On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:
> There are cases where the userspace might still send another
> frame after the HPD disconnect causing a modeset cycle after
> a disconnect. This messes the internal state machine of MSM DP driver
> and can lead to a crash as there can be an imbalance between
> bridge_disable() and bridge_enable().

Can you be more specific here? What steps would lead to this issue and
how exactly does is mess with the state machine? Is there an easy way
to reproduce it (e.g. by instrumenting the code with some sleep)?

The hotplug code is really convoluted and having a clear description of
the problem is needed to evaluate the patch (including when revisiting
it some time from now when I've forgotten about how this state machine
works).

As you know, we ran into a related issue on sc8280xp (X13s) since
6.8-rc1, but that did not involve any user space interaction at all.

For reference, there are some more details in this thread:

	https://lore.kernel.org/all/Ze8Ke_M2xHyPYCu-@hovoldconsulting.com/

> This was also previously reported on [1] for which [2] was posted
> and helped resolve the issue by rejecting commits if the DP is not
> in connected state.
> 
> The change resolved the bug but there can also be another race condition.
> If hpd_event_thread does not pick up the EV_USER_NOTIFICATION and process it
> link_ready will also not be set to false allowing the frame to sneak in.

How could the event thread fail to pick up the notification event? Or do
you mean there's a race window before it has been processed?

> Lets move setting link_ready outside of hpd_event_thread() processing to
> eliminate a window of race condition.

As we discussed in thread above, this patch does not eliminate the race,
even if it may reduce the race window.

> [1] : https://gitlab.freedesktop.org/drm/msm/-/issues/17
> [2] : https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khsieh@quicinc.com/
> 
> Fixes: 8a3b4c17f863 ("drm/msm/dp: employ bridge mechanism for display enable and disable")
> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>

> @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device *dev)
>  {
>  	struct dp_display_private *dp = dev_get_dp_display_private(dev);
>  
> +	dp->dp_display.link_ready = false;

As I also pointed out in the other thread, setting link_ready to false
here means that any spurious connect event (during physical disconnect)
will always be processed, something which can currently lead to a leaked
runtime pm reference.

Wasting some power is of course preferred over crashing the machine, but
please take it into consideration anyway.

Especially if your intention with this patch was to address the resets
we saw with sc8280xp which are gone since the HPD notify revert (which
fixed the hotplug detect issue that left the bridge in a
half-initialised state).

> +
>  	dp_add_event(dp, EV_USER_NOTIFICATION, false, 0);
>  
>  	return 0;

Johan

Johan Hovold March 12, 2024, 4:41 p.m. UTC | #3

On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote:
> On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:

> > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device *dev)
> >  {
> >  	struct dp_display_private *dp = dev_get_dp_display_private(dev);
> >  
> > +	dp->dp_display.link_ready = false;
> 
> As I also pointed out in the other thread, setting link_ready to false
> here means that any spurious connect event (during physical disconnect)
> will always be processed, something which can currently lead to a leaked
> runtime pm reference.
> 
> Wasting some power is of course preferred over crashing the machine, but
> please take it into consideration anyway.
> 
> Especially if your intention with this patch was to address the resets
> we saw with sc8280xp which are gone since the HPD notify revert (which
> fixed the hotplug detect issue that left the bridge in a
> half-initialised state).

Heh. This is getting ridiculous. I just tried running with this patch
and it again breaks hotplug detect in a VT console and in X (where I
could enable a reconnected external display by running xrandr twice
before).

So, please, do not apply this one.

Johan

Johan Hovold March 12, 2024, 4:59 p.m. UTC | #4

On Tue, Mar 12, 2024 at 05:41:23PM +0100, Johan Hovold wrote:
> On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote:
> > On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:
> 
> > > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device *dev)
> > >  {
> > >  	struct dp_display_private *dp = dev_get_dp_display_private(dev);
> > >  
> > > +	dp->dp_display.link_ready = false;
> > 
> > As I also pointed out in the other thread, setting link_ready to false
> > here means that any spurious connect event (during physical disconnect)
> > will always be processed, something which can currently lead to a leaked
> > runtime pm reference.
> > 
> > Wasting some power is of course preferred over crashing the machine, but
> > please take it into consideration anyway.
> > 
> > Especially if your intention with this patch was to address the resets
> > we saw with sc8280xp which are gone since the HPD notify revert (which
> > fixed the hotplug detect issue that left the bridge in a
> > half-initialised state).
> 
> Heh. This is getting ridiculous. I just tried running with this patch
> and it again breaks hotplug detect in a VT console and in X (where I
> could enable a reconnected external display by running xrandr twice
> before).
> 
> So, please, do not apply this one.

To make things worse, I indeed also hit the reset when disconnecting
after such a failed hotplug.

Johan

Abhinav Kumar March 12, 2024, 5:39 p.m. UTC | #5

On 3/12/2024 9:59 AM, Johan Hovold wrote:
> On Tue, Mar 12, 2024 at 05:41:23PM +0100, Johan Hovold wrote:
>> On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote:
>>> On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:
>>
>>>> @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device *dev)
>>>>   {
>>>>   	struct dp_display_private *dp = dev_get_dp_display_private(dev);
>>>>   
>>>> +	dp->dp_display.link_ready = false;
>>>
>>> As I also pointed out in the other thread, setting link_ready to false
>>> here means that any spurious connect event (during physical disconnect)
>>> will always be processed, something which can currently lead to a leaked
>>> runtime pm reference.
>>>
>>> Wasting some power is of course preferred over crashing the machine, but
>>> please take it into consideration anyway.
>>>
>>> Especially if your intention with this patch was to address the resets
>>> we saw with sc8280xp which are gone since the HPD notify revert (which
>>> fixed the hotplug detect issue that left the bridge in a
>>> half-initialised state).
>>
>> Heh. This is getting ridiculous. I just tried running with this patch
>> and it again breaks hotplug detect in a VT console and in X (where I
>> could enable a reconnected external display by running xrandr twice
>> before).
>>
>> So, please, do not apply this one.
> 
> To make things worse, I indeed also hit the reset when disconnecting
> after such a failed hotplug.
> 
> Johan

Ack, I will hold off till I analyze your issues more which you have 
listed in separate replies. Especially about the spurious connect, I 
believe you are trying to mention that, by adding logs, you are able to 
delay the processing of a connect event to *make* it like a spurious 
one? In case, I got this part wrong, can you pls explain the spurious 
connect scenario again?

A short response on why this change was made is that commit can be 
issued by userspace or the fbdev client. So userspace involvement only 
makes commit happen from a different path. It would be incorrect to 
assume the issues from the earlier bug and the current one are different 
only because there was userspace involvement in that one and not this.

Because in the end, it manifests itself in the same way that 
atomic_enable() did not go through after an atomic_disable() and the 
next atomic_disable() crashes.

Reverting the hpd_notify patch only eliminated some paths but I think 
both you and me agree the issue can still happen and in the very same 
way. So till someone else reports this issue, till HPD is reworked, I 
wanted to do whats possible to avoid this situation. If users are fine 
with what we have, I have no inclination to push this.

Johan Hovold March 13, 2024, 8:18 a.m. UTC | #6

On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote:
> On 3/12/2024 9:59 AM, Johan Hovold wrote:

> >> Heh. This is getting ridiculous. I just tried running with this patch
> >> and it again breaks hotplug detect in a VT console and in X (where I
> >> could enable a reconnected external display by running xrandr twice
> >> before).
> >>
> >> So, please, do not apply this one.
> > 
> > To make things worse, I indeed also hit the reset when disconnecting
> > after such a failed hotplug.

> Ack, I will hold off till I analyze your issues more which you have 
> listed in separate replies. Especially about the spurious connect, I 
> believe you are trying to mention that, by adding logs, you are able to 
> delay the processing of a connect event to *make* it like a spurious 
> one? In case, I got this part wrong, can you pls explain the spurious 
> connect scenario again?

No, I only mentioned the debug printks in passing as instrumentation
like that may affect race conditions (but I'm also hitting the resets
also with no printks in place).

The spurious connect event comes directly from the pmic firmware, and
even if we may optimise things by implementing some kind of debounce,
the hotplug implementation needs to be robust enough to not kill the
machine if such an event gets through.

Basically what I see is that during physical disconnect there can be
multiple hpd notify events (e.g. connect, disconnect, connect):

[  146.910195] usb 5-1: USB disconnect, device number 4
[  146.931026] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
[  146.934785] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[  146.938114] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
[  146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[  146.955193] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2

And it is the spurious connect event while the link is being tore down
that triggers the hotplug processing that leads to the reset.

Similarly, I've seen spurious disconnect events while the plug in being
inserted.

> A short response on why this change was made is that commit can be 
> issued by userspace or the fbdev client. So userspace involvement only 
> makes commit happen from a different path. It would be incorrect to 
> assume the issues from the earlier bug and the current one are different 
> only because there was userspace involvement in that one and not this.
>
> Because in the end, it manifests itself in the same way that 
> atomic_enable() did not go through after an atomic_disable() and the 
> next atomic_disable() crashes.

Right, but your proposed fix would not actually fix anything and judging
from the sparse commit message and diff itself it is clearly only meant
to mitigate the case where user space is involved, which is *not* the
case here.

Johan

Abhinav Kumar March 13, 2024, 5:24 p.m. UTC | #7

On 3/13/2024 1:18 AM, Johan Hovold wrote:
> On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote:
>> On 3/12/2024 9:59 AM, Johan Hovold wrote:
> 
>>>> Heh. This is getting ridiculous. I just tried running with this patch
>>>> and it again breaks hotplug detect in a VT console and in X (where I
>>>> could enable a reconnected external display by running xrandr twice
>>>> before).
>>>>
>>>> So, please, do not apply this one.
>>>
>>> To make things worse, I indeed also hit the reset when disconnecting
>>> after such a failed hotplug.
> 
>> Ack, I will hold off till I analyze your issues more which you have
>> listed in separate replies. Especially about the spurious connect, I
>> believe you are trying to mention that, by adding logs, you are able to
>> delay the processing of a connect event to *make* it like a spurious
>> one? In case, I got this part wrong, can you pls explain the spurious
>> connect scenario again?
> 
> No, I only mentioned the debug printks in passing as instrumentation
> like that may affect race conditions (but I'm also hitting the resets
> also with no printks in place).
> 
> The spurious connect event comes directly from the pmic firmware, and
> even if we may optimise things by implementing some kind of debounce,
> the hotplug implementation needs to be robust enough to not kill the
> machine if such an event gets through.
> 
> Basically what I see is that during physical disconnect there can be
> multiple hpd notify events (e.g. connect, disconnect, connect):
> 
> [  146.910195] usb 5-1: USB disconnect, device number 4
> [  146.931026] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> [  146.934785] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> [  146.938114] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> [  146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> [  146.955193] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
> 
> And it is the spurious connect event while the link is being tore down
> that triggers the hotplug processing that leads to the reset.
> 
> Similarly, I've seen spurious disconnect events while the plug in being
> inserted.
> 

This is quite weird and also explains why most of the issues were seen 
only with sc8280xp. pmic spurious events are busting the hpd logic.

Agreed, that DP driver should be robust enough to handle this but this 
will also bust usermode to send down unnecessary frames. Someone should 
address why these spurious events are coming.

>> A short response on why this change was made is that commit can be
>> issued by userspace or the fbdev client. So userspace involvement only
>> makes commit happen from a different path. It would be incorrect to
>> assume the issues from the earlier bug and the current one are different
>> only because there was userspace involvement in that one and not this.
>>
>> Because in the end, it manifests itself in the same way that
>> atomic_enable() did not go through after an atomic_disable() and the
>> next atomic_disable() crashes.
> 
> Right, but your proposed fix would not actually fix anything and judging
> from the sparse commit message and diff itself it is clearly only meant
> to mitigate the case where user space is involved, which is *not* the
> case here.
> 

No, I think there is some disconnect in the way you are reading that 
patch perhaps due to some missing details OR I am missing your point.

Like I said, drm_atomic_commit() can be issued by userspace or the fbdev 
client in the driver. Thats the only userspace involvement.

Now, why the patch was made or was expected to work.

There can be a race condition between the time the DP driver gets the 
hpd disconnect event and when the hpd thread processes that event 
allowing the commit to sneak in. This is something which has always been 
there even without pm_runtime series and remains even today.

In this race condition, the setting of "link_ready" to false can be a 
bit delayed if we go through the HPD event processing increasing the 
race condition window.

If link_ready is false, atomic_check() fails, thereby failing any 
commits and hence not allowing the atomic_disable() / atomic_enable() 
cycle and hence avoiding this reset.

The patch is moving the setting of link_ready to false earlier by not 
putting it through the HPD event thread and hence trying to reduce the 
window of the issue.

> Johan

Johan Hovold March 14, 2024, 3:38 p.m. UTC | #8

On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:
> On 3/13/2024 1:18 AM, Johan Hovold wrote:

> > Right, but your proposed fix would not actually fix anything and judging
> > from the sparse commit message and diff itself it is clearly only meant
> > to mitigate the case where user space is involved, which is *not* the
> > case here.

> There can be a race condition between the time the DP driver gets the 
> hpd disconnect event and when the hpd thread processes that event 
> allowing the commit to sneak in. This is something which has always been 
> there even without pm_runtime series and remains even today.
> 
> In this race condition, the setting of "link_ready" to false can be a 
> bit delayed if we go through the HPD event processing increasing the 
> race condition window.
> 
> If link_ready is false, atomic_check() fails, thereby failing any 
> commits and hence not allowing the atomic_disable() / atomic_enable() 
> cycle and hence avoiding this reset.
> 
> The patch is moving the setting of link_ready to false earlier by not 
> putting it through the HPD event thread and hence trying to reduce the 
> window of the issue.

Perhaps I'm missing something in the race that you are trying to
describe (and which I've asked you to describe in more detail so that I
don't have to spend more time trying to come up with a reproducer
myself).

I do understand how your patch works, but my point is that it does
not fix the race that we are hitting on sc8280xp and, unless I'm missing
something, it is not even sufficient to fix the race you are talking
about as user space can still trigger that ioctl() before you clear the
link_ready flag.

That's why I said that it is only papering over the issue by making the
race window smaller (and this should also be highlighted in the commit
message).

For some reason it also made things worse on sc8280xp, but I did not
spend time on tracking down exactly why.

Johan

Abhinav Kumar March 14, 2024, 4:30 p.m. UTC | #9

On 3/14/2024 8:38 AM, Johan Hovold wrote:
> On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:
>> On 3/13/2024 1:18 AM, Johan Hovold wrote:
> 
>>> Right, but your proposed fix would not actually fix anything and judging
>>> from the sparse commit message and diff itself it is clearly only meant
>>> to mitigate the case where user space is involved, which is *not* the
>>> case here.
> 
>> There can be a race condition between the time the DP driver gets the
>> hpd disconnect event and when the hpd thread processes that event
>> allowing the commit to sneak in. This is something which has always been
>> there even without pm_runtime series and remains even today.
>>
>> In this race condition, the setting of "link_ready" to false can be a
>> bit delayed if we go through the HPD event processing increasing the
>> race condition window.
>>
>> If link_ready is false, atomic_check() fails, thereby failing any
>> commits and hence not allowing the atomic_disable() / atomic_enable()
>> cycle and hence avoiding this reset.
>>
>> The patch is moving the setting of link_ready to false earlier by not
>> putting it through the HPD event thread and hence trying to reduce the
>> window of the issue.
> 
> Perhaps I'm missing something in the race that you are trying to
> describe (and which I've asked you to describe in more detail so that I
> don't have to spend more time trying to come up with a reproducer
> myself).
> 

The race condition is between the time we get disconnect event and set 
link_ready to false, a commit can come in. Because setting link_ready to 
false happens in the event thread so it could be slightly delayed.

It will be hard to reproduce this. Only way I can think of is to delay 
the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify()

         else if (dp_display->link_ready && status == 
connector_status_disconnected)
                 dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0);

as dp_add_event() will add the event, then wakeup the event_q.

Before the event thread wakes up and processes this unplug event, the 
commit can come in. This is the race condition i was thinking of.

> I do understand how your patch works, but my point is that it does
> not fix the race that we are hitting on sc8280xp and, unless I'm missing
> something, it is not even sufficient to fix the race you are talking
> about as user space can still trigger that ioctl() before you clear the
> link_ready flag.
> 
> That's why I said that it is only papering over the issue by making the
> race window smaller (and this should also be highlighted in the commit
> message).
> 

Yes, I have already accepted this part. It only reduces the race window 
smaller.

> For some reason it also made things worse on sc8280xp, but I did not
> spend time on tracking down exactly why.
> 

This part I agree. I need to check why sc8280xp again does not like this 
patch. You dont have to spend time, I will do it and till then I will 
hold this patch off.

> Johan

Johan Hovold March 15, 2024, 3:57 p.m. UTC | #10

On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote:
> On 3/14/2024 8:38 AM, Johan Hovold wrote:
> > On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:

> > Perhaps I'm missing something in the race that you are trying to
> > describe (and which I've asked you to describe in more detail so that I
> > don't have to spend more time trying to come up with a reproducer
> > myself).

> The race condition is between the time we get disconnect event and set 
> link_ready to false, a commit can come in. Because setting link_ready to 
> false happens in the event thread so it could be slightly delayed.

I get this part, just not why, or rather when, that becomes a problem.

Once the disconnect event is processed, dp_hpd_unplug_handle() will
update the state to ST_DISCONNECT_PENDING, and queue a notification
event. link_ready is (before this patch) still set to 1.

Here a commit comes in; what exactly are you suggesting would trigger
that? And in such a way that it breaks the state machine?

One way this could cause trouble is if you end up with a call to
dp_bridge_atomic_post_disable() which updates the state to
ST_DISCONNECTED. (1)

This would then need to be followed by another call to
dp_bridge_atomic_enable() which bails out early with the link clock
disabled. (2) (And if link_ready were to be set to 0 sooner, the
likelihood of this is reduced.)

This in turn, would trigger a reset when dp_bridge_atomic_disable() is
later called.

This is the kind of description of the race I expect to see in the
commit message, and I'm still not sure what would trigger the call to
dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1)
and (2) above) and whether this is a real issue or not.

Also note that the above scenario is quite different from the one I've
hit and described earlier.

> It will be hard to reproduce this. Only way I can think of is to delay 
> the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify()
> 
>          else if (dp_display->link_ready && status == 
> connector_status_disconnected)
>                  dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0);
> 
> as dp_add_event() will add the event, then wakeup the event_q.

Sure that would increase the race window with the current code, but that
alone isn't enough to trigger the bug AFAICT.

> Before the event thread wakes up and processes this unplug event, the 
> commit can come in. This is the race condition i was thinking of.

Yes, but what triggers the commit? And why would it lead to a mode set
that disables the bridge?

Johan

Abhinav Kumar March 18, 2024, 6:01 p.m. UTC | #11

On 3/15/2024 8:57 AM, Johan Hovold wrote:
> On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote:
>> On 3/14/2024 8:38 AM, Johan Hovold wrote:
>>> On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:
> 
>>> Perhaps I'm missing something in the race that you are trying to
>>> describe (and which I've asked you to describe in more detail so that I
>>> don't have to spend more time trying to come up with a reproducer
>>> myself).
> 
>> The race condition is between the time we get disconnect event and set
>> link_ready to false, a commit can come in. Because setting link_ready to
>> false happens in the event thread so it could be slightly delayed.
> 
> I get this part, just not why, or rather when, that becomes a problem.
> 
> Once the disconnect event is processed, dp_hpd_unplug_handle() will
> update the state to ST_DISCONNECT_PENDING, and queue a notification
> event. link_ready is (before this patch) still set to 1.
> 

This is the case I am thinking of:

1) Disconnect event happens which will call dp_hpd_unplug_handle() but 
link_ready is not false yet.

2) There is a commit with a modeset, which shall trigger 
atomic_disable() followed by an atomic_enable()

atomic_disable() will go through disable clocks and set hpd_state to 
ST_DISCONNECTED.

3) atomic_enable() will not go through because we will bail out because 
state was ST_DISCONNECTED.

         if (state != ST_DISPLAY_OFF && state != ST_MAINLINK_READY) {
                 mutex_unlock(&dp_display->event_mutex);
                 return;
         }

4) Now, if there is another commit with a modeset, it will go and crash 
at atomic_disable()

> Here a commit comes in; what exactly are you suggesting would trigger
> that? And in such a way that it breaks the state machine?
> 

Like we have seen, the commit can either come directly from userspace as 
one last frame (the original bug I had given the link to) or from the 
__drm_fb_helper_restore_fbdev_mode_unlocked() which happened in 
sc8280xp's case. This is totally independent of the hpd_thread() with no 
mutual exclusion.

This commit() can come before the link_ready was set to false. If it had 
come after link_ready was set to false, atomic_check() would have failed 
and no issue would have been seen.

My change is making the link_ready false sooner in the disconnect case.

> One way this could cause trouble is if you end up with a call to
> dp_bridge_atomic_post_disable() which updates the state to
> ST_DISCONNECTED. (1)
> 
> This would then need to be followed by another call to
> dp_bridge_atomic_enable() which bails out early with the link clock
> disabled. (2) (And if link_ready were to be set to 0 sooner, the
> likelihood of this is reduced.)
> 
> This in turn, would trigger a reset when dp_bridge_atomic_disable() is
> later called.
> 

Yes, this is exactly what I have written above.

> This is the kind of description of the race I expect to see in the
> commit message, and I'm still not sure what would trigger the call to
> dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1)
> and (2) above) and whether this is a real issue or not.
> 

I have explained what triggers the disable/enable call below.

> Also note that the above scenario is quite different from the one I've
> hit and described earlier.
> 

Why is that so? Eventually it will also translate to the same scenario. 
I would like to understand why this is different. I think in your case, 
probably we do not know what triggers the modeset, but its a minor 
detail like I have written before.

>> It will be hard to reproduce this. Only way I can think of is to delay
>> the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify()
>>
>>           else if (dp_display->link_ready && status ==
>> connector_status_disconnected)
>>                   dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0);
>>
>> as dp_add_event() will add the event, then wakeup the event_q.
> 
> Sure that would increase the race window with the current code, but that
> alone isn't enough to trigger the bug AFAICT.
> 
>> Before the event thread wakes up and processes this unplug event, the
>> commit can come in. This is the race condition i was thinking of.
> 
> Yes, but what triggers the commit? And why would it lead to a mode set
> that disables the bridge?
> 

Commit was triggered from the userspace as it did not process the 
disconnect event on time and the userspace was triggering a couple of 
modesets by by changing the mode on the CRTC from 1080P to NONE to 1080P.

[drm:drm_atomic_helper_check_modeset] [CRTC:60:crtc-1] mode changed

> Johan

Johan Hovold March 21, 2024, 4:41 p.m. UTC | #12

On Mon, Mar 18, 2024 at 11:01:25AM -0700, Abhinav Kumar wrote:
> On 3/15/2024 8:57 AM, Johan Hovold wrote:
> > On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote:

> >> The race condition is between the time we get disconnect event and set
> >> link_ready to false, a commit can come in. Because setting link_ready to
> >> false happens in the event thread so it could be slightly delayed.
> > 
> > I get this part, just not why, or rather when, that becomes a problem.
> > 
> > Once the disconnect event is processed, dp_hpd_unplug_handle() will
> > update the state to ST_DISCONNECT_PENDING, and queue a notification
> > event. link_ready is (before this patch) still set to 1.

> This is the case I am thinking of:
> 
> 1) Disconnect event happens which will call dp_hpd_unplug_handle() but 
> link_ready is not false yet.
> 
> 2) There is a commit with a modeset, which shall trigger 
> atomic_disable() followed by an atomic_enable()
> 
> atomic_disable() will go through disable clocks and set hpd_state to 
> ST_DISCONNECTED.
> 
> 3) atomic_enable() will not go through because we will bail out because 
> state was ST_DISCONNECTED.
> 
>          if (state != ST_DISPLAY_OFF && state != ST_MAINLINK_READY) {
>                  mutex_unlock(&dp_display->event_mutex);
>                  return;
>          }
> 
> 4) Now, if there is another commit with a modeset, it will go and crash 
> at atomic_disable()

Right, that's what I described in the mail you replied to but that still
doesn't answer what triggers those mode sets.
 
> > Here a commit comes in; what exactly are you suggesting would trigger
> > that? And in such a way that it breaks the state machine?

> Like we have seen, the commit can either come directly from userspace as 
> one last frame (the original bug I had given the link to) or from the 
> __drm_fb_helper_restore_fbdev_mode_unlocked() which happened in 
> sc8280xp's case. This is totally independent of the hpd_thread() with no 
> mutual exclusion.

Right . Still not sure about the details about "that last frame" issue,
that you saw in the past, and if that's still an issue or not. You
claimed that you had fixed that, right?

> This commit() can come before the link_ready was set to false. If it had 
> come after link_ready was set to false, atomic_check() would have failed 
> and no issue would have been seen.
> 
> My change is making the link_ready false sooner in the disconnect case.

Yes, but again, and as you have confirmed, you're only papering over the
issue at such a mode set can still come in before you set link_state to
false.
 
> > One way this could cause trouble is if you end up with a call to
> > dp_bridge_atomic_post_disable() which updates the state to
> > ST_DISCONNECTED. (1)
> > 
> > This would then need to be followed by another call to
> > dp_bridge_atomic_enable() which bails out early with the link clock
> > disabled. (2) (And if link_ready were to be set to 0 sooner, the
> > likelihood of this is reduced.)
> > 
> > This in turn, would trigger a reset when dp_bridge_atomic_disable() is
> > later called.

> Yes, this is exactly what I have written above.

Thanks for confirming.

> > This is the kind of description of the race I expect to see in the
> > commit message, and I'm still not sure what would trigger the call to
> > dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1)
> > and (2) above) and whether this is a real issue or not.
> 
> I have explained what triggers the disable/enable call below.
> 
> > Also note that the above scenario is quite different from the one I've
> > hit and described earlier.

> Why is that so? Eventually it will also translate to the same scenario. 
> I would like to understand why this is different. I think in your case, 
> probably we do not know what triggers the modeset, but its a minor 
> detail like I have written before.

The state transitions are different and the enable event comes in
before the bridge has been fully tore down unlike in the scenario we
outlined above.

And it's certainly not a minor detail, as in the sc8280xp VT case,
those spurious hotplug events that trigger the atomic_enable would not
have caused any trouble if it wasn't for the case that the bridge was
stuck in the ST_MAINLINK_READY state.

That explains why the hotplug notification revert in rc7 made a
difference on sc8280xp. 

You're talking about an entirely different and, as far as I can tell,
hypothetical scenario where are user executes a modeset while pulling
the plug. This is certainly not why we had a number of user suddenly
starting to hit this crash after they upgraded to 6.8-rc1.

And, just to be clear, we know what triggers the modeset in the VT case,
and I posted a detailed explanation with a strack trace here:

	https://lore.kernel.org/lkml/Ze8Ke_M2xHyPYCu-@hovoldconsulting.com/

> >> It will be hard to reproduce this. Only way I can think of is to delay
> >> the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify()
> >>
> >>           else if (dp_display->link_ready && status ==
> >> connector_status_disconnected)
> >>                   dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0);
> >>
> >> as dp_add_event() will add the event, then wakeup the event_q.
> > 
> > Sure that would increase the race window with the current code, but that
> > alone isn't enough to trigger the bug AFAICT.
> > 
> >> Before the event thread wakes up and processes this unplug event, the
> >> commit can come in. This is the race condition i was thinking of.
> > 
> > Yes, but what triggers the commit? And why would it lead to a mode set
> > that disables the bridge?

> Commit was triggered from the userspace as it did not process the 
> disconnect event on time and the userspace was triggering a couple of 
> modesets by by changing the mode on the CRTC from 1080P to NONE to 1080P.
> 
> [drm:drm_atomic_helper_check_modeset] [CRTC:60:crtc-1] mode changed

But *why* would user space do that? Pushing out another frame would
generally not trigger a modeset, right?

And as I've alluded to repeatedly, your patch only seems concerned with
something like the above, where a hypothetical user space is triggering
modesets after receiving a notification.

And we know that that is not relevant for the crashes I've seen as there
is no user space processing any events in my VT or X setup.

Johan

drm/msm/dp: move link_ready out of HPD event thread

Commit Message

Comments

Patch