diff mbox series

x86: irq: Do not BUG_ON multiple unbind calls for shared pirqs

Message ID 1562165173-31383-1-git-send-email-vrd@amazon.de (mailing list archive)
State New, archived
Headers show
Series x86: irq: Do not BUG_ON multiple unbind calls for shared pirqs | expand

Commit Message

Varad Gautam July 3, 2019, 2:46 p.m. UTC
It is possible to receive multiple __pirq_guest_unbind calls for the same pirq
if the pirq has not yet been removed from the domain's pirq_tree. For a shared
pirq (nr_guests > 1), the first call zaps the current domain from the pirq's
guests[] list, but the action handler is never freed as there are other guests
using this pirq. As a result, on the second call, __pirq_guest_unbind tries
search for the current domain which has been removed from the guests[] list,
and hits a BUG_ON.

Allow xen to continue if a shared pirq has already been unbound from this guest.
It will be cleaned up from the domain's pirq_tree during the destruction in
complete_domain_destroy.

Signed-off-by: Varad Gautam <vrd@amazon.de>

---

To apply stable-4.11 onwards.

 xen/arch/x86/irq.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Jan Beulich July 3, 2019, 3:09 p.m. UTC | #1
On 03.07.2019 16:46, Varad Gautam wrote:
> It is possible to receive multiple __pirq_guest_unbind calls for the same pirq
> if the pirq has not yet been removed from the domain's pirq_tree.

I'd appreciate if you would make clear under what conditions this can
happen, as I'm getting the impression that it's not the BUG_ON() that
wants removing here, but that instead some caller may need fixing, or
that instead the pirq tree removal needs to happen earlier. Afaict the
call from evtchn_close() can't happen more than once, for example, and
I wouldn't be surprised at all if one of the callers from
xen/drivers/passthrough/ wasn't sufficiently gated.

> To apply stable-4.11 onwards.

That's based on you having found a broken commit in the 4.11 development
window (if so, please name the commit), or simply because 4.10 is about
to go out of general support?

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1711,7 +1711,15 @@ static irq_guest_action_t *__pirq_guest_unbind(
>   
>       for ( i = 0; (i < action->nr_guests) && (action->guest[i] != d); i++ )
>           continue;
> -    BUG_ON(i == action->nr_guests);
> +    if ( i == action->nr_guests ) {
> +        /* In case the pirq was shared, unbound for this domain in an earlier call, but still
> +         * existed on the domain's pirq_tree, we still reach here if there are any later
> +         * unbind calls on the same pirq. Return if such an unbind happens. */

Style: The opening brace is misplaced, plus the comment is not
properly formatted and has overly long lines.

> +        if ( action->nr_guests > 0 && action->shareable )
> +            return NULL;

Why does the sharable aspect matter here? Or asked differently, why
can the same situation (multiple unbind requests) not arise for
non-sharable IRQs?

Similary, why would this same situation not arise for the last guest
getting unbound from the IRQ? There is an "action == NULL" check
earlier on, but if multiple calls were legitimate, then the
dprintk() there should go away (or be gated suitably) as well.

Jan
Roger Pau Monné July 16, 2019, 9:38 a.m. UTC | #2
On Wed, Jul 03, 2019 at 04:46:13PM +0200, Varad Gautam wrote:
> It is possible to receive multiple __pirq_guest_unbind calls for the same pirq
> if the pirq has not yet been removed from the domain's pirq_tree. For a shared
> pirq (nr_guests > 1), the first call zaps the current domain from the pirq's
> guests[] list, but the action handler is never freed as there are other guests
> using this pirq. As a result, on the second call, __pirq_guest_unbind tries
> search for the current domain which has been removed from the guests[] list,
> and hits a BUG_ON.

Thanks for digging into this, the passthrough code is quite complex
and not easy to debug.

As raised by Jan, I'm not sure I see how the above can happen. For
PIRQs bound to event channels the freeing can only happen once, as
the event channel is afterwards closed and further calls won't make
progress (regardless of whether the underlying PIRQ is shared or not).

So I assume this is something that you have seen with HVM guests and
dpci?

Getting the full trace that leads to the BUG_ON would be very
helpful, because as pointed out this is likely something that wants
to be fixed in the caller of pirq_guest_unbind.

Do you have a reliable way to reproduce?

> Allow xen to continue if a shared pirq has already been unbound from this guest.
> It will be cleaned up from the domain's pirq_tree during the destruction in
> complete_domain_destroy.
> 
> Signed-off-by: Varad Gautam <vrd@amazon.de>
> 
> ---
> 
> To apply stable-4.11 onwards.

Do you also have a bisection of when the issue was introduced?

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 094c3c5..256f503 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1711,7 +1711,15 @@  static irq_guest_action_t *__pirq_guest_unbind(
 
     for ( i = 0; (i < action->nr_guests) && (action->guest[i] != d); i++ )
         continue;
-    BUG_ON(i == action->nr_guests);
+    if ( i == action->nr_guests ) {
+        /* In case the pirq was shared, unbound for this domain in an earlier call, but still
+         * existed on the domain's pirq_tree, we still reach here if there are any later
+         * unbind calls on the same pirq. Return if such an unbind happens. */
+        if ( action->nr_guests > 0 && action->shareable )
+            return NULL;
+        BUG();
+    }
+
     memmove(&action->guest[i], &action->guest[i+1],
             (action->nr_guests-i-1) * sizeof(action->guest[0]));
     action->nr_guests--;