
RFC/PATCH: xen: race during domain destruction [Re: [xen-4.7-testing test] 105948: regressions - FAIL]

Message ID 1487952877.5548.26.camel@citrix.com (mailing list archive)
State New, archived

Commit Message

Dario Faggioli Feb. 24, 2017, 4:14 p.m. UTC
[Adding Juergen]

On Wed, 2017-02-22 at 01:46 -0700, Jan Beulich wrote:
> > > > On 22.02.17 at 01:02, <andrew.cooper3@citrix.com> wrote:
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d080126e70>]
> > sched_credit2.c#vcpu_is_migrateable+0x22/0x9a
> > (XEN)    [<ffff82d080129763>]
> > sched_credit2.c#csched2_schedule+0x823/0xb4e
> > (XEN)    [<ffff82d08012c17e>] schedule.c#schedule+0x108/0x609
> > (XEN)    [<ffff82d08012f8bd>] softirq.c#__do_softirq+0x7f/0x8a
> > (XEN)    [<ffff82d08012f912>] do_softirq+0x13/0x15
> > (XEN)    [<ffff82d080164b17>] domain.c#idle_loop+0x55/0x62
> > (XEN)
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 14:
> > (XEN) Assertion 'd->cpupool != NULL' failed at
> > ...5948.build-amd64/xen/xen/include/xen/sched-if.h:200
> > (XEN) ****************************************
> > (XEN)
> > (XEN) Manual reset required ('noreboot' specified)
> > 
> > I am guessing the most recent credit2 backports weren't quite so
> > safe?
> 
Well, what I'd say we're facing is the surfacing of a latent bug.

> However, comparing with the staging version of the file
> (which is heavily different), the immediate code involved here isn't
> all that different, so I wonder whether (a) this is a problem on
> staging too or (b) we're missing another backport. Dario?
> 
So, according to my investigation, this is a genuine race. It affects
this branch as well as staging, but it manifests less frequently (or, I
should say, very rarely) in the latter.

The problem is that the Credit2's load balancer operates not only on
runnable vCPUs, but also on blocked, sleeping, and paused ones (and
that's by design).

In this case, the original domain is in the process of being destroyed,
after migration has completed, and reaches the point where, within
domain_destroy(), we call cpupool_rm_domain(). This removes the domain
from its cpupool and sets d->cpupool = NULL.
Then, on another pCPU, since the vCPUs of the domain are still around
(until we call sched_destroy_vcpu(), which happens much later) and are
still assigned to a Credit2 runqueue, balance_load() picks one of them
up for moving to another runqueue, and things explode when we realize
that the vCPU is actually out of any pool!
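
For reference, the assertion that fires lives in the
cpupool_domain_cpumask() helper in xen/include/xen/sched-if.h, which the
scheduler uses to tell on which pCPUs a domain's vCPUs may run. A
simplified sketch of it, quoted from memory of the 4.7-era tree (field
names may differ slightly), is:

    static inline cpumask_t *cpupool_domain_cpumask(struct domain *d)
    {
        /*
         * d->cpupool is NULL only for the idle domain, which is never
         * load balanced... unless, as in this race, the domain has
         * already gone through cpupool_rm_domain().
         */
        ASSERT(d->cpupool != NULL);
        return d->cpupool->cpu_valid;
    }

It is perfectly reasonable for this helper to assume a valid cpupool;
the problem is that, during destruction, the vCPUs stay reachable from
the runqueues after that assumption stops holding.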

So, I've thought quite a bit about how to solve this. Possibilities are
to act at the Credit2 level, or outside of it.

I drafted a couple of solutions affecting only sched_credit2.c, but was
not satisfied with the results. That's because I ultimately think it
should be safe for a scheduler to play with any vCPU it can reach, and
that means the vCPU must be in a pool.

And that's why I came up with the patch below.

This is a draft and is on top of staging-4.7. I will properly submit it
against staging, if you agree with me it's an ok thing to do.

Basically, I move the call to sched_destroy_vcpu() a bit earlier, so
that it happens before cpupool_rm_domain(). This ensures that vCPUs
have valid cpupool information up to the very last moment at which they
are accessible from a scheduler.
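
Schematically (this is only a sketch of the call ordering, based on the
description above and on the diff at the bottom of this page, not
actual code):

    /*
     * Current ordering:
     *
     *   domain_kill()
     *       cpupool_move_domain(d, cpupool0);
     *   domain_destroy()
     *       cpupool_rm_domain(d);      <- d->cpupool becomes NULL
     *   complete_domain_destroy()      <- RCU callback, much later
     *       sched_destroy_vcpu(v);     <- vCPUs leave the scheduler only here
     *
     * With the draft below:
     *
     *   domain_kill()
     *       cpupool_move_domain(d, cpupool0);
     *       sched_destroy_vcpu(v);     <- vCPUs leave the scheduler while
     *                                     d->cpupool is still valid
     *   domain_destroy()
     *       cpupool_rm_domain(d);      <- no longer races with balance_load()
     */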

---
Let me know.

Thanks and Regards,
Dario

Comments

Dario Faggioli Feb. 26, 2017, 3:53 p.m. UTC | #1
On Fri, 2017-02-24 at 17:14 +0100, Dario Faggioli wrote:
> On Wed, 2017-02-22 at 01:46 -0700, Jan Beulich wrote:
> > However, comparing with the staging version of the file
> > (which is heavily different), the immediate code involved here
> > isn't
> > all that different, so I wonder whether (a) this is a problem on
> > staging too or (b) we're missing another backport. Dario?
> > 
> So, according to my investigation, this is a genuine race. It affects
> this branch as well as staging, but it manifests less frequently (or,
> I
> should say, very rarely) in the latter.
> 
Actually, this is probably wrong. It looks like the following commit:

 f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
 xen: fix a (latent) cpupool-related race during domain destroy

is not in staging-4.7.

At some point, while investigating, I thought I had seen it there, but
I was wrong!

So, I'd say that the proper solution is to backport that change, and
ignore the drafted patch I sent before.

In any case, I'll try doing the backport myself and test the result on
Monday (tomorrow). And I will let you know.

Regards,
Dario
Dario Faggioli Feb. 27, 2017, 3:18 p.m. UTC | #2
On Sun, 2017-02-26 at 16:53 +0100, Dario Faggioli wrote:
> On Fri, 2017-02-24 at 17:14 +0100, Dario Faggioli wrote:
> > On Wed, 2017-02-22 at 01:46 -0700, Jan Beulich wrote:
> > > 
> > > However, comparing with the staging version of the file
> > > (which is heavily different), the immediate code involved here
> > > isn't
> > > all that different, so I wonder whether (a) this is a problem on
> > > staging too or (b) we're missing another backport. Dario?
> > > 
> > So, according to my investigation, this is a genuine race. It
> > affects
> > this branch as well as staging, but it manifests less frequently
> > (or,
> > I
> > should say, very rarely) in the latter.
> > 
> Actually, this is probably wrong. It looks like the following commit:
> 
>  f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
>  xen: fix a (latent) cpupool-related race during domain destroy
> 
> is not in staging-4.7.
> 
And my testing confirms that backporting the changeset above (which
applies cleanly on staging-4.7, AFAICT) makes the problem go away.

As the changelog of that commit says, I had already seen something
similar happening during development... Sorry for not recognising it
sooner, and for failing to request a backport of that change in the
first place.

I'm therefore doing that now: I ask for backport of:

 f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
 xen: fix a (latent) cpupool-related race during domain destroy

to 4.7.

Regards,
Dario
Jan Beulich Feb. 28, 2017, 9:48 a.m. UTC | #3
>>> On 27.02.17 at 16:18, <dario.faggioli@citrix.com> wrote:
> I'm therefore doing that now: I ask for backport of:
> 
>  f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
>  xen: fix a (latent) cpupool-related race during domain destroy
> 
> to 4.7.

Thanks for working this out! Applied to 4.7-staging.

Jan

Patch

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 45273d4..4db7750 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -643,7 +643,10 @@  int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
             unmap_vcpu_info(v);
+            sched_destroy_vcpu(v);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings 
          * have to be put before we call put_domain. */
@@ -807,7 +810,6 @@  static void complete_domain_destroy(struct rcu_head *head)
             continue;
         tasklet_kill(&v->continue_hypercall_tasklet);
         vcpu_destroy(v);
-        sched_destroy_vcpu(v);
         destroy_waitqueue_vcpu(v);
     }
---