Message ID | 1487952877.5548.26.camel@citrix.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, 2017-02-24 at 17:14 +0100, Dario Faggioli wrote:
> On Wed, 2017-02-22 at 01:46 -0700, Jan Beulich wrote:
> > However, comparing with the staging version of the file
> > (which is heavily different), the immediate code involved here
> > isn't all that different, so I wonder whether (a) this is a
> > problem on staging too or (b) we're missing another backport.
> > Dario?
> >
> So, according to my investigation, this is a genuine race. It affects
> this branch as well as staging, but it manifests less frequently (or,
> I should say, very rarely) in the latter.
>
Actually, this is probably wrong. It looks like the following commit:

 f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
 xen: fix a (latent) cpupool-related race during domain destroy

is not in staging-4.7. At some point, while investigating, I thought I
had seen it there, but I was wrong!

So, I'd say that the proper solution is to backport that change, and
ignore the drafted patch I sent before.

In any case, I'll try doing the backport myself and test the result on
Monday (tomorrow). And I will let you know.

Regards,
Dario

On Sun, 2017-02-26 at 16:53 +0100, Dario Faggioli wrote:
> On Fri, 2017-02-24 at 17:14 +0100, Dario Faggioli wrote:
> > On Wed, 2017-02-22 at 01:46 -0700, Jan Beulich wrote:
> > >
> > > However, comparing with the staging version of the file
> > > (which is heavily different), the immediate code involved here
> > > isn't all that different, so I wonder whether (a) this is a
> > > problem on staging too or (b) we're missing another backport.
> > > Dario?
> > >
> > So, according to my investigation, this is a genuine race. It
> > affects this branch as well as staging, but it manifests less
> > frequently (or, I should say, very rarely) in the latter.
> >
> Actually, this is probably wrong. It looks like the following commit:
>
>  f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
>  xen: fix a (latent) cpupool-related race during domain destroy
>
> is not in staging-4.7.
>
And my testing confirms that backporting the changeset above (which
just applies cleanly on staging-4.7, AFAICT) makes the problem go away.

As the changelog of that commit says, I've even seen something similar
happening already during my development... Sorry I did not recognise it
sooner, and for failing to request backport of that change in the first
place.

I'm therefore doing that now: I ask for backport of:

 f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
 xen: fix a (latent) cpupool-related race during domain destroy

to 4.7.

Regards,
Dario

>>> On 27.02.17 at 16:18, <dario.faggioli@citrix.com> wrote:
> I'm therefore doing that now: I ask for backport of:
>
>  f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
>  xen: fix a (latent) cpupool-related race during domain destroy
>
> to 4.7.

Thanks for working this out! Applied to 4.7-staging.

Jan

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 45273d4..4db7750 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -643,7 +643,10 @@ int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
             unmap_vcpu_info(v);
+            sched_destroy_vcpu(v);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings
          * have to be put before we call put_domain. */
@@ -807,7 +810,6 @@ static void complete_domain_destroy(struct rcu_head *head)
             continue;
         tasklet_kill(&v->continue_hypercall_tasklet);
         vcpu_destroy(v);
-        sched_destroy_vcpu(v);
         destroy_waitqueue_vcpu(v);
     }

---