| Message ID | 1403765395-16978-1-git-send-email-pgaikwad@nvidia.com (mailing list archive) |
| --- | --- |
| State | New, archived |
Hello,

On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
> Unconditional copying cpu_online_mask to affinity
> may result in migrating affinity to wrong CPU.

We have a bug, but I don't follow your reasoning.

> For example, IRQ 5 affinity mask contains CPU 4-7,

Ok, so d->affinity is 0xf0...

> it was affined to CPU4 and CPU 0-7 are online.

...and cpu_online_mask is 0xff.

> Now if we hot-unplug CPU4 then with current
> implementation affinity mask will contain
> CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.

cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
since there is an intersection of 0xf0. That means ret is false.

The bug is that we then do affinity = cpu_online_mask; unconditionally,
but we *won't* do the cpumask_copy, since ret is false.

You can fix this by simply bringing the arm64 code into line with the arm
code, which begs the question as to why this has to exist in the arch/
backend at all!

Will
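[Editorial note: for reference, the pre-patch affinity handling in arch/arm64/kernel/irq.c that this analysis walks through looks roughly as below; it is reconstructed from the lines the diff further down removes, with variable declarations omitted.]

	/* ret stays false when the affinity mask still intersects the online mask */
	if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids)
		ret = true;

	/*
	 * when using forced irq_set_affinity we must ensure that the cpu
	 * being offlined is not present in the affinity mask, it may be
	 * selected as the target CPU otherwise
	 */
	affinity = cpu_online_mask;	/* done unconditionally -- the bug */

	c = irq_data_get_irq_chip(d);
	if (!c->irq_set_affinity)
		pr_debug("IRQ%u: unable to set affinity\n", d->irq);
	else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
		cpumask_copy(d->affinity, affinity);	/* skipped when ret == false */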
On Thu, 2014-06-26 at 15:50 +0530, Will Deacon wrote:
> Hello,
>
> On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
> > Unconditional copying cpu_online_mask to affinity
> > may result in migrating affinity to wrong CPU.
>
> We have a bug, but I don't follow your reasoning.
>
> > For example, IRQ 5 affinity mask contains CPU 4-7,
>
> Ok, so d->affinity is 0xf0...
>
> > it was affined to CPU4 and CPU 0-7 are online.
>
> ...and cpu_online_mask is 0xff.
>
> > Now if we hot-unplug CPU4 then with current
> > implementation affinity mask will contain
> > CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.
>
> cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
> since there is an intersection of 0xf0. That means ret is false.
>
> The bug is that we then do affinity = cpu_online_mask; unconditionally,
> but we *won't* do the cpumask_copy, since ret is false.

We do not copy, but the affinity mask passed to the irq_set_affinity
function is nothing but cpu_online_mask. So in GIC it will set affinity
to CPU0.

> You can fix this by simply bringing the arm64 code into line with the arm
> code, which begs the question as to why this has to exist in the arch/
> backend at all!

Where can we move this code?

> Will
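[Editorial note: for context, the target-CPU selection in the GIC driver's irq_set_affinity callback at the time of this thread looks roughly like the simplified sketch below (register programming and the GIC CPU-interface limit check omitted). With affinity forced to cpu_online_mask, either branch ends up picking CPU0, the lowest-numbered online CPU.]

	static int gic_set_affinity(struct irq_data *d,
				    const struct cpumask *mask_val, bool force)
	{
		unsigned int cpu;

		if (!force)
			/* only consider CPUs that are both requested and online */
			cpu = cpumask_any_and(mask_val, cpu_online_mask);
		else
			/* trust the caller: take the first CPU in the mask as-is */
			cpu = cpumask_first(mask_val);

		if (cpu >= nr_cpu_ids)
			return -EINVAL;

		/* ... route the interrupt to the chosen cpu ... */
		return IRQ_SET_MASK_OK;
	}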
On Thu, Jun 26, 2014 at 01:00:24PM +0100, Prashant Gaikwad wrote:
> On Thu, 2014-06-26 at 15:50 +0530, Will Deacon wrote:
> > On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
> > > Unconditional copying cpu_online_mask to affinity
> > > may result in migrating affinity to wrong CPU.
> >
> > We have a bug, but I don't follow your reasoning.
> >
> > > For example, IRQ 5 affinity mask contains CPU 4-7,
> >
> > Ok, so d->affinity is 0xf0...
> >
> > > it was affined to CPU4 and CPU 0-7 are online.
> >
> > ...and cpu_online_mask is 0xff.
> >
> > > Now if we hot-unplug CPU4 then with current
> > > implementation affinity mask will contain
> > > CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.
> >
> > cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
> > since there is an intersection of 0xf0. That means ret is false.
> >
> > The bug is that we then do affinity = cpu_online_mask; unconditionally,
> > but we *won't* do the cpumask_copy, since ret is false.
>
> We do not copy, but the affinity mask passed to the irq_set_affinity
> function is nothing but cpu_online_mask. So in GIC it will set affinity
> to CPU0.

Exactly, but your proposed patch changed more than that.

> > You can fix this by simply bringing the arm64 code into line with the arm
> > code, which begs the question as to why this has to exist in the arch/
> > backend at all!
>
> Where can we move this code?

kernel/irq/migration.c?

Will
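[Editorial note: purely to illustrate the consolidation idea raised here -- this is a hypothetical sketch, not what was merged, and the irq_migrate_all_off_this_cpu name and placement are illustrative only -- the per-arch migrate_irqs() loop that arm and arm64 both duplicate could live in generic code along these lines.]

	/* Hypothetical generic version of the per-arch migrate_irqs() loop. */
	void irq_migrate_all_off_this_cpu(void)
	{
		unsigned int irq;
		unsigned long flags;

		local_irq_save(flags);

		for_each_active_irq(irq) {
			struct irq_desc *desc = irq_to_desc(irq);
			bool affinity_broken;

			raw_spin_lock(&desc->lock);
			affinity_broken = migrate_one_irq(desc);
			raw_spin_unlock(&desc->lock);

			if (affinity_broken)
				pr_warn("IRQ%u: no longer affine to CPU%u\n",
					irq, smp_processor_id());
		}

		local_irq_restore(flags);
	}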
On Thu, 2014-06-26 at 18:41 +0530, Will Deacon wrote:
> On Thu, Jun 26, 2014 at 01:00:24PM +0100, Prashant Gaikwad wrote:
> > On Thu, 2014-06-26 at 15:50 +0530, Will Deacon wrote:
> > > On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
> > > > Unconditional copying cpu_online_mask to affinity
> > > > may result in migrating affinity to wrong CPU.
> > >
> > > We have a bug, but I don't follow your reasoning.
> > >
> > > > For example, IRQ 5 affinity mask contains CPU 4-7,
> > >
> > > Ok, so d->affinity is 0xf0...
> > >
> > > > it was affined to CPU4 and CPU 0-7 are online.
> > >
> > > ...and cpu_online_mask is 0xff.
> > >
> > > > Now if we hot-unplug CPU4 then with current
> > > > implementation affinity mask will contain
> > > > CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.
> > >
> > > cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
> > > since there is an intersection of 0xf0. That means ret is false.
> > >
> > > The bug is that we then do affinity = cpu_online_mask; unconditionally,
> > > but we *won't* do the cpumask_copy, since ret is false.
> >
> > We do not copy, but the affinity mask passed to the irq_set_affinity
> > function is nothing but cpu_online_mask. So in GIC it will set affinity
> > to CPU0.
>
> Exactly, but your proposed patch changed more than that.

I am changing the force flag to false. That is because after I fix this
behavior we have another bug, where the IRQ affinity is set to an offline
CPU.

When cpumask_any_and(affinity, cpu_online_mask) returns < nr_cpu_ids we
pass the affinity mask as it is, which contains the offline CPU too, and
if the force flag is true then the GIC driver skips the online-CPU check.
If CPU0 is going down then the affinity mask will have CPU0 and the GIC
driver will keep the affinity on CPU0. Changing the force flag to false
ensures that the GIC driver checks for an online CPU.

> > > You can fix this by simply bringing the arm64 code into line with the arm
> > > code, which begs the question as to why this has to exist in the arch/
> > > backend at all!
> >
> > Where can we move this code?
>
> kernel/irq/migration.c?
>
> Will
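[Editorial note: a minimal sketch of the scenario described here, assuming CPU0 is the CPU going down and the IRQ's affinity mask is {CPU0, CPU1}; the surrounding code is the migrate_one_irq() logic from the patch, condensed.]

	const struct cpumask *affinity = d->affinity;	/* contains CPU0 and CPU1 */

	/*
	 * CPU0 has already been cleared from cpu_online_mask, but CPU1 is
	 * still online, so there is an intersection and the mask is kept as-is.
	 */
	if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids)
		affinity = cpu_online_mask;	/* not taken in this scenario */

	/*
	 * force == true : the GIC takes cpumask_first(affinity) == CPU0, i.e.
	 *                 the CPU that is being offlined -- the second bug.
	 * force == false: the GIC takes cpumask_any_and(affinity, cpu_online_mask),
	 *                 which skips CPU0 and lands on CPU1.
	 */
	c->irq_set_affinity(d, affinity, false);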
Hi Will,

On 26/06/14 11:20, Will Deacon wrote:
> Hello,
>
> On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
>> Unconditional copying cpu_online_mask to affinity
>> may result in migrating affinity to wrong CPU.
>
> We have a bug, but I don't follow your reasoning.
>
>> For example, IRQ 5 affinity mask contains CPU 4-7,
>
> Ok, so d->affinity is 0xf0...
>
>> it was affined to CPU4 and CPU 0-7 are online.
>
> ...and cpu_online_mask is 0xff.
>
>> Now if we hot-unplug CPU4 then with current
>> implementation affinity mask will contain
>> CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.
>
> cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
> since there is an intersection of 0xf0. That means ret is false.
>
> The bug is that we then do affinity = cpu_online_mask; unconditionally,
> but we *won't* do the cpumask_copy, since ret is false.
>
> You can fix this by simply bringing the arm64 code into line with the arm
> code, which begs the question as to why this has to exist in the arch/
> backend at all!
>

The unconditional assignment was added by me to fix the CPU0 hotplug issue
explained in commit 601c942176d8, which is wrong, as is evident from the
above use case. It was added to retain the forced irq_set_affinity. The
difference between arm and arm64 exists because arm doesn't have the
patch [1].

We can move to irq_set_affinity without the force option, as this patch
does. I had mentioned a similar solution [2], but Russell wants to get
feedback from tglx [3].

And yes, I see similar implementations for many architectures; they can
definitely be unified.

Regards,
Sudeep

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/254838.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/259255.html
[3] http://www.spinics.net/lists/arm-kernel/msg340279.html
Hi,

On 26/06/14 14:40, Prashant Gaikwad wrote:
> On Thu, 2014-06-26 at 18:41 +0530, Will Deacon wrote:
>> On Thu, Jun 26, 2014 at 01:00:24PM +0100, Prashant Gaikwad wrote:
>>> On Thu, 2014-06-26 at 15:50 +0530, Will Deacon wrote:
>>>> On Thu, Jun 26, 2014 at 07:49:55AM +0100, Prashant Gaikwad wrote:
>>>>> Unconditional copying cpu_online_mask to affinity
>>>>> may result in migrating affinity to wrong CPU.
>>>>
>>>> We have a bug, but I don't follow your reasoning.
>>>>
>>>>> For example, IRQ 5 affinity mask contains CPU 4-7,
>>>>
>>>> Ok, so d->affinity is 0xf0...
>>>>
>>>>> it was affined to CPU4 and CPU 0-7 are online.
>>>>
>>>> ...and cpu_online_mask is 0xff.
>>>>
>>>>> Now if we hot-unplug CPU4 then with current
>>>>> implementation affinity mask will contain
>>>>> CPU 0-3,5-7 and IRQ 5 will be affined to CPU0.
>>>>
>>>> cpumask_any_and(affinity, cpu_online_mask) will return a value < nr_cpu_ids
>>>> since there is an intersection of 0xf0. That means ret is false.
>>>>
>>>> The bug is that we then do affinity = cpu_online_mask; unconditionally,
>>>> but we *won't* do the cpumask_copy, since ret is false.
>>>>
>>>
>>> We do not copy, but the affinity mask passed to the irq_set_affinity
>>> function is nothing but cpu_online_mask. So in GIC it will set affinity
>>> to CPU0.
>>
>> Exactly, but your proposed patch changed more than that.
>>
>
> I am changing the force flag to false. That is because after I fix this
> behavior we have another bug, where the IRQ affinity is set to an offline
> CPU.
>

That's correct; it's the original issue I saw and fixed incorrectly which
triggered the bug you have now.

The main reason to retain the force flag as true is that the
implementation is irqchip specific. The GIC implements it the way you
explained, but some other irqchip implementation might behave differently.
I believe that's the reason why Russell wants to get feedback from tglx.

Regards,
Sudeep
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 0f08dfd..dfa6e3e 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -97,19 +97,15 @@ static bool migrate_one_irq(struct irq_desc *desc)
 	if (irqd_is_per_cpu(d) || !cpumask_test_cpu(smp_processor_id(), affinity))
 		return false;
 
-	if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids)
+	if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
+		affinity = cpu_online_mask;
 		ret = true;
+	}
 
-	/*
-	 * when using forced irq_set_affinity we must ensure that the cpu
-	 * being offlined is not present in the affinity mask, it may be
-	 * selected as the target CPU otherwise
-	 */
-	affinity = cpu_online_mask;
 	c = irq_data_get_irq_chip(d);
 	if (!c->irq_set_affinity)
 		pr_debug("IRQ%u: unable to set affinity\n", d->irq);
-	else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
+	else if (c->irq_set_affinity(d, affinity, false) == IRQ_SET_MASK_OK && ret)
 		cpumask_copy(d->affinity, affinity);
 
 	return ret;
Unconditional copying of cpu_online_mask to affinity may result in
migrating affinity to the wrong CPU. For example, IRQ 5's affinity mask
contains CPUs 4-7, it is affined to CPU4, and CPUs 0-7 are online. Now
if we hot-unplug CPU4 then with the current implementation the affinity
mask will contain CPUs 0-3,5-7 and IRQ 5 will be affined to CPU0.

Instead, copy cpu_online_mask to affinity only if no online CPU is
present in the affinity mask, and do not force the affinity setting, so
that the irqchip driver performs the CPU online check.

Signed-off-by: Prashant Gaikwad <pgaikwad@nvidia.com>
---
 arch/arm64/kernel/irq.c | 12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)
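[Editorial note: a quick stand-alone illustration of the changelog example, written as a user-space sketch with plain bitmasks rather than kernel cpumask code.]

	#include <stdio.h>

	int main(void)
	{
		unsigned int affinity = 0xf0;	/* IRQ 5 affinity: CPUs 4-7 */
		unsigned int online   = 0xef;	/* CPUs 0-3,5-7 online after CPU4 is unplugged */

		/* Old behaviour: affinity is overwritten with the online mask,
		 * so the lowest online CPU (CPU0) becomes the target. */
		unsigned int old_mask = online;

		/* Fixed behaviour: keep the mask because it still intersects
		 * the online mask; the target stays within CPUs 5-7. */
		unsigned int new_mask = (affinity & online) ? affinity : online;

		printf("old target CPU: %d\n", __builtin_ctz(old_mask & online));
		printf("new target CPU: %d\n", __builtin_ctz(new_mask & online));
		return 0;
	}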