[GIT,PULL] arm64 fixes for 5.15-rc5

Message ID	YWCPyK+xotTgUMy/@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=fGbl=O4=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 522BB60F92 Date: Fri, 8 Oct 2021 19:36:56 +0100 From: Catalin Marinas <catalin.marinas@arm.com> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [GIT PULL] arm64 fixes for 5.15-rc5 Message-ID: <YWCPyK+xotTgUMy/@arm.com> MIME-Version: 1.0 Content-Disposition: inline Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	[GIT,PULL] arm64 fixes for 5.15-rc5 \| expand [GIT,PULL] arm64 fixes for 5.15-rc5

Catalin Marinas Oct. 8, 2021, 6:36 p.m. UTC

Hi Linus,

Please pull the arm64 fixes below. Thanks.

The following changes since commit 22b70e6f2da0a4c8b1421b00cfc3016bc9d4d9d4:

  arm64: Restore forced disabling of KPTI on ThunderX (2021-09-23 15:59:15 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-fixes

for you to fetch changes up to 0350419b14b98fd6f36801583360c36c8820c2e6:

  arm64/hugetlb: fix CMA gigantic page order for non-4K PAGE_SIZE (2021-10-06 11:11:13 +0100)

----------------------------------------------------------------
arm64 fixes:

- Avoid double accounting of irq_{entry,exit}_rcu().

- Fix CMA gigantic page order for 16K/64K page sizes.

----------------------------------------------------------------
Mark Rutland (1):
      arm64: entry: refactor EL1 interrupt entry logic

Mike Kravetz (1):
      arm64/hugetlb: fix CMA gigantic page order for non-4K PAGE_SIZE

Pingfan Liu (2):
      kernel/irq: make irq_{enter,exit}() in handle_domain_irq() arch optional
      arm64: entry: avoid double-accounting IRQ RCU entry

 arch/arm64/Kconfig               |  1 +
 arch/arm64/kernel/entry-common.c | 47 +++++++++++++++++++++-------------------
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 kernel/irq/Kconfig               |  3 +++
 kernel/irq/irqdesc.c             |  4 ++++
 5 files changed, 34 insertions(+), 23 deletions(-)

Linus Torvalds Oct. 8, 2021, 8:25 p.m. UTC | #1

On Fri, Oct 8, 2021 at 11:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Pingfan Liu (2):
>       kernel/irq: make irq_{enter,exit}() in handle_domain_irq() arch optional
>       arm64: entry: avoid double-accounting IRQ RCU entry

Ugh. This is *really* ugly. And it seems to be going exactly the wrong way.

I read the commit descriptions, and it still doesn't answer the
fundamental question of why arm64 needs to do the accounting in
arch-specific code, and disable the generic code.

It says

    To fix this, we must perform all the accounting from the architecture
    code. We prevent the IRQ domain code from performing any accounting by
    selecting HAVE_ARCH_IRQENTRY, and must call irq_enter_rcu() and
    irq_exit_rcu() around invoking the root IRQ handler.

but at no point does it actually explain *why* all the accounting
needs to be done by the architecture code.

Yes, yes, I read the previous paragraph. But why isn't the fix to just
stop doing the double accounting in the arm64 specific code?

Instead it doubles down on that "let's do this non-arch-specific
accounting in arch-specific code", making the common code uglier and
weaker.

I initially pulled this, and then I just unpulled in disgust.

Please explain why arm64 does this bad thing, and why the fix isn't
"fix arm64", but instead "make the generic code uglier and harder to
maintain and follow".

Really, from all the explanations those commits give, the natural
thing to do would be "just fix arm64".

So if that really isn't the answer, then the explanations are clearly lacking.

                   Linus

Mark Rutland Oct. 11, 2021, 10:47 a.m. UTC | #2

Hi Linus,

[adding Paul and Peter, just in case we need to discuss the RCU and
accounting bits further]

On Fri, Oct 08, 2021 at 01:25:51PM -0700, Linus Torvalds wrote:
> On Fri, Oct 8, 2021 at 11:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > Pingfan Liu (2):
> >       kernel/irq: make irq_{enter,exit}() in handle_domain_irq() arch optional
> >       arm64: entry: avoid double-accounting IRQ RCU entry
> 
> Ugh. This is *really* ugly. And it seems to be going exactly the wrong way.
> 
> I read the commit descriptions, and it still doesn't answer the
> fundamental question of why arm64 needs to do the accounting in
> arch-specific code, and disable the generic code.
> 
> It says
> 
>     To fix this, we must perform all the accounting from the architecture
>     code. We prevent the IRQ domain code from performing any accounting by
>     selecting HAVE_ARCH_IRQENTRY, and must call irq_enter_rcu() and
>     irq_exit_rcu() around invoking the root IRQ handler.
> 
> but at no point does it actually explain *why* all the accounting
> needs to be done by the architecture code.

Sorry; I agree that commit messages don't explain this thoroughly. I can
go and rework the commit messages to clarify things, if you'd like?

The TL;DR is that a bunch of constraints conspire such that we can't
defer accounting things to the irqdomain or irqchip code without making
this even more complicated/fragile, and many architectures get this
subtly wrong today -- it's not that arm64 is necessarily much different
from other architectures using this code, just that we're trying to
solve this first.

More specifically, the problem here is the combination of:

* During entry, rcu_irq_enter() must be called *before*
  trace_hardirqs_off_finish(), because the latter needs RCU to be
  watching, but we need to have informed lockdep before poking RCU or
  lockdep will complain withing the RCU code. To handle that,
  kernel/entry/common.c and arch/arm64/kernel/entry-common.c have the
  sequence:

  lockdep_hardirqs_off(CALLER_ADDR0);
  rcu_irq_enter(): // or rcu_irq_enter_check_tick();
  trace_hardirqs_off_finish();

  ... and since we don't want something to come in the middle of the
  sequence, we want those sandwiched together in a noinstr function
  called before we invoke the root irqchip's handler function.

  A bunch of architectures don't follow this sequence, and are
  potentially subtly broken today in some configurations.

  Note: the plan is to move arm64 over to the generic entry code, but
  that has no effect on this specific issue.

* rcu_irq_enter() must be called *precisely once* upon taking an
  interrupt exception, or we can end up mis-accounting quiescent periods
  as non-quiescent (since this maintains a nesting count in
  rcu_data::dynticks_nmi_nesting, checked by
  rcu_is_cpu_rrupt_from_idle()).

* Prior to these patches, we call rcu_irq_enter() at least twice on
  arm64, because the architectural entry code must call rcu_irq_enter()
  for the lockdep bits, then when we invoke the irqchip (e.g. GICv3), we
  do:

  gic_handle_irq()
    handle_domain_irq()
      irq_enter()
        rcu_irq_enter()
	irq_enter_rcu()

  ... and so if we take a sched clock IRQ and call
  rcu_sched_clock_irq(), rcu_is_cpu_rrupt_from_idle() may return false
  when it should return true(), and we don't decide to preempt the task,
  leading to stalls.

  Note that since irqchips can be chained (e.g. there can be a second
  interrupt controller fed into the GIC), handle_domain_irq() can be
  nested, so even if we somehow removed the accounting from arch code
  we'd need to handle the nested calls to handle_domain_irq() somehow.

  AFAICT it's far simpler to do that *once* outside of the irqchip code
  than it is to try to handle that nesting.

  Note that x86 doesn't use handle_domain_irq(), and just maps the raw
  irqnrs itself, and just calls irq_enter_rcu() when transitioning to
  the IRQ stack.

> Yes, yes, I read the previous paragraph. But why isn't the fix to just
> stop doing the double accounting in the arm64 specific code?

As above, that'd require moving *some* of the low-level entry logic into
the irqchip/irqdomain code, and handling nesting, which is *more*
fragile.

FWIW, we do need to fix the other architectures too, but that involves
more substantial rework to each of those (e.g. moving to generic entry),
since there are a bunch of problems today. The thinking was that this
gave a way to fix each of those on its own, then remove the old
behaviour.

Does that all make sense to you?

Thanks,
Mark.

Linus Torvalds Oct. 11, 2021, 7:54 p.m. UTC | #3

On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Sorry; I agree that commit messages don't explain this thoroughly. I can
> go and rework the commit messages to clarify things, if you'd like?

So I really hate that patch, even with explanation.

Having the Kconfig option for "do the right thing" is not how this
should be fixed.

Either the generic code is generic, or it isn't.

And if the generic code doesn't work for arm64, we shouldn't add a
random Kconfig option for "this architecture wants different
behavior".

> The TL;DR is that a bunch of constraints conspire such that we can't
> defer accounting things to the irqdomain or irqchip code without making
> this even more complicated/fragile, and many architectures get this
> subtly wrong today -- it's not that arm64 is necessarily much different
> from other architectures using this code, just that we're trying to
> solve this first.

So then I think the fix is to just say "accounting in this generic
function is wrong, and the accounting needs to be moved to the
callers".

That's particularly true if you think arm64 is only the only actually
_tested_ case that gets this wrong, and other architectures most
likely have the exact same issue, but you only fixed it for arm64.

So do it unconditionally - possibly even using a coccinelle script if
there are lots of callers.

Because generic code that just isn't generic, but randomly does
different things based on subtle Kconfig options that are different
for different architectures is bad, bad, bad.

And yes, I realize that we've had that kind of stuff before (and we
have odd Kconfig option testing in that irqdesc.c file elsewhere), but
the Kconfig options we currently have are mostly either

 (a) actual real honest-to-goodness config options with semantic
meaning (ie things like CONFIG_SMP and CONFIG_NUMA)

 (b) really ugly workarounds for odd special module exports (that
CONFIG_KVM_BOOK3S_64_HV_MODULE thing for irq_to_desc() that we *tried*
but failed to remove).

And so the reason I really hate that patch is that it introduces a new
"different architectures randomly and inexplicably do different
things, and the generic behavior is very different on arm64 than it is
elsewhere".

That's just the worst kind of hack to me.

And in this case, it's really *horribly* hard to see what the call
chain is. It all ends up being actively obfuscated and obscured
through that 'handle_arch_irq' function pointer, that is sometimes set
through set_handle_irq(), and sometimes set directly.

I really think that if the rule is "we can't do accounting in
handle_domain_irq(), because it's too late for arm64", then the fix
really should be to just not do that.

Move the irq_enter()/irq_exit() to the callers - quite possibly far up
the call chain to the root of it all, and just say "architecture code
needs to do this in the low-level code before calling
handle_arch_irq".

Or, if it turns out that 99% of callers do want it - and don't have
the issues arm64 has - maybe we can have a helper wrapper that does
the irq_enter/irq_exit, and another version that doesn't do it, and at
least it's clear to the callers which one it is. But it looks like
particularly with the odd indirection through that handle_arch_irq
function, it's best to just say "this needs to be done".

What architectures actually care? From what I can follow, it's really
GENERIC_IRQ_MULTI_HANDLER that ends up doing this all, and then arm64
has it's own special case for no obvious reason (why isn't it using
GENERIC_IRQ_MULTI_HANDLER that seems to do the EXACT same thing in
generic code except for a special "default != NULL" case?)

Anyway, it _looks_ to me like the pattern is very simple:

Step 1:
 - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.

This clearly doesn't change anything at all, but also doesn't fix the
problem you have. But it's easy to verify that the code is the same
before-and-after.

Step 2 is the pattern matching step:

 - if the caller of handle_domain_irq() ends up being a function that
is registered with set_handle_irq(), then we
   (a) remove the irq_enter/irq_exit from it
   (b) add it to the architectures that use handle_arch_irq.
   (c) make sure that if there are other callers of it (not through
handle_arch_irq) we move that irq_enter/irq_exit into them too

I _suspect_ - but didn't check - that Step 2(c) doesn't actually
exist. But who knows.

It really looks like there is a very tight connection between "uses
handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?

Is this a bit more work? Yes.

Would this fix arm64? Yes - it ends up effectively doing what you did.

Would this fix _other_ architectures doing the same thing that you
suspect are broken? YES. Instead of leaving them randomly broken.

And most importantly, it would make the rules for "generic" code
clear, and actually generic.

Now, it may be that I'm wrong. I'm willing to be convinced, and if you
beat me over the head enough I guess I can take that pull request and
just be unhappy about it.

So I'm not trying to draw some line in the sand and be an arsehole and
say "you have to do it this way". But I look at that patch in that
pull request, and it really screams "this is horribly wrong" to me.

And I _think_ the above suggestion can be done almost mechanically.

And maybe it's easier to combine step1/step2 into one thing, if that
handle_domain_irq/handle_arch_irq" use is really as tightly bound as
it seems to be from a quick look.

So I'm not saying that they have to be separate steps, I wrote them
out that way mainly to kind of show the logic of the change - and in
case there are situations where you end up calling things without
going through that handle_arch_irq thing.

                       Linus

Thomas Gleixner Oct. 12, 2021, 1:18 p.m. UTC | #4

Linus,

On Mon, Oct 11 2021 at 12:54, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@arm.com> wrote:
> And so the reason I really hate that patch is that it introduces a new
> "different architectures randomly and inexplicably do different
> things, and the generic behavior is very different on arm64 than it is
> elsewhere".
>
> That's just the worst kind of hack to me.
>
> And in this case, it's really *horribly* hard to see what the call
> chain is. It all ends up being actively obfuscated and obscured
> through that 'handle_arch_irq' function pointer, that is sometimes set
> through set_handle_irq(), and sometimes set directly.
>
> I really think that if the rule is "we can't do accounting in
> handle_domain_irq(), because it's too late for arm64", then the fix
> really should be to just not do that.
>
> Move the irq_enter()/irq_exit() to the callers - quite possibly far up
> the call chain to the root of it all, and just say "architecture code
> needs to do this in the low-level code before calling
> handle_arch_irq".

That's where it belongs. It's mandatory to have it there for NOHZ_FULL
to work correctly vs. instrumentation etc. I've pointed that out back
then after we fed the X86 entry code into the mincer and added noinstr
sections to keep tracers, BPF and kprobes away from it.

Looking at the architectures which "support" that by selecting
HAVE_CONTEXT_TRACKING:

arch/arm/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/arm64/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/csky/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/mips/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/powerpc/Kconfig:	select HAVE_CONTEXT_TRACKING		if PPC64
arch/riscv/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/sparc/Kconfig:	select HAVE_CONTEXT_TRACKING
arch/x86/Kconfig:	select HAVE_CONTEXT_TRACKING		if X86_64

S390 and X86 are (mostly) complete and use the generic entry code. S390
does not even select HAVE_CONTEXT_TRACKING!

PPC64 has done quite some work to fix that, but it looks not yet complete. 

Mark is working on ARM64.

There is some effort underway to convert MIPS over to generic entry.

The rest needs all the fundamental architecture side changes.

> Anyway, it _looks_ to me like the pattern is very simple:
>
> Step 1:
>  - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
>
> This clearly doesn't change anything at all, but also doesn't fix the
> problem you have. But it's easy to verify that the code is the same
> before-and-after.
>
> Step 2 is the pattern matching step:
>
>  - if the caller of handle_domain_irq() ends up being a function that
> is registered with set_handle_irq(), then we
>    (a) remove the irq_enter/irq_exit from it
>    (b) add it to the architectures that use handle_arch_irq.
>    (c) make sure that if there are other callers of it (not through
> handle_arch_irq) we move that irq_enter/irq_exit into them too
>
> I _suspect_ - but didn't check - that Step 2(c) doesn't actually
> exist. But who knows.

It only exists with chained handlers, but they do not need that at all
because:

        irq_enter()
        arch_handle_irq()
          handle_domain_irq()
            chained_handler()
              handle_domain_irq()

which is still the same interrupt context and not a nested interrupt.

> It really looks like there is a very tight connection between "uses
> handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?

Looks like. That might conflict with the MIPS rework though. I don't
know how far that came already. Cc'ed the MIPS people.

Thanks,

        tglx

Mark Rutland Oct. 12, 2021, 2:02 p.m. UTC | #5

Hi Linus, Thomas,

A couple of minor comments below -- I'll have a go at the rework Linus
suggested and see what blows up.

On Tue, Oct 12, 2021 at 03:18:16PM +0200, Thomas Gleixner wrote:
> On Mon, Oct 11 2021 at 12:54, Linus Torvalds wrote:
> > On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > And so the reason I really hate that patch is that it introduces a new
> > "different architectures randomly and inexplicably do different
> > things, and the generic behavior is very different on arm64 than it is
> > elsewhere".
> >
> > That's just the worst kind of hack to me.
> >
> > And in this case, it's really *horribly* hard to see what the call
> > chain is. It all ends up being actively obfuscated and obscured
> > through that 'handle_arch_irq' function pointer, that is sometimes set
> > through set_handle_irq(), and sometimes set directly.
> >
> > I really think that if the rule is "we can't do accounting in
> > handle_domain_irq(), because it's too late for arm64", then the fix
> > really should be to just not do that.
> >
> > Move the irq_enter()/irq_exit() to the callers - quite possibly far up
> > the call chain to the root of it all, and just say "architecture code
> > needs to do this in the low-level code before calling
> > handle_arch_irq".
> 
> That's where it belongs. It's mandatory to have it there for NOHZ_FULL
> to work correctly vs. instrumentation etc. I've pointed that out back
> then after we fed the X86 entry code into the mincer and added noinstr
> sections to keep tracers, BPF and kprobes away from it.
> 
> Looking at the architectures which "support" that by selecting
> HAVE_CONTEXT_TRACKING:
> 
> arch/arm/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/arm64/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/csky/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/mips/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/powerpc/Kconfig:	select HAVE_CONTEXT_TRACKING		if PPC64
> arch/riscv/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/sparc/Kconfig:	select HAVE_CONTEXT_TRACKING
> arch/x86/Kconfig:	select HAVE_CONTEXT_TRACKING		if X86_64
> 
> S390 and X86 are (mostly) complete and use the generic entry code. S390
> does not even select HAVE_CONTEXT_TRACKING!
> 
> PPC64 has done quite some work to fix that, but it looks not yet complete. 
> 
> Mark is working on ARM64.
> 
> There is some effort underway to convert MIPS over to generic entry.
> 
> The rest needs all the fundamental architecture side changes.
> 
> > Anyway, it _looks_ to me like the pattern is very simple:
> >
> > Step 1:
> >  - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
> >
> > This clearly doesn't change anything at all, but also doesn't fix the
> > problem you have. But it's easy to verify that the code is the same
> > before-and-after.

I'm happy with this in principle. The only reason we didn't go down that
route initially is because the callers are (typically) in the bowels of
arch asm or platform code, they all need to be fixed in one go to avoid
breaking anything, and it's a headache if we collide with any rework
(e.g. MIPS moving to generic entry).

As above, I'll have a go and see what blows up.

It should be possible to have C wrappers for the common cases, and have
the asm call that instead of branching directly to whichever irqchip
handler handle_arch_irq points at.

> > Step 2 is the pattern matching step:
> >
> >  - if the caller of handle_domain_irq() ends up being a function that
> > is registered with set_handle_irq(), then we
> >    (a) remove the irq_enter/irq_exit from it
> >    (b) add it to the architectures that use handle_arch_irq.
> >    (c) make sure that if there are other callers of it (not through
> > handle_arch_irq) we move that irq_enter/irq_exit into them too
> >
> > I _suspect_ - but didn't check - that Step 2(c) doesn't actually
> > exist. But who knows.
> 
> It only exists with chained handlers, but they do not need that at all
> because:
> 
>         irq_enter()
>         arch_handle_irq()
>           handle_domain_irq()
>             chained_handler()
>               handle_domain_irq()

... and this is exactly the shape of case where we need to avoid the
nested calls so as to not break RCU's view of nesting.

> which is still the same interrupt context and not a nested interrupt.
> > It really looks like there is a very tight connection between "uses
> > handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?
> 
> Looks like. That might conflict with the MIPS rework though. I don't
> know how far that came already. Cc'ed the MIPS people.

There's also a bunch of old platforms on arch/arm which have a
hard-coded handler (so not using handle_arch_irq/set_handle_irq()) which
calls handle_domain_irq() -- those can be fixed up.

Thanks,
Mark.

Thomas Gleixner Oct. 12, 2021, 3:39 p.m. UTC | #6

On Tue, Oct 12 2021 at 15:02, Mark Rutland wrote:
> On Tue, Oct 12, 2021 at 03:18:16PM +0200, Thomas Gleixner wrote:
>> On Mon, Oct 11 2021 at 12:54, Linus Torvalds wrote:
> I'm happy with this in principle. The only reason we didn't go down that
> route initially is because the callers are (typically) in the bowels of
> arch asm or platform code, they all need to be fixed in one go to avoid
> breaking anything, and it's a headache if we collide with any rework
> (e.g. MIPS moving to generic entry).

mips-next looks pretty empty vs. that.

>> > It really looks like there is a very tight connection between "uses
>> > handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?
>> 
>> Looks like. That might conflict with the MIPS rework though. I don't
>> know how far that came already. Cc'ed the MIPS people.
>
> There's also a bunch of old platforms on arch/arm which have a
> hard-coded handler (so not using handle_arch_irq/set_handle_irq()) which
> calls handle_domain_irq() -- those can be fixed up.

If that turns out to be ugly, then somehting like the below might be
less horrible as a stop gap.

Thanks,

        tglx
---

--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -268,20 +268,16 @@ void xen_send_IPI_allbutself(int vector)
 
 static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id)
 {
-	irq_enter();
 	generic_smp_call_function_interrupt();
 	inc_irq_stat(irq_call_count);
-	irq_exit();
 
 	return IRQ_HANDLED;
 }
 
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id)
 {
-	irq_enter();
 	generic_smp_call_function_single_interrupt();
 	inc_irq_stat(irq_call_count);
-	irq_exit();
 
 	return IRQ_HANDLED;
 }
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -458,10 +458,8 @@ static void xen_pv_stop_other_cpus(int w
 
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id)
 {
-	irq_enter();
 	irq_work_run();
 	inc_irq_stat(apic_irq_work_irqs);
-	irq_exit();
 
 	return IRQ_HANDLED;
 }
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -33,6 +33,9 @@ config HOTPLUG_SMT
 config GENERIC_ENTRY
        bool
 
+config ARCH_ENTRY_RCU_CLEAN
+       bool
+
 config KPROBES
 	bool "Kprobes"
 	depends on MODULES
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -66,6 +66,7 @@ config X86
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
 	select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
+	select ARCH_ENTRY_RCU_CLEAN
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VIRTUAL
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -677,24 +677,13 @@ int generic_handle_domain_irq(struct irq
 EXPORT_SYMBOL_GPL(generic_handle_domain_irq);
 
 #ifdef CONFIG_HANDLE_DOMAIN_IRQ
-/**
- * handle_domain_irq - Invoke the handler for a HW irq belonging to a domain,
- *                     usually for a root interrupt controller
- * @domain:	The domain where to perform the lookup
- * @hwirq:	The HW irq number to convert to a logical one
- * @regs:	Register file coming from the low-level handling code
- *
- * Returns:	0 on success, or -EINVAL if conversion has failed
- */
-int handle_domain_irq(struct irq_domain *domain,
-		      unsigned int hwirq, struct pt_regs *regs)
+static int __handle_domain_irq(struct irq_domain *domain,
+			       unsigned int hwirq, struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 	int ret = 0;
 
-	irq_enter();
-
 	/* The irqdomain code provides boundary checks */
 	desc = irq_resolve_mapping(domain, hwirq);
 	if (likely(desc))
@@ -702,12 +691,41 @@ int handle_domain_irq(struct irq_domain
 	else
 		ret = -EINVAL;
 
-	irq_exit();
 	set_irq_regs(old_regs);
 	return ret;
 }
 
 /**
+ * handle_domain_irq - Invoke the handler for a HW irq belonging to a domain,
+ *                     usually for a root interrupt controller
+ * @domain:	The domain where to perform the lookup
+ * @hwirq:	The HW irq number to convert to a logical one
+ * @regs:	Register file coming from the low-level handling code
+ *
+ * Returns:	0 on success, or -EINVAL if conversion has failed
+ */
+#ifdef CONFIG_ARCH_ENTRY_RCU_CLEAN
+int handle_domain_irq(struct irq_domain *domain,
+		      unsigned int hwirq, struct pt_regs *regs)
+{
+	__handle_domain_irq(domain, hwirq, regs);
+}
+#else
+int handle_domain_irq(struct irq_domain *domain,
+		      unsigned int hwirq, struct pt_regs *regs)
+{
+	/*
+	 * irq_enter()/exit() has to be done in low level
+	 * architecture code. Bandaid for not yet fixed
+	 * architectures.
+	 */
+	irq_enter();
+	__handle_domain_irq(domain, hwirq, regs);
+	irq_exit();
+}
+#endif
+
+/**
  * handle_domain_nmi - Invoke the handler for a HW irq belonging to a domain
  * @domain:	The domain where to perform the lookup
  * @hwirq:	The HW irq number to convert to a logical one
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -601,6 +601,7 @@ void irq_enter_rcu(void)
 	account_hardirq_enter(current);
 }
 
+#ifndef ARCH_ENTRY_RCU_CLEAN
 /**
  * irq_enter - Enter an interrupt context including RCU update
  */
@@ -609,6 +610,7 @@ void irq_enter(void)
 	rcu_irq_enter();
 	irq_enter_rcu();
 }
+#endif
 
 static inline void tick_irq_exit(void)
 {
@@ -650,6 +652,7 @@ void irq_exit_rcu(void)
 	lockdep_hardirq_exit();
 }
 
+#ifndef ARCH_ENTRY_RCU_CLEAN
 /**
  * irq_exit - Exit an interrupt context, update RCU and lockdep
  *
@@ -662,6 +665,7 @@ void irq_exit(void)
 	 /* must be last! */
 	lockdep_hardirq_exit();
 }
+#endif
 
 /*
  * This function must run with irqs disabled!

Pingfan Liu Oct. 17, 2021, 7:42 a.m. UTC | #7

Hi Linus,

I am not in the Cc list, so just aware of it the day before yesterday.

On Mon, Oct 11, 2021 at 12:54:49PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Sorry; I agree that commit messages don't explain this thoroughly. I can
> > go and rework the commit messages to clarify things, if you'd like?
> 
> So I really hate that patch, even with explanation.
> 
> Having the Kconfig option for "do the right thing" is not how this
> should be fixed.
> 
> Either the generic code is generic, or it isn't.
> 
> And if the generic code doesn't work for arm64, we shouldn't add a
> random Kconfig option for "this architecture wants different
> behavior".
> 
> > The TL;DR is that a bunch of constraints conspire such that we can't
> > defer accounting things to the irqdomain or irqchip code without making
> > this even more complicated/fragile, and many architectures get this
> > subtly wrong today -- it's not that arm64 is necessarily much different
> > from other architectures using this code, just that we're trying to
> > solve this first.
> 
> So then I think the fix is to just say "accounting in this generic
> function is wrong, and the accounting needs to be moved to the
> callers".
> 
> That's particularly true if you think arm64 is only the only actually
> _tested_ case that gets this wrong, and other architectures most
> likely have the exact same issue, but you only fixed it for arm64.
> 
> So do it unconditionally - possibly even using a coccinelle script if
> there are lots of callers.
> 
> Because generic code that just isn't generic, but randomly does
> different things based on subtle Kconfig options that are different
> for different architectures is bad, bad, bad.
> 

When composing the patch, I failed to realize it, but now, I learn.

> And yes, I realize that we've had that kind of stuff before (and we
> have odd Kconfig option testing in that irqdesc.c file elsewhere), but
> the Kconfig options we currently have are mostly either
> 
>  (a) actual real honest-to-goodness config options with semantic
> meaning (ie things like CONFIG_SMP and CONFIG_NUMA)
> 
>  (b) really ugly workarounds for odd special module exports (that
> CONFIG_KVM_BOOK3S_64_HV_MODULE thing for irq_to_desc() that we *tried*
> but failed to remove).
> 
> And so the reason I really hate that patch is that it introduces a new
> "different architectures randomly and inexplicably do different
> things, and the generic behavior is very different on arm64 than it is
> elsewhere".
> 
> That's just the worst kind of hack to me.
> 
> And in this case, it's really *horribly* hard to see what the call
> chain is. It all ends up being actively obfuscated and obscured
> through that 'handle_arch_irq' function pointer, that is sometimes set
> through set_handle_irq(), and sometimes set directly.
> 
> I really think that if the rule is "we can't do accounting in
> handle_domain_irq(), because it's too late for arm64", then the fix
> really should be to just not do that.
> 
> Move the irq_enter()/irq_exit() to the callers - quite possibly far up
> the call chain to the root of it all, and just say "architecture code
> needs to do this in the low-level code before calling
> handle_arch_irq".
> 
> Or, if it turns out that 99% of callers do want it - and don't have
> the issues arm64 has - maybe we can have a helper wrapper that does
> the irq_enter/irq_exit, and another version that doesn't do it, and at
> least it's clear to the callers which one it is. But it looks like
> particularly with the odd indirection through that handle_arch_irq
> function, it's best to just say "this needs to be done".
> 
> What architectures actually care? From what I can follow, it's really
> GENERIC_IRQ_MULTI_HANDLER that ends up doing this all, and then arm64
> has it's own special case for no obvious reason (why isn't it using
> GENERIC_IRQ_MULTI_HANDLER that seems to do the EXACT same thing in
> generic code except for a special "default != NULL" case?)
> 
> Anyway, it _looks_ to me like the pattern is very simple:
> 
> Step 1:
>  - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
> 
> This clearly doesn't change anything at all, but also doesn't fix the
> problem you have. But it's easy to verify that the code is the same
> before-and-after.
> 
> Step 2 is the pattern matching step:
> 
>  - if the caller of handle_domain_irq() ends up being a function that
> is registered with set_handle_irq(), then we
>    (a) remove the irq_enter/irq_exit from it
>    (b) add it to the architectures that use handle_arch_irq.
>    (c) make sure that if there are other callers of it (not through
> handle_arch_irq) we move that irq_enter/irq_exit into them too
> 
> I _suspect_ - but didn't check - that Step 2(c) doesn't actually
> exist. But who knows.
> 
> It really looks like there is a very tight connection between "uses
> handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?
> 
> Is this a bit more work? Yes.
> 
> Would this fix arm64? Yes - it ends up effectively doing what you did.
> 
> Would this fix _other_ architectures doing the same thing that you
> suspect are broken? YES. Instead of leaving them randomly broken.
> 
> And most importantly, it would make the rules for "generic" code
> clear, and actually generic.
> 
> Now, it may be that I'm wrong. I'm willing to be convinced, and if you
> beat me over the head enough I guess I can take that pull request and
> just be unhappy about it.
> 

I had thought if any arch adapts GENERIC_ENTRY, then handle_domain_irq()
can be a bug. As GENERIC_ENTRY is a not _random_ config option, so try
to re-anchor the config depending on GENERIC_ENTRY.

But finally I gave up, since there is no direct link between them at a
glance. And what is more, as you said, "the rules for "generic" code
clear, and actually generic", so it is better to go in that way.

I think Mark has started the work or I will be happy to re-work on these
patches.

Thanks,

	Pingfan

Mark Rutland Oct. 21, 2021, 3:57 p.m. UTC | #8

On Tue, Oct 12, 2021 at 03:02:43PM +0100, Mark Rutland wrote:
> On Tue, Oct 12, 2021 at 03:18:16PM +0200, Thomas Gleixner wrote:
> > On Mon, Oct 11 2021 at 12:54, Linus Torvalds wrote:
> > > On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > And so the reason I really hate that patch is that it introduces a new
> > > "different architectures randomly and inexplicably do different
> > > things, and the generic behavior is very different on arm64 than it is
> > > elsewhere".
> > >
> > > That's just the worst kind of hack to me.
> > >
> > > And in this case, it's really *horribly* hard to see what the call
> > > chain is. It all ends up being actively obfuscated and obscured
> > > through that 'handle_arch_irq' function pointer, that is sometimes set
> > > through set_handle_irq(), and sometimes set directly.
> > >
> > > I really think that if the rule is "we can't do accounting in
> > > handle_domain_irq(), because it's too late for arm64", then the fix
> > > really should be to just not do that.
> > >
> > > Move the irq_enter()/irq_exit() to the callers - quite possibly far up
> > > the call chain to the root of it all, and just say "architecture code
> > > needs to do this in the low-level code before calling
> > > handle_arch_irq".

I've spent the last few days attacking this, and I have a series which
reworks things to pull irq_{enter,exit}() out of the irqchip code and
into arch/entry code where it belongs, removig CONFIG_HANDLE_DOMAIN_IRQ
entirely in the process. I'll post that out soon once I've cleaned up
the commit messages and given it a decent cover letter.

> > > Anyway, it _looks_ to me like the pattern is very simple:
> > >
> > > Step 1:
> > >  - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
> > >
> > > This clearly doesn't change anything at all, but also doesn't fix the
> > > problem you have. But it's easy to verify that the code is the same
> > > before-and-after.
> > >
> > > Step 2 is the pattern matching step:
> > >
> > >  - if the caller of handle_domain_irq() ends up being a function that
> > > is registered with set_handle_irq(), then we
> > >    (a) remove the irq_enter/irq_exit from it
> > >    (b) add it to the architectures that use handle_arch_irq.
> > >    (c) make sure that if there are other callers of it (not through
> > > handle_arch_irq) we move that irq_enter/irq_exit into them too
> > >
> > > I _suspect_ - but didn't check - that Step 2(c) doesn't actually

I had a go with the approach suggested above, but that didn't really
work out and I ended up splitting the problem a different way. Comments
belwo for the sake of posterity.

Attacking this as a per-caller issue is *really* chury, and
interdependencies force you to fix all drivers and all architectures in
one go, which makes it really hard to see the wood for the trees.

The underlying issue was with CONFIG_HANDLE_DOMAIN_IRQ, so just looking
as set_handle_irq (which indicates CONFIG_GENERIC_IRQ_MULTI_HANDLER)
also wasn't sufficient, and I had to go digging through each of the
affected architectures' entry code.

Instead, I've added a temporary shim, migrated each architecture in
turn, then removed the shim and CONFIG_HANDLE_DOMAIN_IRQ entirely, which
also ends up simplifying the drivers a bit.

Thanks,
Mark.

[GIT,PULL] arm64 fixes for 5.15-rc5

Pull-request

Message

Comments