[v1,0/5] arm64: avoid out-of-line ll/sc atomics

Message ID 20190516155344.24060-1-andrew.murray@arm.com (mailing list archive)

Message

Andrew Murray May 16, 2019, 3:53 p.m. UTC
When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
or toolchain doesn't support it, the existing code will fall back to ll/sc
atomics. It achieves this by branching from inline assembly to a function
that is built with special compile flags. Further, this results in the
clobbering of registers even when the fallback isn't used, increasing
register pressure.

Let's improve this by providing inline implementations of both LSE and
ll/sc atomics and using a static key to select between them. This allows
the compiler to generate better atomics code.
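
The shape of the change, as a rough sketch (the helper and key names here
paraphrase patch 3 of the series and the existing hwcap static keys - don't
treat this as the exact code): system_uses_lse_atomics() wraps two static
branches, which is where the pair of branches per atomic in the disassembly
below comes from:

   static inline bool system_uses_lse_atomics(void)
   {
           /* Both branches become nops once LSE support is detected */
           return IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS) &&
                  static_branch_likely(&arm64_const_caps_ready) &&
                  static_branch_likely(&cpu_hwcap_keys[ARM64_HAS_LSE_ATOMICS]);
   }

   static inline void atomic_add(int i, atomic_t *v)
   {
           if (system_uses_lse_atomics())
                   __lse_atomic_add(i, v);         /* inline STADD */
           else
                   __ll_sc_atomic_add(i, v);       /* inline LDXR/STXR loop */
   }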

Build and boot tested, along with atomic_64_test.

The following is the assembly of a function that makes three consecutive
atomic_add calls, built with LSE and this patchset:

Dump of assembler code for function atomics_test:
   0xffff000010084338 <+0>:     stp     x29, x30, [sp, #-32]!
   0xffff00001008433c <+4>:     adrp    x0, 0xffff0000112dd000 <crypto_ft_tab+2368>
   0xffff000010084340 <+8>:     add     x1, x0, #0x6c8
   0xffff000010084344 <+12>:    mov     x29, sp
   0xffff000010084348 <+16>:    ldr     x2, [x1]
   0xffff00001008434c <+20>:    str     x2, [x29, #24]
   0xffff000010084350 <+24>:    mov     x2, #0x0                        // #0
   0xffff000010084354 <+28>:    b       0xffff000010084394 <atomics_test+92>
   0xffff000010084358 <+32>:    b       0xffff000010084394 <atomics_test+92>
   0xffff00001008435c <+36>:    mov     w1, #0x18                       // #24
   0xffff000010084360 <+40>:    add     x2, x29, #0x14
   0xffff000010084364 <+44>:    stadd   w1, [x2]
   0xffff000010084368 <+48>:    b       0xffff0000100843b0 <atomics_test+120>
   0xffff00001008436c <+52>:    b       0xffff0000100843b0 <atomics_test+120>
   0xffff000010084370 <+56>:    mov     w1, #0x18                       // #24
   0xffff000010084374 <+60>:    add     x2, x29, #0x14
   0xffff000010084378 <+64>:    stadd   w1, [x2]
   0xffff00001008437c <+68>:    b       0xffff0000100843cc <atomics_test+148>
   0xffff000010084380 <+72>:    b       0xffff0000100843cc <atomics_test+148>
   0xffff000010084384 <+76>:    mov     w1, #0x18                       // #24
   0xffff000010084388 <+80>:    add     x2, x29, #0x14
   0xffff00001008438c <+84>:    stadd   w1, [x2]
   0xffff000010084390 <+88>:    b       0xffff0000100843e4 <atomics_test+172>
   0xffff000010084394 <+92>:    add     x3, x29, #0x14
   0xffff000010084398 <+96>:    prfm    pstl1strm, [x3]
   0xffff00001008439c <+100>:   ldxr    w1, [x3]
   0xffff0000100843a0 <+104>:   add     w1, w1, #0x18
   0xffff0000100843a4 <+108>:   stxr    w2, w1, [x3]
   0xffff0000100843a8 <+112>:   cbnz    w2, 0xffff00001008439c <atomics_test+100>
   0xffff0000100843ac <+116>:   b       0xffff000010084368 <atomics_test+48>
   0xffff0000100843b0 <+120>:   add     x3, x29, #0x14
   0xffff0000100843b4 <+124>:   prfm    pstl1strm, [x3]
   0xffff0000100843b8 <+128>:   ldxr    w1, [x3]
   0xffff0000100843bc <+132>:   add     w1, w1, #0x18
   0xffff0000100843c0 <+136>:   stxr    w2, w1, [x3]
   0xffff0000100843c4 <+140>:   cbnz    w2, 0xffff0000100843b8 <atomics_test+128>
   0xffff0000100843c8 <+144>:   b       0xffff00001008437c <atomics_test+68>
   0xffff0000100843cc <+148>:   add     x3, x29, #0x14
   0xffff0000100843d0 <+152>:   prfm    pstl1strm, [x3]
   0xffff0000100843d4 <+156>:   ldxr    w1, [x3]
   0xffff0000100843d8 <+160>:   add     w1, w1, #0x18
   0xffff0000100843dc <+164>:   stxr    w2, w1, [x3]
   0xffff0000100843e0 <+168>:   cbnz    w2, 0xffff0000100843d4 <atomics_test+156>
   0xffff0000100843e4 <+172>:   add     x0, x0, #0x6c8
   0xffff0000100843e8 <+176>:   ldr     x1, [x29, #24]
   0xffff0000100843ec <+180>:   ldr     x0, [x0]
   0xffff0000100843f0 <+184>:   eor     x0, x1, x0
   0xffff0000100843f4 <+188>:   cbnz    x0, 0xffff000010084400 <atomics_test+200>
   0xffff0000100843f8 <+192>:   ldp     x29, x30, [sp], #32
   0xffff0000100843fc <+196>:   ret
   0xffff000010084400 <+200>:   bl      0xffff0000100db740 <__stack_chk_fail>
End of assembler dump.

The two branches before each section of atomics relate to the two static
keys, which both become nops when LSE is available. When LSE isn't
available, the branches are used to run the slowpath fallback LL/SC atomics.

Where CONFIG_ARM64_LSE_ATOMICS isn't enabled, the same function is as
follows:

Dump of assembler code for function atomics_test:
   0xffff000010084338 <+0>:     stp     x29, x30, [sp, #-32]!
   0xffff00001008433c <+4>:     adrp    x0, 0xffff00001126d000 <crypto_ft_tab+2368>
   0xffff000010084340 <+8>:     add     x0, x0, #0x6c8
   0xffff000010084344 <+12>:    mov     x29, sp
   0xffff000010084348 <+16>:    add     x3, x29, #0x14
   0xffff00001008434c <+20>:    ldr     x1, [x0]
   0xffff000010084350 <+24>:    str     x1, [x29, #24]
   0xffff000010084354 <+28>:    mov     x1, #0x0                        // #0
   0xffff000010084358 <+32>:    prfm    pstl1strm, [x3]
   0xffff00001008435c <+36>:    ldxr    w1, [x3]
   0xffff000010084360 <+40>:    add     w1, w1, #0x18
   0xffff000010084364 <+44>:    stxr    w2, w1, [x3]
   0xffff000010084368 <+48>:    cbnz    w2, 0xffff00001008435c <atomics_test+36>
   0xffff00001008436c <+52>:    prfm    pstl1strm, [x3]
   0xffff000010084370 <+56>:    ldxr    w1, [x3]
   0xffff000010084374 <+60>:    add     w1, w1, #0x18
   0xffff000010084378 <+64>:    stxr    w2, w1, [x3]
   0xffff00001008437c <+68>:    cbnz    w2, 0xffff000010084370 <atomics_test+56>
   0xffff000010084380 <+72>:    prfm    pstl1strm, [x3]
   0xffff000010084384 <+76>:    ldxr    w1, [x3]
   0xffff000010084388 <+80>:    add     w1, w1, #0x18
   0xffff00001008438c <+84>:    stxr    w2, w1, [x3]
   0xffff000010084390 <+88>:    cbnz    w2, 0xffff000010084384 <atomics_test+76>
   0xffff000010084394 <+92>:    ldr     x1, [x29, #24]
   0xffff000010084398 <+96>:    ldr     x0, [x0]
   0xffff00001008439c <+100>:   eor     x0, x1, x0
   0xffff0000100843a0 <+104>:   cbnz    x0, 0xffff0000100843ac <atomics_test+116>
   0xffff0000100843a4 <+108>:   ldp     x29, x30, [sp], #32
   0xffff0000100843a8 <+112>:   ret
   0xffff0000100843ac <+116>:   bl      0xffff0000100da4f0 <__stack_chk_fail>
End of assembler dump.

These changes add a small amount of bloat on defconfig according to
bloat-o-meter:

text:
  add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
  Total: Before=12363112, After=12631560, chg +2.17%

data:
  add/remove: 0/95 grow/shrink: 2/0 up/down: 40/-3251 (-3211)
  Total: Before=4628123, After=4624912, chg -0.07%
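
(For reference, the figures above come from comparing the before/after
vmlinux images with the in-tree script, roughly:

   ./scripts/bloat-o-meter vmlinux.before vmlinux.after

the separate text/data breakdowns presumably use its section-selection
switches.)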


Andrew Murray (5):
  jump_label: Don't warn on __exit jump entries
  arm64: Use correct ll/sc atomic constraints
  arm64: atomics: avoid out-of-line ll/sc atomics
  arm64: avoid using hard-coded registers for LSE atomics
  arm64: atomics: remove atomic_ll_sc compilation unit

 arch/arm64/include/asm/atomic.h       |  11 +-
 arch/arm64/include/asm/atomic_arch.h  | 154 ++++++++++
 arch/arm64/include/asm/atomic_ll_sc.h | 164 +++++------
 arch/arm64/include/asm/atomic_lse.h   | 395 +++++++++-----------------
 arch/arm64/include/asm/cmpxchg.h      |   2 +-
 arch/arm64/include/asm/lse.h          |  11 -
 arch/arm64/lib/Makefile               |  19 --
 arch/arm64/lib/atomic_ll_sc.c         |   3 -
 kernel/jump_label.c                   |  16 +-
 9 files changed, 375 insertions(+), 400 deletions(-)
 create mode 100644 arch/arm64/include/asm/atomic_arch.h
 delete mode 100644 arch/arm64/lib/atomic_ll_sc.c

Comments

Peter Zijlstra May 17, 2019, 7:24 a.m. UTC | #1
On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> or toolchain doesn't support it, the existing code will fall back to ll/sc
> atomics. It achieves this by branching from inline assembly to a function
> that is built with special compile flags. Further, this results in the
> clobbering of registers even when the fallback isn't used, increasing
> register pressure.
>
> Let's improve this by providing inline implementations of both LSE and
> ll/sc atomics and using a static key to select between them. This allows
> the compiler to generate better atomics code.

Don't you guys have alternatives? That would avoid having both versions
in the code, and thus significantly cuts back on the bloat.

> These changes add a small amount of bloat on defconfig according to
> bloat-o-meter:
> 
> text:
>   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
>   Total: Before=12363112, After=12631560, chg +2.17%

I'd say 2% is quite significant bloat.
Andrew Murray May 17, 2019, 10:08 a.m. UTC | #2
On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > or toolchain doesn't support it, the existing code will fall back to ll/sc
> > atomics. It achieves this by branching from inline assembly to a function
> > that is built with special compile flags. Further, this results in the
> > clobbering of registers even when the fallback isn't used, increasing
> > register pressure.
> >
> > Let's improve this by providing inline implementations of both LSE and
> > ll/sc atomics and using a static key to select between them. This allows
> > the compiler to generate better atomics code.
> 
> Don't you guys have alternatives? That would avoid having both versions
> in the code, and thus significantly cuts back on the bloat.

Yes we do.

Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
ALTERNATIVE to either bl to a fallback ll/sc function (plus nops), or execute
some LSE instructions.

But this approach limits the compiler's ability to optimise the code, due to
the asm clobber list being the superset of both ll/sc and LSE - and the gcc
compiler flags used on the ll/sc functions.
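
For reference, the pre-series pattern looks roughly like this (a simplified
sketch of atomic_lse.h/lse.h, not a verbatim copy):

   /* asm/lse.h (pre-series, roughly) */
   #define ARM64_LSE_ATOMIC_INSN(llsc, lse)                        \
           ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS)
   #define __LL_SC_CLOBBERS        "x16", "x17", "x30"

   /* atomic_lse.h (pre-series, roughly) */
   static inline void atomic_add(int i, atomic_t *v)
   {
           register int w0 asm ("w0") = i;
           register atomic_t *x1 asm ("x1") = v;

           asm volatile(ARM64_LSE_ATOMIC_INSN(
           /* default: branch-and-link to the out-of-line ll/sc function */
           __LL_SC_ATOMIC(add),
           /* patched in when ARM64_HAS_LSE_ATOMICS is detected */
           "       stadd   %w[i], %[v]\n")
           : [i] "+r" (w0), [v] "+Q" (v->counter)
           : "r" (x1)
           : __LL_SC_CLOBBERS);    /* clobbered even when the fallback isn't taken */
   }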

I think the alternative solution (excuse the pun) that you are suggesting
is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
blocks (i.e. drop the fallback branches). However, this still gives us some
bloat (but less than my current solution) because we're now inlining the
larger ll/sc fallbacks whereas previously they were non-inlined functions. We
still end up with potentially unnecessary clobbers for LSE code with this
approach.

Approach prior to this series:

   BL 1 or NOP <- single alternative instruction
   LSE
   LSE
   ...

1: LL/SC <- LL/SC fallback not inlined so reused
   LL/SC
   LL/SC
   LL/SC

Approach proposed by this series:

   BL 1 or NOP <- single alternative instruction
   LSE
   LSE
   BL 2
1: LL/SC <- inlined LL/SC and thus duplicated
   LL/SC
   LL/SC
   LL/SC
2: ..

Approach using alternative without braces:

   LSE
   LSE
   NOP
   NOP

or

   LL/SC <- inlined LL/SC and thus duplicated
   LL/SC
   LL/SC
   LL/SC

I guess there is a balance here between bloat and code optimisation.
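
As a concrete (if hand-waved) sketch of that last "without braces" variant -
operand names are illustrative, and the LSE side has to be nop-padded because
the arm64 ALTERNATIVE macro wants both sequences to be the same length:

   static inline void atomic_add(int i, atomic_t *v)
   {
           unsigned long tmp;
           int res;

           asm volatile(ALTERNATIVE(
           /* default: inline LL/SC */
   "       prfm    pstl1strm, %[v]\n"
   "1:     ldxr    %w[tmp], %[v]\n"
   "       add     %w[tmp], %w[tmp], %w[i]\n"
   "       stxr    %w[res], %w[tmp], %[v]\n"
   "       cbnz    %w[res], 1b",
           /* LSE: one instruction plus padding */
   "       stadd   %w[i], %[v]\n"
   "       nop\n   nop\n   nop\n   nop",
           ARM64_HAS_LSE_ATOMICS)
           : [v] "+Q" (v->counter), [tmp] "=&r" (tmp), [res] "=&r" (res)
           : [i] "r" (i));
   }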

> 
> > These changes add a small amount of bloat on defconfig according to
> > bloat-o-meter:
> > 
> > text:
> >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> >   Total: Before=12363112, After=12631560, chg +2.17%
> 
> I'd say 2% is quite significant bloat.

Thanks,

Andrew Murray
Ard Biesheuvel May 17, 2019, 10:29 a.m. UTC | #3
On Fri, 17 May 2019 at 12:08, Andrew Murray <andrew.murray@arm.com> wrote:
>
> On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > > or toolchain doesn't support it, the existing code will fall back to ll/sc
> > > atomics. It achieves this by branching from inline assembly to a function
> > > that is built with special compile flags. Further, this results in the
> > > clobbering of registers even when the fallback isn't used, increasing
> > > register pressure.
> > >
> > > Let's improve this by providing inline implementations of both LSE and
> > > ll/sc atomics and using a static key to select between them. This allows
> > > the compiler to generate better atomics code.
> >
> > Don't you guys have alternatives? That would avoid having both versions
> > in the code, and thus significantly cuts back on the bloat.
>
> Yes we do.
>
> Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
> ALTERNATIVE to either bl to a fallback ll/sc function (and nops) - or execute
> some LSE instructions.
>
> But this approach limits the compilers ability to optimise the code due to
> the asm clobber list being the superset of both ll/sc and LSE - and the gcc
> compiler flags used on the ll/sc functions.
>
> I think the alternative solution (excuse the pun) that you are suggesting
> is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> blocks (i.e. drop the fallback branches). However this still gives us some
> bloat (but less than my current solution) because we're still now inlining the
> larger fallback ll/sc whereas previously they were non-inline'd functions. We
> still end up with potentially unnecessary clobbers for LSE code with this
> approach.
>
> Approach prior to this series:
>
>    BL 1 or NOP <- single alternative instruction
>    LSE
>    LSE
>    ...
>
> 1: LL/SC <- LL/SC fallback not inlined so reused
>    LL/SC
>    LL/SC
>    LL/SC
>
> Approach proposed by this series:
>
>    BL 1 or NOP <- single alternative instruction
>    LSE
>    LSE
>    BL 2
> 1: LL/SC <- inlined LL/SC and thus duplicated
>    LL/SC
>    LL/SC
>    LL/SC
> 2: ..
>
> Approach using alternative without braces:
>
>    LSE
>    LSE
>    NOP
>    NOP
>
> or
>
>    LL/SC <- inlined LL/SC and thus duplicated
>    LL/SC
>    LL/SC
>    LL/SC
>
> I guess there is a balance here between bloat and code optimisation.
>


So there are two separate questions here:
1) whether or not we should merge the inline asm blocks so that the
compiler sees a single set of constraints and operands
2) whether the LL/SC sequence should be inlined and/or duplicated.

This approach appears to be based on the assumption that reserving one
or sometimes two additional registers for the LL/SC fallback has a
more severe impact on performance than the unconditional branch.
However, it seems to me that any call site that uses the atomics has
to deal with the possibility of either version being invoked, and so
the additional registers need to be freed up in any case. Or am I
missing something?

As for the duplication: a while ago, I suggested an approach [0] using
alternatives and asm subsections, which moved the duplicated LL/SC
fallbacks out of the hot path. This does not remove the bloat, but it
does mitigate its impact on I-cache efficiency when running on
hardware that does not require the fallbacks.


[0] https://lore.kernel.org/linux-arm-kernel/20181113233923.20098-1-ard.biesheuvel@linaro.org/
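
(Roughly the following shape - an illustrative sketch rather than the literal
code from [0]: the LSE instruction stays on the hot path, while the LL/SC
fallback lives in a subsection and is reached and left via direct branches;
operand names here are illustrative:)

   asm volatile(ARM64_LSE_ATOMIC_INSN(
   /* default: branch to the fallback in the subsection */
   "       b       3f",
   /* LSE */
   "       stadd   %w[i], %[v]")
   "\n2:\n"
   "       .subsection     1\n"
   "3:     prfm    pstl1strm, %[v]\n"
   "       ldxr    %w[tmp], %[v]\n"
   "       add     %w[tmp], %w[tmp], %w[i]\n"
   "       stxr    %w[res], %w[tmp], %[v]\n"
   "       cbnz    %w[res], 3b\n"
   "       b       2b\n"
   "       .previous"
   : [v] "+Q" (v->counter), [tmp] "=&r" (tmp), [res] "=&r" (res)
   : [i] "r" (i));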



> >
> > > These changes add a small amount of bloat on defconfig according to
> > > bloat-o-meter:
> > >
> > > text:
> > >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> > >   Total: Before=12363112, After=12631560, chg +2.17%
> >
> > I'd say 2% is quite significant bloat.
>
> Thanks,
>
> Andrew Murray
>
Peter Zijlstra May 17, 2019, 12:05 p.m. UTC | #4
On Fri, May 17, 2019 at 11:08:03AM +0100, Andrew Murray wrote:

> I think the alternative solution (excuse the pun) that you are suggesting
> is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> blocks (i.e. drop the fallback branches). However this still gives us some
> bloat (but less than my current solution) because we're still now inlining the
> larger fallback ll/sc whereas previously they were non-inline'd functions. We
> still end up with potentially unnecessary clobbers for LSE code with this
> Approach prior to this series:

> Approach using alternative without braces:
> 
>    LSE
>    LSE
>    NOP
>    NOP
> 
> or
> 
>    LL/SC <- inlined LL/SC and thus duplicated
>    LL/SC
>    LL/SC
>    LL/SC

Yes that. And if you worry about the extra clobber for LL/SC, you could
always stick a few PUSH/POPs around the LL/SC block. Although I'm not
exactly sure where the x16,x17,x30 clobbers come from; when I look at
the LL/SC code, there aren't any hard-coded regs in there.

Also, the safe approach is to emit LL/SC as the default and only patch
in LSE when you know the machine supports them.
Ard Biesheuvel May 17, 2019, 12:19 p.m. UTC | #5
On Fri, 17 May 2019 at 14:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, May 17, 2019 at 11:08:03AM +0100, Andrew Murray wrote:
>
> > I think the alternative solution (excuse the pun) that you are suggesting
> > is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> > blocks (i.e. drop the fallback branches). However this still gives us some
> > bloat (but less than my current solution) because we're still now inlining the
> > larger fallback ll/sc whereas previously they were non-inline'd functions. We
> > still end up with potentially unnecessary clobbers for LSE code with this
> > Approach prior to this series:
>
> > Approach using alternative without braces:
> >
> >    LSE
> >    LSE
> >    NOP
> >    NOP
> >
> > or
> >
> >    LL/SC <- inlined LL/SC and thus duplicated
> >    LL/SC
> >    LL/SC
> >    LL/SC
>
> Yes that. And if you worry about the extra clobber for LL/SC, you could
> always stick a few PUSH/POPs around the LL/SC block.

Patching in pushes and pops replaces a potential performance hit in
the LSE code with a guaranteed performance hit in the LL/SC code, and
you may end up pushing and popping dead registers. So it would be nice
to see some justification for disproportionately penalizing the LL/SC
code (which will be used on low end cores where stack accesses are
relatively expensive) relative to the LSE code, rather than assuming
that relieving the register pressure on the current hot paths will
result in a measurable performance improvement on LSE systems.

>  Although I'm not
> exactly sure where the x16,x17,x30 clobbers come from; when I look at
> the LL/SC code, there aren't any hard-coded regs in there.
>

The out-of-line LL/SC code is invoked as a function call, and so we
need to preserve x30, which contains the return address.

x16 and x17 are used by the PLT branching code, in case the module
invoking the atomics is too far away from the core kernel for an
ordinary relative branch.

> Also, the safe approach is to emit LL/SC as the default and only patch
> in LSE when you know the machine supports them.
>

Given that it is not only the safe approach, but the only working
approach, we are obviously already doing that both in the old and the
new version of the code.
Andrew Murray May 22, 2019, 10:45 a.m. UTC | #6
On Fri, May 17, 2019 at 12:29:54PM +0200, Ard Biesheuvel wrote:
> On Fri, 17 May 2019 at 12:08, Andrew Murray <andrew.murray@arm.com> wrote:
> >
> > On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > > > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > > > or toolchain doesn't support it, the existing code will fall back to ll/sc
> > > > atomics. It achieves this by branching from inline assembly to a function
> > > > that is built with special compile flags. Further, this results in the
> > > > clobbering of registers even when the fallback isn't used, increasing
> > > > register pressure.
> > > >
> > > > Let's improve this by providing inline implementations of both LSE and
> > > > ll/sc atomics and using a static key to select between them. This allows
> > > > the compiler to generate better atomics code.
> > >
> > > Don't you guys have alternatives? That would avoid having both versions
> > > in the code, and thus significantly cuts back on the bloat.
> >
> > Yes we do.
> >
> > Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
> > ALTERNATIVE to either bl to a fallback ll/sc function (and nops) - or execute
> > some LSE instructions.
> >
> > But this approach limits the compilers ability to optimise the code due to
> > the asm clobber list being the superset of both ll/sc and LSE - and the gcc
> > compiler flags used on the ll/sc functions.
> >
> > I think the alternative solution (excuse the pun) that you are suggesting
> > is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> > blocks (i.e. drop the fallback branches). However this still gives us some
> > bloat (but less than my current solution) because we're still now inlining the
> > larger fallback ll/sc whereas previously they were non-inline'd functions. We
> > still end up with potentially unnecessary clobbers for LSE code with this
> > approach.
> >
> > Approach prior to this series:
> >
> >    BL 1 or NOP <- single alternative instruction
> >    LSE
> >    LSE
> >    ...
> >
> > 1: LL/SC <- LL/SC fallback not inlined so reused
> >    LL/SC
> >    LL/SC
> >    LL/SC
> >
> > Approach proposed by this series:
> >
> >    BL 1 or NOP <- single alternative instruction
> >    LSE
> >    LSE
> >    BL 2
> > 1: LL/SC <- inlined LL/SC and thus duplicated
> >    LL/SC
> >    LL/SC
> >    LL/SC
> > 2: ..
> >
> > Approach using alternative without braces:
> >
> >    LSE
> >    LSE
> >    NOP
> >    NOP
> >
> > or
> >
> >    LL/SC <- inlined LL/SC and thus duplicated
> >    LL/SC
> >    LL/SC
> >    LL/SC
> >
> > I guess there is a balance here between bloat and code optimisation.
> >
> 
> 
> So there are two separate questions here:
> 1) whether or not we should merge the inline asm blocks so that the
> compiler sees a single set of constraints and operands
> 2) whether the LL/SC sequence should be inlined and/or duplicated.
> 
> This approach appears to be based on the assumption that reserving one
> or sometimes two additional registers for the LL/SC fallback has a
> more severe impact on performance than the unconditional branch.
> However, it seems to me that any call site that uses the atomics has
> to deal with the possibility of either version being invoked, and so
> the additional registers need to be freed up in any case. Or am I
> missing something?

Yes, at compile time the compiler doesn't know which atomics path will
be taken, so code has to be generated for both (thus optimisation is
limited). However, due to this approach we no longer use hard-coded
registers or restrict which/how registers can be used, and therefore the
compiler ought to have greater freedom to optimise.

> 
> As for the duplication: a while ago, I suggested an approach [0] using
> alternatives and asm subsections, which moved the duplicated LL/SC
> fallbacks out of the hot path. This does not remove the bloat, but it
> does mitigate its impact on I-cache efficiency when running on
> hardware that does not require the fallbacks.

I've seen this. I guess it's possible to incorporate subsections into the
inline assembly in the __ll_sc_* functions of this series. If we wanted
the ll/sc fallbacks not to be inlined, then I suppose we could put these
functions in their own section to achieve the same goal.

My toolchain knowledge is a little limited here - but in order to use
subsections you require a branch - in this case, does the compiler optimise
across the subsections? If not, then I guess there is no benefit to inlining
the code, in which case you may as well have a branch to a function (in its
own section) and then you get both the icache gain and also avoid bloat.
Does that make any sense?

Thanks,

Andrew Murray

> 
> 
> [0] https://lore.kernel.org/linux-arm-kernel/20181113233923.20098-1-ard.biesheuvel@linaro.org/
> 
> 
> 
> > >
> > > > These changes add a small amount of bloat on defconfig according to
> > > > bloat-o-meter:
> > > >
> > > > text:
> > > >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> > > >   Total: Before=12363112, After=12631560, chg +2.17%
> > >
> > > I'd say 2% is quite significant bloat.
> >
> > Thanks,
> >
> > Andrew Murray
> >
Ard Biesheuvel May 22, 2019, 11:44 a.m. UTC | #7
On Wed, 22 May 2019 at 11:45, Andrew Murray <andrew.murray@arm.com> wrote:
>
> On Fri, May 17, 2019 at 12:29:54PM +0200, Ard Biesheuvel wrote:
> > On Fri, 17 May 2019 at 12:08, Andrew Murray <andrew.murray@arm.com> wrote:
> > >
> > > On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> > > > On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > > > > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > > > > or toolchain doesn't support it, the existing code will fall back to ll/sc
> > > > > atomics. It achieves this by branching from inline assembly to a function
> > > > > that is built with special compile flags. Further, this results in the
> > > > > clobbering of registers even when the fallback isn't used, increasing
> > > > > register pressure.
> > > > >
> > > > > Let's improve this by providing inline implementations of both LSE and
> > > > > ll/sc atomics and using a static key to select between them. This allows
> > > > > the compiler to generate better atomics code.
> > > >
> > > > Don't you guys have alternatives? That would avoid having both versions
> > > > in the code, and thus significantly cuts back on the bloat.
> > >
> > > Yes we do.
> > >
> > > Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
> > > ALTERNATIVE to either bl to a fallback ll/sc function (and nops) - or execute
> > > some LSE instructions.
> > >
> > > But this approach limits the compilers ability to optimise the code due to
> > > the asm clobber list being the superset of both ll/sc and LSE - and the gcc
> > > compiler flags used on the ll/sc functions.
> > >
> > > I think the alternative solution (excuse the pun) that you are suggesting
> > > is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> > > blocks (i.e. drop the fallback branches). However this still gives us some
> > > bloat (but less than my current solution) because we're still now inlining the
> > > larger fallback ll/sc whereas previously they were non-inline'd functions. We
> > > still end up with potentially unnecessary clobbers for LSE code with this
> > > approach.
> > >
> > > Approach prior to this series:
> > >
> > >    BL 1 or NOP <- single alternative instruction
> > >    LSE
> > >    LSE
> > >    ...
> > >
> > > 1: LL/SC <- LL/SC fallback not inlined so reused
> > >    LL/SC
> > >    LL/SC
> > >    LL/SC
> > >
> > > Approach proposed by this series:
> > >
> > >    BL 1 or NOP <- single alternative instruction
> > >    LSE
> > >    LSE
> > >    BL 2
> > > 1: LL/SC <- inlined LL/SC and thus duplicated
> > >    LL/SC
> > >    LL/SC
> > >    LL/SC
> > > 2: ..
> > >
> > > Approach using alternative without braces:
> > >
> > >    LSE
> > >    LSE
> > >    NOP
> > >    NOP
> > >
> > > or
> > >
> > >    LL/SC <- inlined LL/SC and thus duplicated
> > >    LL/SC
> > >    LL/SC
> > >    LL/SC
> > >
> > > I guess there is a balance here between bloat and code optimisation.
> > >
> >
> >
> > So there are two separate questions here:
> > 1) whether or not we should merge the inline asm blocks so that the
> > compiler sees a single set of constraints and operands
> > 2) whether the LL/SC sequence should be inlined and/or duplicated.
> >
> > This approach appears to be based on the assumption that reserving one
> > or sometimes two additional registers for the LL/SC fallback has a
> > more severe impact on performance than the unconditional branch.
> > However, it seems to me that any call site that uses the atomics has
> > to deal with the possibility of either version being invoked, and so
> > the additional registers need to be freed up in any case. Or am I
> > missing something?
>
> Yes at compile time the compiler doesn't know which atomics path will
> be taken so code has to be generated for both (thus optimisation is
> limited). However due to this approach we no longer use hard-coded
> registers or restrict which/how registers can be used and therefore the
> compiler ought to have greater freedom to optimise.
>

Yes, I agree that is an improvement. But that doesn't require the
LL/SC and LSE asm sequences to be distinct.

> >
> > As for the duplication: a while ago, I suggested an approach [0] using
> > alternatives and asm subsections, which moved the duplicated LL/SC
> > fallbacks out of the hot path. This does not remove the bloat, but it
> > does mitigate its impact on I-cache efficiency when running on
> > hardware that does not require the fallbacks.
>
> I've seen this. I guess its possible to incorporate subsections into the
> inline assembly in the __ll_sc_* functions of this series. If we wanted
> the ll/sc fallbacks not to be inlined, then I suppose we can put these
> functions in their own section to achieve the same goal.
>
> My toolchain knowledge is a limited here - but in order to use subsections
> you require a branch - in this case does the compiler optimise across the
> sub sections? If not then I guess there is no benefit to inlining the code
> in which case you may as well have a branch to a function (in its own
> section) and then you get both the icache gain and also avoid bloat. Does
> that make any sense?
>


Not entirely. A function call requires an additional register to be
preserved, and the bl and ret instructions are both indirect branches,
while subsections use direct unconditional branches only.

Another reason we want to get rid of the current approach (and the
reason I looked into it in the first place) is that we are introducing
hidden branches, which affects the reliability of backtraces and this
is an issue for livepatch.

> >
> >
> > [0] https://lore.kernel.org/linux-arm-kernel/20181113233923.20098-1-ard.biesheuvel@linaro.org/
> >
> >
> >
> > > >
> > > > > These changes add a small amount of bloat on defconfig according to
> > > > > bloat-o-meter:
> > > > >
> > > > > text:
> > > > >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> > > > >   Total: Before=12363112, After=12631560, chg +2.17%
> > > >
> > > > I'd say 2% is quite significant bloat.
> > >
> > > Thanks,
> > >
> > > Andrew Murray
> > >
Andrew Murray May 22, 2019, 3:36 p.m. UTC | #8
On Wed, May 22, 2019 at 12:44:35PM +0100, Ard Biesheuvel wrote:
> On Wed, 22 May 2019 at 11:45, Andrew Murray <andrew.murray@arm.com> wrote:
> >
> > On Fri, May 17, 2019 at 12:29:54PM +0200, Ard Biesheuvel wrote:
> > > On Fri, 17 May 2019 at 12:08, Andrew Murray <andrew.murray@arm.com> wrote:
> > > >
> > > > On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > > > > > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > > > > > or toolchain doesn't support it, the existing code will fall back to ll/sc
> > > > > > atomics. It achieves this by branching from inline assembly to a function
> > > > > > that is built with special compile flags. Further, this results in the
> > > > > > clobbering of registers even when the fallback isn't used, increasing
> > > > > > register pressure.
> > > > > >
> > > > > > Let's improve this by providing inline implementations of both LSE and
> > > > > > ll/sc atomics and using a static key to select between them. This allows
> > > > > > the compiler to generate better atomics code.
> > > > >
> > > > > Don't you guys have alternatives? That would avoid having both versions
> > > > > in the code, and thus significantly cuts back on the bloat.
> > > >
> > > > Yes we do.
> > > >
> > > > Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
> > > > ALTERNATIVE to either bl to a fallback ll/sc function (and nops) - or execute
> > > > some LSE instructions.
> > > >
> > > > But this approach limits the compilers ability to optimise the code due to
> > > > the asm clobber list being the superset of both ll/sc and LSE - and the gcc
> > > > compiler flags used on the ll/sc functions.
> > > >
> > > > I think the alternative solution (excuse the pun) that you are suggesting
> > > > is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> > > > blocks (i.e. drop the fallback branches). However this still gives us some
> > > > bloat (but less than my current solution) because we're still now inlining the
> > > > larger fallback ll/sc whereas previously they were non-inline'd functions. We
> > > > still end up with potentially unnecessary clobbers for LSE code with this
> > > > approach.
> > > >
> > > > Approach prior to this series:
> > > >
> > > >    BL 1 or NOP <- single alternative instruction
> > > >    LSE
> > > >    LSE
> > > >    ...
> > > >
> > > > 1: LL/SC <- LL/SC fallback not inlined so reused
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > >
> > > > Approach proposed by this series:
> > > >
> > > >    BL 1 or NOP <- single alternative instruction
> > > >    LSE
> > > >    LSE
> > > >    BL 2
> > > > 1: LL/SC <- inlined LL/SC and thus duplicated
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > > 2: ..
> > > >
> > > > Approach using alternative without braces:
> > > >
> > > >    LSE
> > > >    LSE
> > > >    NOP
> > > >    NOP
> > > >
> > > > or
> > > >
> > > >    LL/SC <- inlined LL/SC and thus duplicated
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > >
> > > > I guess there is a balance here between bloat and code optimisation.
> > > >
> > >
> > >
> > > So there are two separate questions here:
> > > 1) whether or not we should merge the inline asm blocks so that the
> > > compiler sees a single set of constraints and operands
> > > 2) whether the LL/SC sequence should be inlined and/or duplicated.
> > >
> > > This approach appears to be based on the assumption that reserving one
> > > or sometimes two additional registers for the LL/SC fallback has a
> > > more severe impact on performance than the unconditional branch.
> > > However, it seems to me that any call site that uses the atomics has
> > > to deal with the possibility of either version being invoked, and so
> > > the additional registers need to be freed up in any case. Or am I
> > > missing something?
> >
> > Yes at compile time the compiler doesn't know which atomics path will
> > be taken so code has to be generated for both (thus optimisation is
> > limited). However due to this approach we no longer use hard-coded
> > registers or restrict which/how registers can be used and therefore the
> > compiler ought to have greater freedom to optimise.
> >
> 
> Yes, I agree that is an improvement. But that doesn't require the
> LL/SC and LSE asm sequences to be distinct.
> 
> > >
> > > As for the duplication: a while ago, I suggested an approach [0] using
> > > alternatives and asm subsections, which moved the duplicated LL/SC
> > > fallbacks out of the hot path. This does not remove the bloat, but it
> > > does mitigate its impact on I-cache efficiency when running on
> > > hardware that does not require the fallbacks.
> >
> > I've seen this. I guess its possible to incorporate subsections into the
> > inline assembly in the __ll_sc_* functions of this series. If we wanted
> > the ll/sc fallbacks not to be inlined, then I suppose we can put these
> > functions in their own section to achieve the same goal.
> >
> > My toolchain knowledge is a limited here - but in order to use subsections
> > you require a branch - in this case does the compiler optimise across the
> > sub sections? If not then I guess there is no benefit to inlining the code
> > in which case you may as well have a branch to a function (in its own
> > section) and then you get both the icache gain and also avoid bloat. Does
> > that make any sense?
> >
> 
> 
> Not entirely. A function call requires an additional register to be
> preserved, and the bl and ret instructions are both indirect branches,
> while subsections use direct unconditional branches only.
> 
> Another reason we want to get rid of the current approach (and the
> reason I looked into it in the first place) is that we are introducing
> hidden branches, which affects the reliability of backtraces and this
> is an issue for livepatch.

I guess we don't have enough information to determine the performance effect
of this.

I think I'll spend some time comparing the effect of some of these factors
on typical code with objdump to get a better feel for the likely effect
on performance and post my findings.

Thanks for the feedback.

Thanks,

Andrew Murray

> 
> > >
> > >
> > > [0] https://lore.kernel.org/linux-arm-kernel/20181113233923.20098-1-ard.biesheuvel@linaro.org/
> > >
> > >
> > >
> > > > >
> > > > > > These changes add a small amount of bloat on defconfig according to
> > > > > > bloat-o-meter:
> > > > > >
> > > > > > text:
> > > > > >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> > > > > >   Total: Before=12363112, After=12631560, chg +2.17%
> > > > >
> > > > > I'd say 2% is quite significant bloat.
> > > >
> > > > Thanks,
> > > >
> > > > Andrew Murray
> > > >