Message ID | 20180802231753.86336-1-trong@android.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64: build CONFIG_ARM64_LSE_ATOMICS without -ffixed- flags | expand |
On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: > Remove -ffixed- compile flags as they are not required for correct > behavior of out-of-line ll/sc implementations of atomic functions. > Registers (except x0 which contains return value) only need to be > callee-saved, not necessarily fixed. > > For readability purposes, represent "callee-saves-all-registers" calling > convention with a combination of -fcall-used- and -fcall-saved- flags only. > > Signed-off-by: Tri Vo <trong@android.com> Wouldn't this permit the compiler to stack/unstack these registers in the implementations of the out of line LL/SC fallbacks. > --- > arch/arm64/lib/Makefile | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile > index c86b7909ef31..05ba3d276c15 100644 > --- a/arch/arm64/lib/Makefile > +++ b/arch/arm64/lib/Makefile > @@ -11,9 +11,9 @@ lib-y := bitops.o clear_user.o delay.o copy_from_user.o \ > # when supported by the CPU. Result and argument registers are handled > # correctly, based on the function prototype. > lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o > -CFLAGS_atomic_ll_sc.o := -fcall-used-x0 -ffixed-x1 -ffixed-x2 \ > - -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6 \ > - -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9 \ > - -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12 \ > - -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15 \ > - -fcall-saved-x18 > +CFLAGS_atomic_ll_sc.o := -fcall-used-x0 -fcall-saved-x1 \ > + -fcall-saved-x2 -fcall-saved-x3 -fcall-saved-x4 \ > + -fcall-saved-x5 -fcall-saved-x6 -fcall-saved-x7 \ > + -fcall-saved-x8 -fcall-saved-x9 -fcall-saved-x10 \ > + -fcall-saved-x11 -fcall-saved-x12 -fcall-saved-x13 \ > + -fcall-saved-x14 -fcall-saved-x15 -fcall-saved-x18 > -- > 2.18.0.597.ga71716f1ad-goog > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: > > Remove -ffixed- compile flags as they are not required for correct > > behavior of out-of-line ll/sc implementations of atomic functions. > > Registers (except x0 which contains return value) only need to be > > callee-saved, not necessarily fixed. > > > > For readability purposes, represent "callee-saves-all-registers" calling > > convention with a combination of -fcall-used- and -fcall-saved- flags only. > > > > Signed-off-by: Tri Vo <trong@android.com> > > Wouldn't this permit the compiler to stack/unstack these registers in > the implementations of the out of line LL/SC fallbacks. I'm curious to understand more about how this LSE patching works, and what the constraints exactly are. Are there some docs I should read that you can point me to? I see arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line implementations? Does that make arch/arm64/include/asm/atomic.h the out of line definitions? What are the restrictions around preventing the compiler to push/pop those registers?
On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: > On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >> >> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >> > Remove -ffixed- compile flags as they are not required for correct >> > behavior of out-of-line ll/sc implementations of atomic functions. >> > Registers (except x0 which contains return value) only need to be >> > callee-saved, not necessarily fixed. >> > >> > For readability purposes, represent "callee-saves-all-registers" calling >> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >> > >> > Signed-off-by: Tri Vo <trong@android.com> >> >> Wouldn't this permit the compiler to stack/unstack these registers in >> the implementations of the out of line LL/SC fallbacks. > > I'm curious to understand more about how this LSE patching works, and > what the constraints exactly are. Are there some docs I should read > that you can point me to? I see > arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line > implementations? Does that make arch/arm64/include/asm/atomic.h the > out of line definitions? What are the restrictions around preventing > the compiler to push/pop those registers? > The idea is that LSE atomics can each be patched into a bl instruction that calls the appropriate LL/SC fallback, but without the overhead of an ordinary function call. (actually the bl is there by default, but patched into LSE ops on h/w that supports it) So that is why the 'bl' is in an asm() block, and why x30 is used as a temporary/clobber in the LSE sequences: the function call is completely hidden from the compiler. Obviously, for performance, you want to prevent these fallbacks from touching the stack. Note that x16 and x17 are allowed as temporaries: this is due to the fact that function calls may be taken through a PLT entry, which may clobber x16 and x17 (as per the AAPCS).
On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>> >>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>> > Remove -ffixed- compile flags as they are not required for correct >>> > behavior of out-of-line ll/sc implementations of atomic functions. >>> > Registers (except x0 which contains return value) only need to be >>> > callee-saved, not necessarily fixed. >>> > >>> > For readability purposes, represent "callee-saves-all-registers" calling >>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>> > >>> > Signed-off-by: Tri Vo <trong@android.com> >>> >>> Wouldn't this permit the compiler to stack/unstack these registers in >>> the implementations of the out of line LL/SC fallbacks. Yes. However, the number of registers that get stacked/unstacked should remain the same since compiler should only preserve registers that are used by callee (assuming callee saves all). >> >> I'm curious to understand more about how this LSE patching works, and >> what the constraints exactly are. Are there some docs I should read >> that you can point me to? I see >> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >> implementations? Does that make arch/arm64/include/asm/atomic.h the >> out of line definitions? What are the restrictions around preventing >> the compiler to push/pop those registers? >> > > The idea is that LSE atomics can each be patched into a bl instruction > that calls the appropriate LL/SC fallback, but without the overhead of > an ordinary function call. (actually the bl is there by default, but > patched into LSE ops on h/w that supports it) > > So that is why the 'bl' is in an asm() block, and why x30 is used as a > temporary/clobber in the LSE sequences: the function call is > completely hidden from the compiler. Obviously, for performance, you > want to prevent these fallbacks from touching the stack. The function call is hidden from the compiler on the caller side using 'bl' in the asm block. I suspect that is done to make alternative instructions the same byte length (constraints of dynamic patching). However, on the callee side, LL/SC fallbacks (in atomic_ll_sc.h) are implemented as C functions. So the compiler still handles the function call details and register allocation within those functions. Since the caller doesn't save any registers, callee stacks/unstacks all the registers it uses. So, for example, LL/SC fallback implementation of atomic_add preserves 2 registers with or without this change. I might be wrong in my understaing though. > > Note that x16 and x17 are allowed as temporaries: this is due to the > fact that function calls may be taken through a PLT entry, which may > clobber x16 and x17 (as per the AAPCS).
On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: > On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: >> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>> >>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>> > Remove -ffixed- compile flags as they are not required for correct >>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>> > Registers (except x0 which contains return value) only need to be >>>> > callee-saved, not necessarily fixed. >>>> > >>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>> > >>>> > Signed-off-by: Tri Vo <trong@android.com> >>>> >>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>> the implementations of the out of line LL/SC fallbacks. > > Yes. However, the number of registers that get stacked/unstacked > should remain the same since compiler should only preserve registers > that are used by callee (assuming callee saves all). >>> >>> I'm curious to understand more about how this LSE patching works, and >>> what the constraints exactly are. Are there some docs I should read >>> that you can point me to? I see >>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>> out of line definitions? What are the restrictions around preventing >>> the compiler to push/pop those registers? >>> >> >> The idea is that LSE atomics can each be patched into a bl instruction >> that calls the appropriate LL/SC fallback, but without the overhead of >> an ordinary function call. (actually the bl is there by default, but >> patched into LSE ops on h/w that supports it) >> >> So that is why the 'bl' is in an asm() block, and why x30 is used as a >> temporary/clobber in the LSE sequences: the function call is >> completely hidden from the compiler. Obviously, for performance, you >> want to prevent these fallbacks from touching the stack. > > The function call is hidden from the compiler on the caller side > using 'bl' in the asm block. I suspect that is done to make > alternative instructions the same byte length (constraints of dynamic > patching). However, on the callee side, LL/SC fallbacks > (in atomic_ll_sc.h) are implemented as C functions. So the compiler > still handles the function call details and register allocation > within those functions. > > Since the caller doesn't save any registers, callee stacks/unstacks > all the registers it uses. So, for example, LL/SC fallback > implementation of atomic_add preserves 2 registers with or without > this change. > Did you look at the disassembly?
On 3 August 2018 at 21:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: >> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >> <ard.biesheuvel@linaro.org> wrote: >>> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>> >>>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>> > Registers (except x0 which contains return value) only need to be >>>>> > callee-saved, not necessarily fixed. >>>>> > >>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>> > >>>>> > Signed-off-by: Tri Vo <trong@android.com> >>>>> >>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>> the implementations of the out of line LL/SC fallbacks. >> >> Yes. However, the number of registers that get stacked/unstacked >> should remain the same since compiler should only preserve registers >> that are used by callee (assuming callee saves all). >>>> >>>> I'm curious to understand more about how this LSE patching works, and >>>> what the constraints exactly are. Are there some docs I should read >>>> that you can point me to? I see >>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>> out of line definitions? What are the restrictions around preventing >>>> the compiler to push/pop those registers? >>>> >>> >>> The idea is that LSE atomics can each be patched into a bl instruction >>> that calls the appropriate LL/SC fallback, but without the overhead of >>> an ordinary function call. (actually the bl is there by default, but >>> patched into LSE ops on h/w that supports it) >>> >>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>> temporary/clobber in the LSE sequences: the function call is >>> completely hidden from the compiler. Obviously, for performance, you >>> want to prevent these fallbacks from touching the stack. >> >> The function call is hidden from the compiler on the caller side >> using 'bl' in the asm block. I suspect that is done to make >> alternative instructions the same byte length (constraints of dynamic >> patching). However, on the callee side, LL/SC fallbacks >> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >> still handles the function call details and register allocation >> within those functions. >> >> Since the caller doesn't save any registers, callee stacks/unstacks >> all the registers it uses. So, for example, LL/SC fallback >> implementation of atomic_add preserves 2 registers with or without >> this change. >> > > Did you look at the disassembly? Actually, I only just spotted that you only modify x1 .. x7 and that the other registers already use -fcall-saved-xNN Will should comment on why the distinction exists in the first place. Note that a strict reading of the GCC man page suggests that using -fcall-saved-x8 is a bad idea, given that it is the indirect return register. However, given the limited scope of the option here, that probably does not matter in practice. So I guess you need this patch for Clang? Would be better to mention that in the commit log if this is the case.
On Fri, Aug 3, 2018 at 12:34 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 3 August 2018 at 21:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >> On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: >>> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >>> <ard.biesheuvel@linaro.org> wrote: >>>> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>>> >>>>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>>> > Registers (except x0 which contains return value) only need to be >>>>>> > callee-saved, not necessarily fixed. >>>>>> > >>>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>>> > >>>>>> > Signed-off-by: Tri Vo <trong@android.com> >>>>>> >>>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>>> the implementations of the out of line LL/SC fallbacks. >>> >>> Yes. However, the number of registers that get stacked/unstacked >>> should remain the same since compiler should only preserve registers >>> that are used by callee (assuming callee saves all). >>>>> >>>>> I'm curious to understand more about how this LSE patching works, and >>>>> what the constraints exactly are. Are there some docs I should read >>>>> that you can point me to? I see >>>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>>> out of line definitions? What are the restrictions around preventing >>>>> the compiler to push/pop those registers? >>>>> >>>> >>>> The idea is that LSE atomics can each be patched into a bl instruction >>>> that calls the appropriate LL/SC fallback, but without the overhead of >>>> an ordinary function call. (actually the bl is there by default, but >>>> patched into LSE ops on h/w that supports it) >>>> >>>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>>> temporary/clobber in the LSE sequences: the function call is >>>> completely hidden from the compiler. Obviously, for performance, you >>>> want to prevent these fallbacks from touching the stack. >>> >>> The function call is hidden from the compiler on the caller side >>> using 'bl' in the asm block. I suspect that is done to make >>> alternative instructions the same byte length (constraints of dynamic >>> patching). However, on the callee side, LL/SC fallbacks >>> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >>> still handles the function call details and register allocation >>> within those functions. >>> >>> Since the caller doesn't save any registers, callee stacks/unstacks >>> all the registers it uses. So, for example, LL/SC fallback >>> implementation of atomic_add preserves 2 registers with or without >>> this change. >>> >> >> Did you look at the disassembly? Yes, I did. > > Actually, I only just spotted that you only modify x1 .. x7 and that > the other registers already use -fcall-saved-xNN > > Will should comment on why the distinction exists in the first place. > Note that a strict reading of the GCC man page suggests that using > -fcall-saved-x8 is a bad idea, given that it is the indirect return > register. However, given the limited scope of the option here, that > probably does not matter in practice. > > So I guess you need this patch for Clang? Would be better to mention > that in the commit log if this is the case. Yes, I am planning to fold whatever compiler support is needed for LSE atomics into "callee-saves-all-registers" calling convention (without ffixed flags). I'll then implement that calling convention in Clang. I'll mention that in the commit message.
On 4 August 2018 at 03:05, Tri Vo <trong@android.com> wrote: > On Fri, Aug 3, 2018 at 12:34 PM, Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: >> On 3 August 2018 at 21:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>> On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: >>>> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >>>> <ard.biesheuvel@linaro.org> wrote: >>>>> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>>>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>>>> >>>>>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>>>> > Registers (except x0 which contains return value) only need to be >>>>>>> > callee-saved, not necessarily fixed. >>>>>>> > >>>>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>>>> > >>>>>>> > Signed-off-by: Tri Vo <trong@android.com> >>>>>>> >>>>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>>>> the implementations of the out of line LL/SC fallbacks. >>>> >>>> Yes. However, the number of registers that get stacked/unstacked >>>> should remain the same since compiler should only preserve registers >>>> that are used by callee (assuming callee saves all). >>>>>> >>>>>> I'm curious to understand more about how this LSE patching works, and >>>>>> what the constraints exactly are. Are there some docs I should read >>>>>> that you can point me to? I see >>>>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>>>> out of line definitions? What are the restrictions around preventing >>>>>> the compiler to push/pop those registers? >>>>>> >>>>> >>>>> The idea is that LSE atomics can each be patched into a bl instruction >>>>> that calls the appropriate LL/SC fallback, but without the overhead of >>>>> an ordinary function call. (actually the bl is there by default, but >>>>> patched into LSE ops on h/w that supports it) >>>>> >>>>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>>>> temporary/clobber in the LSE sequences: the function call is >>>>> completely hidden from the compiler. Obviously, for performance, you >>>>> want to prevent these fallbacks from touching the stack. >>>> >>>> The function call is hidden from the compiler on the caller side >>>> using 'bl' in the asm block. I suspect that is done to make >>>> alternative instructions the same byte length (constraints of dynamic >>>> patching). However, on the callee side, LL/SC fallbacks >>>> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >>>> still handles the function call details and register allocation >>>> within those functions. >>>> >>>> Since the caller doesn't save any registers, callee stacks/unstacks >>>> all the registers it uses. So, for example, LL/SC fallback >>>> implementation of atomic_add preserves 2 registers with or without >>>> this change. >>>> >>> >>> Did you look at the disassembly? > > Yes, I did. 0000000000000000 <__ll_sc_atomic_add>: 0: f9800031 prfm pstl1strm, [x1] 4: 885f7c31 ldxr w17, [x1] 8: 0b000231 add w17, w17, w0 c: 88107c31 stxr w16, w17, [x1] 10: 35ffffb0 cbnz w16, 4 <__ll_sc_atomic_add+0x4> 14: d65f03c0 ret Now where does it preserve 2 registers? >> >> Actually, I only just spotted that you only modify x1 .. x7 and that >> the other registers already use -fcall-saved-xNN >> >> Will should comment on why the distinction exists in the first place. >> Note that a strict reading of the GCC man page suggests that using >> -fcall-saved-x8 is a bad idea, given that it is the indirect return >> register. However, given the limited scope of the option here, that >> probably does not matter in practice. >> >> So I guess you need this patch for Clang? Would be better to mention >> that in the commit log if this is the case. > > Yes, I am planning to fold whatever compiler support is needed for LSE > atomics into "callee-saves-all-registers" calling convention (without > ffixed flags). I'll then implement that calling convention in Clang. > I'll mention that in the commit message. Not *all* registers. x16 and x17 are excluded in this case, and I am not sure whether you would to encode that in an option. What is wrong with using -ffixed or -fcall-saved?
On Sat, Aug 4, 2018 at 12:21 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 4 August 2018 at 03:05, Tri Vo <trong@android.com> wrote: >> On Fri, Aug 3, 2018 at 12:34 PM, Ard Biesheuvel >> <ard.biesheuvel@linaro.org> wrote: >>> On 3 August 2018 at 21:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>> On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: >>>>> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >>>>> <ard.biesheuvel@linaro.org> wrote: >>>>>> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>>>>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>>>>> >>>>>>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>>>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>>>>> > Registers (except x0 which contains return value) only need to be >>>>>>>> > callee-saved, not necessarily fixed. >>>>>>>> > >>>>>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>>>>> > >>>>>>>> > Signed-off-by: Tri Vo <trong@android.com> >>>>>>>> >>>>>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>>>>> the implementations of the out of line LL/SC fallbacks. >>>>> >>>>> Yes. However, the number of registers that get stacked/unstacked >>>>> should remain the same since compiler should only preserve registers >>>>> that are used by callee (assuming callee saves all). >>>>>>> >>>>>>> I'm curious to understand more about how this LSE patching works, and >>>>>>> what the constraints exactly are. Are there some docs I should read >>>>>>> that you can point me to? I see >>>>>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>>>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>>>>> out of line definitions? What are the restrictions around preventing >>>>>>> the compiler to push/pop those registers? >>>>>>> >>>>>> >>>>>> The idea is that LSE atomics can each be patched into a bl instruction >>>>>> that calls the appropriate LL/SC fallback, but without the overhead of >>>>>> an ordinary function call. (actually the bl is there by default, but >>>>>> patched into LSE ops on h/w that supports it) >>>>>> >>>>>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>>>>> temporary/clobber in the LSE sequences: the function call is >>>>>> completely hidden from the compiler. Obviously, for performance, you >>>>>> want to prevent these fallbacks from touching the stack. >>>>> >>>>> The function call is hidden from the compiler on the caller side >>>>> using 'bl' in the asm block. I suspect that is done to make >>>>> alternative instructions the same byte length (constraints of dynamic >>>>> patching). However, on the callee side, LL/SC fallbacks >>>>> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >>>>> still handles the function call details and register allocation >>>>> within those functions. >>>>> >>>>> Since the caller doesn't save any registers, callee stacks/unstacks >>>>> all the registers it uses. So, for example, LL/SC fallback >>>>> implementation of atomic_add preserves 2 registers with or without >>>>> this change. >>>>> >>>> >>>> Did you look at the disassembly? >> >> Yes, I did. > > 0000000000000000 <__ll_sc_atomic_add>: > 0: f9800031 prfm pstl1strm, [x1] > 4: 885f7c31 ldxr w17, [x1] > 8: 0b000231 add w17, w17, w0 > c: 88107c31 stxr w16, w17, [x1] > 10: 35ffffb0 cbnz w16, 4 <__ll_sc_atomic_add+0x4> > 14: d65f03c0 ret > > Now where does it preserve 2 registers? I tested the change on a Hikey 4.9 kernel. The disassembly there looks like this: 0000000000000000 <__ll_sc_atomic_add>: 0: a9be7bfd stp x29, x30, [sp, #-32]! 4: 910003fd mov x29, sp 8: a90127e8 stp x8, x9, [sp, #16] c: 2a0003e9 mov w9, w0 10: aa0103e8 mov x8, x1 14: aa1e03e0 mov x0, x30 18: 94000000 bl 0 <_mcount> 1c: f9800111 prfm pstl1strm, [x8] 20: 885f7d00 ldxr w0, [x8] 24: 0b090000 add w0, w0, w9 28: 88107d00 stxr w16, w0, [x8] 2c: 35ffffb0 cbnz w16, 20 <__ll_sc_atomic_add+0x20> 30: a94127e8 ldp x8, x9, [sp, #16] 34: a8c27bfd ldp x29, x30, [sp], #32 38: d65f03c0 ret 3c: d503201f nop Excluding FP, SP, there are 2 registers x8, x9 preserved. > >>> >>> Actually, I only just spotted that you only modify x1 .. x7 and that >>> the other registers already use -fcall-saved-xNN >>> >>> Will should comment on why the distinction exists in the first place. >>> Note that a strict reading of the GCC man page suggests that using >>> -fcall-saved-x8 is a bad idea, given that it is the indirect return >>> register. However, given the limited scope of the option here, that >>> probably does not matter in practice. >>> >>> So I guess you need this patch for Clang? Would be better to mention >>> that in the commit log if this is the case. >> >> Yes, I am planning to fold whatever compiler support is needed for LSE >> atomics into "callee-saves-all-registers" calling convention (without >> ffixed flags). I'll then implement that calling convention in Clang. >> I'll mention that in the commit message. > > Not *all* registers. x16 and x17 are excluded in this case, and I am > not sure whether you would to encode that in an option. > > What is wrong with using -ffixed or -fcall-saved? Looking at your disassembly snippet, usage of -ffixed flags makes sense to me now. They prevent argument registers from being unnecessarily stacked/unstacked if compiler uses x16, x17 (which don't need to be saved) for allocation. This patch will indeed cause a performance regression, although not on all build environments. Sorry for the confusion.
On 6 August 2018 at 23:54, Tri Vo <trong@android.com> wrote: > On Sat, Aug 4, 2018 at 12:21 AM, Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: >> On 4 August 2018 at 03:05, Tri Vo <trong@android.com> wrote: >>> On Fri, Aug 3, 2018 at 12:34 PM, Ard Biesheuvel >>> <ard.biesheuvel@linaro.org> wrote: >>>> On 3 August 2018 at 21:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>> On 3 August 2018 at 21:08, Tri Vo <trong@android.com> wrote: >>>>>> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >>>>>> <ard.biesheuvel@linaro.org> wrote: >>>>>>> On 3 August 2018 at 19:17, Nick Desaulniers <ndesaulniers@google.com> wrote: >>>>>>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>>>>>> >>>>>>>>> On 3 August 2018 at 01:17, Tri Vo <trong@android.com> wrote: >>>>>>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>>>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>>>>>> > Registers (except x0 which contains return value) only need to be >>>>>>>>> > callee-saved, not necessarily fixed. >>>>>>>>> > >>>>>>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>>>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>>>>>> > >>>>>>>>> > Signed-off-by: Tri Vo <trong@android.com> >>>>>>>>> >>>>>>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>>>>>> the implementations of the out of line LL/SC fallbacks. >>>>>> >>>>>> Yes. However, the number of registers that get stacked/unstacked >>>>>> should remain the same since compiler should only preserve registers >>>>>> that are used by callee (assuming callee saves all). >>>>>>>> >>>>>>>> I'm curious to understand more about how this LSE patching works, and >>>>>>>> what the constraints exactly are. Are there some docs I should read >>>>>>>> that you can point me to? I see >>>>>>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>>>>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>>>>>> out of line definitions? What are the restrictions around preventing >>>>>>>> the compiler to push/pop those registers? >>>>>>>> >>>>>>> >>>>>>> The idea is that LSE atomics can each be patched into a bl instruction >>>>>>> that calls the appropriate LL/SC fallback, but without the overhead of >>>>>>> an ordinary function call. (actually the bl is there by default, but >>>>>>> patched into LSE ops on h/w that supports it) >>>>>>> >>>>>>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>>>>>> temporary/clobber in the LSE sequences: the function call is >>>>>>> completely hidden from the compiler. Obviously, for performance, you >>>>>>> want to prevent these fallbacks from touching the stack. >>>>>> >>>>>> The function call is hidden from the compiler on the caller side >>>>>> using 'bl' in the asm block. I suspect that is done to make >>>>>> alternative instructions the same byte length (constraints of dynamic >>>>>> patching). However, on the callee side, LL/SC fallbacks >>>>>> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >>>>>> still handles the function call details and register allocation >>>>>> within those functions. >>>>>> >>>>>> Since the caller doesn't save any registers, callee stacks/unstacks >>>>>> all the registers it uses. So, for example, LL/SC fallback >>>>>> implementation of atomic_add preserves 2 registers with or without >>>>>> this change. >>>>>> >>>>> >>>>> Did you look at the disassembly? >>> >>> Yes, I did. >> >> 0000000000000000 <__ll_sc_atomic_add>: >> 0: f9800031 prfm pstl1strm, [x1] >> 4: 885f7c31 ldxr w17, [x1] >> 8: 0b000231 add w17, w17, w0 >> c: 88107c31 stxr w16, w17, [x1] >> 10: 35ffffb0 cbnz w16, 4 <__ll_sc_atomic_add+0x4> >> 14: d65f03c0 ret >> >> Now where does it preserve 2 registers? > > I tested the change on a Hikey 4.9 kernel. The disassembly there > looks like this: > 0000000000000000 <__ll_sc_atomic_add>: > 0: a9be7bfd stp x29, x30, [sp, #-32]! > 4: 910003fd mov x29, sp > 8: a90127e8 stp x8, x9, [sp, #16] > c: 2a0003e9 mov w9, w0 > 10: aa0103e8 mov x8, x1 > 14: aa1e03e0 mov x0, x30 > 18: 94000000 bl 0 <_mcount> > 1c: f9800111 prfm pstl1strm, [x8] > 20: 885f7d00 ldxr w0, [x8] > 24: 0b090000 add w0, w0, w9 > 28: 88107d00 stxr w16, w0, [x8] > 2c: 35ffffb0 cbnz w16, 20 <__ll_sc_atomic_add+0x20> > 30: a94127e8 ldp x8, x9, [sp, #16] > 34: a8c27bfd ldp x29, x30, [sp], #32 > 38: d65f03c0 ret > 3c: d503201f nop > > Excluding FP, SP, there are 2 registers x8, x9 preserved. This is because you have ftrace enabled on an old kernel. We now disable all instrumentation for this file, and only the 'fetch' variants now preserve/restore a single register. I sent out a patch for that. >> >>>> >>>> Actually, I only just spotted that you only modify x1 .. x7 and that >>>> the other registers already use -fcall-saved-xNN >>>> >>>> Will should comment on why the distinction exists in the first place. >>>> Note that a strict reading of the GCC man page suggests that using >>>> -fcall-saved-x8 is a bad idea, given that it is the indirect return >>>> register. However, given the limited scope of the option here, that >>>> probably does not matter in practice. >>>> >>>> So I guess you need this patch for Clang? Would be better to mention >>>> that in the commit log if this is the case. >>> >>> Yes, I am planning to fold whatever compiler support is needed for LSE >>> atomics into "callee-saves-all-registers" calling convention (without >>> ffixed flags). I'll then implement that calling convention in Clang. >>> I'll mention that in the commit message. >> >> Not *all* registers. x16 and x17 are excluded in this case, and I am >> not sure whether you would to encode that in an option. >> >> What is wrong with using -ffixed or -fcall-saved? > > Looking at your disassembly snippet, usage of -ffixed flags makes > sense to me now. They prevent argument registers from being > unnecessarily stacked/unstacked if compiler uses x16, x17 (which > don't need to be saved) for allocation. > > This patch will indeed cause a performance regression, although not > on all build environments. Sorry for the confusion. No worries, I don't think that is actually the case. But having a callee-saves-all-registers option is not going to fly either, so sticking with -ffixed or -fcall-saved is probably better.
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile index c86b7909ef31..05ba3d276c15 100644 --- a/arch/arm64/lib/Makefile +++ b/arch/arm64/lib/Makefile @@ -11,9 +11,9 @@ lib-y := bitops.o clear_user.o delay.o copy_from_user.o \ # when supported by the CPU. Result and argument registers are handled # correctly, based on the function prototype. lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o -CFLAGS_atomic_ll_sc.o := -fcall-used-x0 -ffixed-x1 -ffixed-x2 \ - -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6 \ - -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9 \ - -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12 \ - -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15 \ - -fcall-saved-x18 +CFLAGS_atomic_ll_sc.o := -fcall-used-x0 -fcall-saved-x1 \ + -fcall-saved-x2 -fcall-saved-x3 -fcall-saved-x4 \ + -fcall-saved-x5 -fcall-saved-x6 -fcall-saved-x7 \ + -fcall-saved-x8 -fcall-saved-x9 -fcall-saved-x10 \ + -fcall-saved-x11 -fcall-saved-x12 -fcall-saved-x13 \ + -fcall-saved-x14 -fcall-saved-x15 -fcall-saved-x18
Remove -ffixed- compile flags as they are not required for correct behavior of out-of-line ll/sc implementations of atomic functions. Registers (except x0 which contains return value) only need to be callee-saved, not necessarily fixed. For readability purposes, represent "callee-saves-all-registers" calling convention with a combination of -fcall-used- and -fcall-saved- flags only. Signed-off-by: Tri Vo <trong@android.com> --- arch/arm64/lib/Makefile | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) -- 2.18.0.597.ga71716f1ad-goog