Message ID | 20241016092154.1493035-1-frediano.ziglio@cloud.com (mailing list archive) |
---|---|
Headers | show |
Series | Reuse 32 bit C code more safely | expand |
Hello, Preempting some future work I'm expecting to arrive, I had a go at using __builtin_*() in obj32. This is formed of 2 patches on top of this series: https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/xen-lib32 Patch 1 introduces lib32 beside obj32, with strlen() being the first broken-out function, and patch 2 swaps to __builtin_strlen(). Both compile, but the difference that patch 2 introduces was unexpected. With just lib32, and taking strsubcmp() as an example, we get: 00000000 <strsubcmp>: 0: 83 ec 0c sub $0xc,%esp 3: 89 5c 24 04 mov %ebx,0x4(%esp) 7: 89 74 24 08 mov %esi,0x8(%esp) b: 89 c6 mov %eax,%esi d: 89 d3 mov %edx,%ebx f: 89 d0 mov %edx,%eax 11: /-- e8 fc ff ff ff call 12 <strsubcmp+0x12> 12: R_386_PC32 strlen 16: 89 c1 mov %eax,%ecx 18: 89 da mov %ebx,%edx 1a: 89 f0 mov %esi,%eax 1c: /-- e8 fc ff ff ff call 1d <strsubcmp+0x1d> 1d: R_386_PC32 .text.strncmp 21: 8b 5c 24 04 mov 0x4(%esp),%ebx 25: 8b 74 24 08 mov 0x8(%esp),%esi 29: 83 c4 0c add $0xc,%esp 2c: c3 ret which all seems fine. We get a plain PC32 relocation to strlen (which is now in the separate library). However, with patch 2 in place (simply swapping the plain extern for __builtin_strlen(), we now get: 00000000 <strsubcmp>: 0: 83 ec 0c sub $0xc,%esp 3: 89 1c 24 mov %ebx,(%esp) 6: 89 74 24 04 mov %esi,0x4(%esp) a: 89 7c 24 08 mov %edi,0x8(%esp) e: /-- e8 fc ff ff ff call f <strsubcmp+0xf> f: R_386_PC32 __x86.get_pc_thunk.bx 13: 81 c3 02 00 00 00 add $0x2,%ebx 15: R_386_GOTPC _GLOBAL_OFFSET_TABLE_ 19: 89 c7 mov %eax,%edi 1b: 89 d6 mov %edx,%esi 1d: 89 d0 mov %edx,%eax 1f: /-- e8 fc ff ff ff call 20 <strsubcmp+0x20> 20: R_386_PLT32 strlen 24: 89 c1 mov %eax,%ecx 26: 89 f2 mov %esi,%edx 28: 89 f8 mov %edi,%eax 2a: /-- e8 fc ff ff ff call 2b <strsubcmp+0x2b> 2b: R_386_PC32 .text.strncmp 2f: 8b 1c 24 mov (%esp),%ebx 32: 8b 74 24 04 mov 0x4(%esp),%esi 36: 8b 7c 24 08 mov 0x8(%esp),%edi 3a: 83 c4 0c add $0xc,%esp 3d: c3 ret The builtin hasn't managed to optimise away the call to strlen (that's fine). But, we've ended up spilling %ebx to the stack, calculating the location of the GOT and not using it. So, as it stands, trying to use __builtin_strlen() results in worse code generation. One thing I noticed was that we're not passing -fvisibility=hidden into CFLAGS_x86_32, but fixing that doesn't help either. We do have the pragma from compiler.h, so I'm out of visibility ideas. Anything else I've missed? ~Andrew
On Wed, Oct 16, 2024 at 2:30 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > Hello, > > Preempting some future work I'm expecting to arrive, I had a go at using > __builtin_*() in obj32. > > This is formed of 2 patches on top of this series: > https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/xen-lib32 > You are confident we'll have a lot of shared code to need an additional "lib32" in the Makefile! I would personally stick with obj32. Note that file should be strlen.c, not strlen.32.c, otherwise you are possibly going to pick up the general rule and not the one in the Makefile (but maybe is what you wanted). > Patch 1 introduces lib32 beside obj32, with strlen() being the first > broken-out function, and patch 2 swaps to __builtin_strlen(). > > Both compile, but the difference that patch 2 introduces was unexpected. > > With just lib32, and taking strsubcmp() as an example, we get: > > 00000000 <strsubcmp>: > 0: 83 ec 0c sub $0xc,%esp > 3: 89 5c 24 04 mov %ebx,0x4(%esp) > 7: 89 74 24 08 mov %esi,0x8(%esp) > b: 89 c6 mov %eax,%esi > d: 89 d3 mov %edx,%ebx > f: 89 d0 mov %edx,%eax > 11: /-- e8 fc ff ff ff call 12 <strsubcmp+0x12> > 12: R_386_PC32 strlen > 16: 89 c1 mov %eax,%ecx > 18: 89 da mov %ebx,%edx > 1a: 89 f0 mov %esi,%eax > 1c: /-- e8 fc ff ff ff call 1d <strsubcmp+0x1d> > 1d: R_386_PC32 .text.strncmp > 21: 8b 5c 24 04 mov 0x4(%esp),%ebx > 25: 8b 74 24 08 mov 0x8(%esp),%esi > 29: 83 c4 0c add $0xc,%esp > 2c: c3 ret > > which all seems fine. We get a plain PC32 relocation to strlen (which > is now in the separate library). > > However, with patch 2 in place (simply swapping the plain extern for > __builtin_strlen(), we now get: > > 00000000 <strsubcmp>: > 0: 83 ec 0c sub $0xc,%esp > 3: 89 1c 24 mov %ebx,(%esp) > 6: 89 74 24 04 mov %esi,0x4(%esp) > a: 89 7c 24 08 mov %edi,0x8(%esp) > e: /-- e8 fc ff ff ff call f <strsubcmp+0xf> > f: R_386_PC32 __x86.get_pc_thunk.bx > 13: 81 c3 02 00 00 00 add $0x2,%ebx > 15: R_386_GOTPC _GLOBAL_OFFSET_TABLE_ > 19: 89 c7 mov %eax,%edi > 1b: 89 d6 mov %edx,%esi > 1d: 89 d0 mov %edx,%eax > 1f: /-- e8 fc ff ff ff call 20 <strsubcmp+0x20> > 20: R_386_PLT32 strlen PLT means it not declared hidden, otherwise it would have used the relative relocation. Maybe size_t strlen(const char *s); #define strlen(s) __builtin_strlen(s) xen/compiler.h is included, so all declaration should get the hidden by default ? Or add __attribute__((visibility("hidden"))) explicitly. > 24: 89 c1 mov %eax,%ecx > 26: 89 f2 mov %esi,%edx > 28: 89 f8 mov %edi,%eax > 2a: /-- e8 fc ff ff ff call 2b <strsubcmp+0x2b> > 2b: R_386_PC32 .text.strncmp > 2f: 8b 1c 24 mov (%esp),%ebx > 32: 8b 74 24 04 mov 0x4(%esp),%esi > 36: 8b 7c 24 08 mov 0x8(%esp),%edi > 3a: 83 c4 0c add $0xc,%esp > 3d: c3 ret > > > The builtin hasn't managed to optimise away the call to strlen (that's > fine). But, we've ended up spilling %ebx to the stack, calculating the > location of the GOT and not using it. > Maybe the ABI for PLT is to have %ebx set to the GOT ? Not sure about it. > So, as it stands, trying to use __builtin_strlen() results in worse code > generation. One thing I noticed was that we're not passing > -fvisibility=hidden into CFLAGS_x86_32, but fixing that doesn't help > either. We do have the pragma from compiler.h, so I'm out of visibility > ideas. > The -fvisibility=hidden should be overridden by the xen/compiler.h; but should be overridden with hidden! Maybe strlen is defined by default with another visibility? If you generate the assembly, you should see if the strlen symbol gets the .hidden bless or not. > Anything else I've missed? > Coffee :-) Frediano
On Wed, Oct 16, 2024 at 3:53 PM Frediano Ziglio <frediano.ziglio@cloud.com> wrote: > > On Wed, Oct 16, 2024 at 2:30 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > > > Hello, > > > > Preempting some future work I'm expecting to arrive, I had a go at using > > __builtin_*() in obj32. > > > > This is formed of 2 patches on top of this series: > > https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/xen-lib32 > > > > You are confident we'll have a lot of shared code to need an > additional "lib32" in the Makefile! > I would personally stick with obj32. > Note that file should be strlen.c, not strlen.32.c, otherwise you are > possibly going to pick up the general rule and not the one in the > Makefile (but maybe is what you wanted). > > > Patch 1 introduces lib32 beside obj32, with strlen() being the first > > broken-out function, and patch 2 swaps to __builtin_strlen(). > > > > Both compile, but the difference that patch 2 introduces was unexpected. > > > > With just lib32, and taking strsubcmp() as an example, we get: > > > > 00000000 <strsubcmp>: > > 0: 83 ec 0c sub $0xc,%esp > > 3: 89 5c 24 04 mov %ebx,0x4(%esp) > > 7: 89 74 24 08 mov %esi,0x8(%esp) > > b: 89 c6 mov %eax,%esi > > d: 89 d3 mov %edx,%ebx > > f: 89 d0 mov %edx,%eax > > 11: /-- e8 fc ff ff ff call 12 <strsubcmp+0x12> > > 12: R_386_PC32 strlen > > 16: 89 c1 mov %eax,%ecx > > 18: 89 da mov %ebx,%edx > > 1a: 89 f0 mov %esi,%eax > > 1c: /-- e8 fc ff ff ff call 1d <strsubcmp+0x1d> > > 1d: R_386_PC32 .text.strncmp > > 21: 8b 5c 24 04 mov 0x4(%esp),%ebx > > 25: 8b 74 24 08 mov 0x8(%esp),%esi > > 29: 83 c4 0c add $0xc,%esp > > 2c: c3 ret > > > > which all seems fine. We get a plain PC32 relocation to strlen (which > > is now in the separate library). > > > > However, with patch 2 in place (simply swapping the plain extern for > > __builtin_strlen(), we now get: > > > > 00000000 <strsubcmp>: > > 0: 83 ec 0c sub $0xc,%esp > > 3: 89 1c 24 mov %ebx,(%esp) > > 6: 89 74 24 04 mov %esi,0x4(%esp) > > a: 89 7c 24 08 mov %edi,0x8(%esp) > > e: /-- e8 fc ff ff ff call f <strsubcmp+0xf> > > f: R_386_PC32 __x86.get_pc_thunk.bx > > 13: 81 c3 02 00 00 00 add $0x2,%ebx > > 15: R_386_GOTPC _GLOBAL_OFFSET_TABLE_ > > 19: 89 c7 mov %eax,%edi > > 1b: 89 d6 mov %edx,%esi > > 1d: 89 d0 mov %edx,%eax > > 1f: /-- e8 fc ff ff ff call 20 <strsubcmp+0x20> > > 20: R_386_PLT32 strlen > > PLT means it not declared hidden, otherwise it would have used the > relative relocation. > Maybe > > size_t strlen(const char *s); > #define strlen(s) __builtin_strlen(s) > > xen/compiler.h is included, so all declaration should get the hidden > by default ? Or add __attribute__((visibility("hidden"))) explicitly. > > > 24: 89 c1 mov %eax,%ecx > > 26: 89 f2 mov %esi,%edx > > 28: 89 f8 mov %edi,%eax > > 2a: /-- e8 fc ff ff ff call 2b <strsubcmp+0x2b> > > 2b: R_386_PC32 .text.strncmp > > 2f: 8b 1c 24 mov (%esp),%ebx > > 32: 8b 74 24 04 mov 0x4(%esp),%esi > > 36: 8b 7c 24 08 mov 0x8(%esp),%edi > > 3a: 83 c4 0c add $0xc,%esp > > 3d: c3 ret > > > > > > The builtin hasn't managed to optimise away the call to strlen (that's > > fine). But, we've ended up spilling %ebx to the stack, calculating the > > location of the GOT and not using it. > > > > Maybe the ABI for PLT is to have %ebx set to the GOT ? Not sure about it. > > > So, as it stands, trying to use __builtin_strlen() results in worse code > > generation. One thing I noticed was that we're not passing > > -fvisibility=hidden into CFLAGS_x86_32, but fixing that doesn't help > > either. We do have the pragma from compiler.h, so I'm out of visibility > > ideas. > > > > The -fvisibility=hidden should be overridden by the xen/compiler.h; > but should be overridden with hidden! > Maybe strlen is defined by default with another visibility? > If you generate the assembly, you should see if the strlen symbol gets > the .hidden bless or not. > I did some more experiments, but I didn't manage to fix the issue. It looks like when __builtin_strlen fallback to calling strlen it uses a private declaration of strlen. -mregparam argument is taken into account, but not visibility. I tried to declare strlen as hidden (checking also preprocessor output to see if other declaration were present, none found), still the @PLT. I tried to add a "static inline strlen"... and it was not used. I found this workaround: diff --git a/xen/arch/x86/boot/cmdline.c b/xen/arch/x86/boot/cmdline.c index 80ffd0885e..ac2b0b0a4d 100644 --- a/xen/arch/x86/boot/cmdline.c +++ b/xen/arch/x86/boot/cmdline.c @@ -51,7 +51,12 @@ static const char delim_chars_comma[] = ", \n\r\t"; #define delim_chars (delim_chars_comma + 1) -#define strlen(s) __builtin_strlen(s) +size_t strlen(const char *s); +#define strlen(str) \ + (__extension__ (__builtin_constant_p(str) \ + ? __builtin_strlen(str) \ + : strlen(str))) + static int strncmp(const char *cs, const char *ct, size_t count) { diff --git a/xen/arch/x86/boot/strlen.32.c b/xen/arch/x86/boot/strlen.c similarity index 100% rename from xen/arch/x86/boot/strlen.32.c rename to xen/arch/x86/boot/strlen.c Frediano
On 16.10.2024 16:53, Frediano Ziglio wrote: > On Wed, Oct 16, 2024 at 2:30 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> >> Hello, >> >> Preempting some future work I'm expecting to arrive, I had a go at using >> __builtin_*() in obj32. >> >> This is formed of 2 patches on top of this series: >> https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/xen-lib32 >> > > You are confident we'll have a lot of shared code to need an > additional "lib32" in the Makefile! > I would personally stick with obj32. > Note that file should be strlen.c, not strlen.32.c, otherwise you are > possibly going to pick up the general rule and not the one in the > Makefile (but maybe is what you wanted). > >> Patch 1 introduces lib32 beside obj32, with strlen() being the first >> broken-out function, and patch 2 swaps to __builtin_strlen(). >> >> Both compile, but the difference that patch 2 introduces was unexpected. >> >> With just lib32, and taking strsubcmp() as an example, we get: >> >> 00000000 <strsubcmp>: >> 0: 83 ec 0c sub $0xc,%esp >> 3: 89 5c 24 04 mov %ebx,0x4(%esp) >> 7: 89 74 24 08 mov %esi,0x8(%esp) >> b: 89 c6 mov %eax,%esi >> d: 89 d3 mov %edx,%ebx >> f: 89 d0 mov %edx,%eax >> 11: /-- e8 fc ff ff ff call 12 <strsubcmp+0x12> >> 12: R_386_PC32 strlen >> 16: 89 c1 mov %eax,%ecx >> 18: 89 da mov %ebx,%edx >> 1a: 89 f0 mov %esi,%eax >> 1c: /-- e8 fc ff ff ff call 1d <strsubcmp+0x1d> >> 1d: R_386_PC32 .text.strncmp >> 21: 8b 5c 24 04 mov 0x4(%esp),%ebx >> 25: 8b 74 24 08 mov 0x8(%esp),%esi >> 29: 83 c4 0c add $0xc,%esp >> 2c: c3 ret >> >> which all seems fine. We get a plain PC32 relocation to strlen (which >> is now in the separate library). >> >> However, with patch 2 in place (simply swapping the plain extern for >> __builtin_strlen(), we now get: >> >> 00000000 <strsubcmp>: >> 0: 83 ec 0c sub $0xc,%esp >> 3: 89 1c 24 mov %ebx,(%esp) >> 6: 89 74 24 04 mov %esi,0x4(%esp) >> a: 89 7c 24 08 mov %edi,0x8(%esp) >> e: /-- e8 fc ff ff ff call f <strsubcmp+0xf> >> f: R_386_PC32 __x86.get_pc_thunk.bx >> 13: 81 c3 02 00 00 00 add $0x2,%ebx >> 15: R_386_GOTPC _GLOBAL_OFFSET_TABLE_ >> 19: 89 c7 mov %eax,%edi >> 1b: 89 d6 mov %edx,%esi >> 1d: 89 d0 mov %edx,%eax >> 1f: /-- e8 fc ff ff ff call 20 <strsubcmp+0x20> >> 20: R_386_PLT32 strlen > > PLT means it not declared hidden, otherwise it would have used the > relative relocation. > Maybe > > size_t strlen(const char *s); > #define strlen(s) __builtin_strlen(s) > > xen/compiler.h is included, so all declaration should get the hidden > by default ? Or add __attribute__((visibility("hidden"))) explicitly. > >> 24: 89 c1 mov %eax,%ecx >> 26: 89 f2 mov %esi,%edx >> 28: 89 f8 mov %edi,%eax >> 2a: /-- e8 fc ff ff ff call 2b <strsubcmp+0x2b> >> 2b: R_386_PC32 .text.strncmp >> 2f: 8b 1c 24 mov (%esp),%ebx >> 32: 8b 74 24 04 mov 0x4(%esp),%esi >> 36: 8b 7c 24 08 mov 0x8(%esp),%edi >> 3a: 83 c4 0c add $0xc,%esp >> 3d: c3 ret >> >> >> The builtin hasn't managed to optimise away the call to strlen (that's >> fine). But, we've ended up spilling %ebx to the stack, calculating the >> location of the GOT and not using it. > > Maybe the ABI for PLT is to have %ebx set to the GOT ? Not sure about it. Yes, PIC PLT entries use %ebx. >> So, as it stands, trying to use __builtin_strlen() results in worse code >> generation. One thing I noticed was that we're not passing >> -fvisibility=hidden into CFLAGS_x86_32, but fixing that doesn't help >> either. We do have the pragma from compiler.h, so I'm out of visibility >> ideas. > > The -fvisibility=hidden should be overridden by the xen/compiler.h; > but should be overridden with hidden! > Maybe strlen is defined by default with another visibility? > If you generate the assembly, you should see if the strlen symbol gets > the .hidden bless or not. Only defined symbols have .hidden emitted for them, afaics. But that also doesn't matter much, as it's the emission of the @plt on the calls which requires the setting up of %ebx prior to those calls. (Arguably in 32-bit code, where all addresses are reachable, "hidden" could serve as a hint that calls don't need to go through the PLT. This maybe considered a missed optimization opportunity.) There's no difference there between calling strlen() or e.g. a custom (extern) StrLen(). And btw, reloc.c and cmdline.c currently also compile to code using __x86.get_pc_thunk.* (and @gotoff relocations for data references). Jan