Message ID | 20200311180416.6509-5-richard.henderson@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64: gcc asm flag outputs | expand |
On 2020-03-11 6:04 pm, Richard Henderson wrote: > Uses of __range_ok almost always feed a branch. > This allows the compiler to use flags directly. > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org> > --- > arch/arm64/include/asm/uaccess.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h > index fe3dd70e901e..ca1acd7b95c3 100644 > --- a/arch/arm64/include/asm/uaccess.h > +++ b/arch/arm64/include/asm/uaccess.h > @@ -22,6 +22,7 @@ > #include <asm/ptrace.h> > #include <asm/memory.h> > #include <asm/extable.h> > +#include <asm/ccset.h> > > #define get_fs() (current_thread_info()->addr_limit) > > @@ -86,10 +87,10 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si > // comes from the carry in being clear. Otherwise, we are > // testing X' - C == 0, subject to the previous adjustments. > " sbcs xzr, %[addr], %[limit]\n" > - " cset %[addr], ls\n" > - : [addr] "=&r" (ret), [limit] "+r" (limit) > + CC_SET(ls) > + : [addr] "=&r" (addr), [limit] "+r" (limit), CC_OUT(ls) (ret) ...and we've immediately undone any benefit of the previous patch by effectively recoupling %0 with addr again :/ I don't entirely follow why, not least because this CC_SET/CC_OUT business is virtually unreadable. At the very least the macro should at least take an operand identifier as an explicit argument rather than secretly generating one, so that we're not all scratching our heads wondering how it can possibly work at all. Furthermore, if the associated C variable is a 32-bit or smaller type, then won't it provoke warnings from Clang due to the operand lacking the "w" modifier? Robin. > : [size] "Ir" (size), [addr_in] "r" (addr) > - : "cc"); > + : CC_CLOBBER); > > return ret; > } >
On Wed, Mar 11, 2020 at 11:04:14AM -0700, Richard Henderson wrote: > Uses of __range_ok almost always feed a branch. > This allows the compiler to use flags directly. If we want to give hte compiler the most freedom, the best thing would be to write this in C. IIUC this code is written in assembly largely for historical reasons, and the comment above says: | This is equivalent to the following test: | (u65)addr + (u65)size <= (u65)current->addr_limit + 1 ... which e.g. we could write as: (__uint128_t)addr + (__uint128_t)size <= (__uint128_t)limit + 1; ... which would be much clearer than the assembly. Is there any pattern like that for which the compiler generates reasonable looking code, and if not, is that something that could be improved compiler side? Thanks, Mark. > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org> > --- > arch/arm64/include/asm/uaccess.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h > index fe3dd70e901e..ca1acd7b95c3 100644 > --- a/arch/arm64/include/asm/uaccess.h > +++ b/arch/arm64/include/asm/uaccess.h > @@ -22,6 +22,7 @@ > #include <asm/ptrace.h> > #include <asm/memory.h> > #include <asm/extable.h> > +#include <asm/ccset.h> > > #define get_fs() (current_thread_info()->addr_limit) > > @@ -86,10 +87,10 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si > // comes from the carry in being clear. Otherwise, we are > // testing X' - C == 0, subject to the previous adjustments. > " sbcs xzr, %[addr], %[limit]\n" > - " cset %[addr], ls\n" > - : [addr] "=&r" (ret), [limit] "+r" (limit) > + CC_SET(ls) > + : [addr] "=&r" (addr), [limit] "+r" (limit), CC_OUT(ls) (ret) > : [size] "Ir" (size), [addr_in] "r" (addr) > - : "cc"); > + : CC_CLOBBER); > > return ret; > } > -- > 2.20.1 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On 2020-03-13 11:04 am, Mark Rutland wrote: > On Wed, Mar 11, 2020 at 11:04:14AM -0700, Richard Henderson wrote: >> Uses of __range_ok almost always feed a branch. >> This allows the compiler to use flags directly. > > If we want to give hte compiler the most freedom, the best thing would > be to write this in C. IIUC this code is written in assembly largely for > historical reasons, and the comment above says: > > | This is equivalent to the following test: > | (u65)addr + (u65)size <= (u65)current->addr_limit + 1 > > ... which e.g. we could write as: > > (__uint128_t)addr + (__uint128_t)size <= (__uint128_t)limit + 1; > > ... which would be much clearer than the assembly. > > Is there any pattern like that for which the compiler generates > reasonable looking code, and if not, is that something that could be > improved compiler side? Hmm, in fact this: __uint128_t tmp = (__uint128_t)(unsigned long)addr + size; return !tmp || tmp - 1 <= limit; generates code that looks like utter crap in isolation[1], but once inlined actually leads to a modest overall reduction (-0.09%) across the whole of vmlinux according to bloat-o-meter (presumably most of those branches roll up into the overall "if(access_ok())..." control flow for the typical case, and I'm sure size and limit get constant-folded a lot). IIRC at the time there were so many uncertainties flying around that sticking with asm to take compiler unknowns out of the picture felt desirable, but perhaps the time might be nigh to retire my baby after all... I'll investigate a bit further. Robin. [1]: 0000000000000000 <range_ok>: 0: ab010000 adds x0, x0, x1 4: 9a9f37e3 cset x3, cs // cs = hs, nlast 8: aa030001 orr x1, x0, x3 c: b4000161 cbz x1, 38 <range_ok+0x38> 10: f1000401 subs x1, x0, #0x1 14: d2800020 mov x0, #0x1 // #1 18: da1f0063 sbc x3, x3, xzr 1c: b4000063 cbz x3, 28 <range_ok+0x28> 20: d2800000 mov x0, #0x0 // #0 24: d65f03c0 ret 28: eb02003f cmp x1, x2 2c: 54ffffc9 b.ls 24 <range_ok+0x24> // b.plast 30: d2800000 mov x0, #0x0 // #0 34: 17fffffc b 24 <range_ok+0x24> 38: d2800020 mov x0, #0x1 // #1 3c: d65f03c0 ret
On Fri, Mar 13, 2020 at 04:51:28PM +0000, Robin Murphy wrote: > On 2020-03-13 11:04 am, Mark Rutland wrote: > > On Wed, Mar 11, 2020 at 11:04:14AM -0700, Richard Henderson wrote: > > > Uses of __range_ok almost always feed a branch. > > > This allows the compiler to use flags directly. > > > > If we want to give hte compiler the most freedom, the best thing would > > be to write this in C. IIUC this code is written in assembly largely for > > historical reasons, and the comment above says: > > > > | This is equivalent to the following test: > > | (u65)addr + (u65)size <= (u65)current->addr_limit + 1 > > > > ... which e.g. we could write as: > > > > (__uint128_t)addr + (__uint128_t)size <= (__uint128_t)limit + 1; > > > > ... which would be much clearer than the assembly. > > > > Is there any pattern like that for which the compiler generates > > reasonable looking code, and if not, is that something that could be > > improved compiler side? > > Hmm, in fact this: > > __uint128_t tmp = (__uint128_t)(unsigned long)addr + size; > return !tmp || tmp - 1 <= limit; > > generates code that looks like utter crap in isolation[1], but once inlined > actually leads to a modest overall reduction (-0.09%) across the whole of > vmlinux according to bloat-o-meter (presumably most of those branches roll > up into the overall "if(access_ok())..." control flow for the typical case, > and I'm sure size and limit get constant-folded a lot). Neat. IIUC for a non-zero size the !tmp check can be elided, and for a constant size the subtraction can be folded in at compile time, so for a {get,put}_user(), the compiler only needs to do the addition and a check against limit. > IIRC at the time there were so many uncertainties flying around that > sticking with asm to take compiler unknowns out of the picture felt > desirable, but perhaps the time might be nigh to retire my baby after all... I guess we might have thought we'd need to pass the result into some masking logic, but I think uaccess_mask_ptr() turned out to be good enough in practice. > I'll investigate a bit further. Great! Thanks, Mark.
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index fe3dd70e901e..ca1acd7b95c3 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -22,6 +22,7 @@ #include <asm/ptrace.h> #include <asm/memory.h> #include <asm/extable.h> +#include <asm/ccset.h> #define get_fs() (current_thread_info()->addr_limit) @@ -86,10 +87,10 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si // comes from the carry in being clear. Otherwise, we are // testing X' - C == 0, subject to the previous adjustments. " sbcs xzr, %[addr], %[limit]\n" - " cset %[addr], ls\n" - : [addr] "=&r" (ret), [limit] "+r" (limit) + CC_SET(ls) + : [addr] "=&r" (addr), [limit] "+r" (limit), CC_OUT(ls) (ret) : [size] "Ir" (size), [addr_in] "r" (addr) - : "cc"); + : CC_CLOBBER); return ret; }
Uses of __range_ok almost always feed a branch. This allows the compiler to use flags directly. Signed-off-by: Richard Henderson <richard.henderson@linaro.org> --- arch/arm64/include/asm/uaccess.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)