Message ID: 20150112163648.GL12302@n2100.arm.linux.org.uk (mailing list archive)
State: New, archived
Hi Russell,

On Mon, 12 Jan 2015 08:36:48 -0800 Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> The theory is that moving outer_cache_sync() out of line moves the
> conditional test from multiple sites to a single location, which
> in turn means that the branch predictor stands a better chance of
> making the correct decision.
>
> However, I'm a hardware pauper - I have the choice of testing this
> on iMX6 (which seems to have bus limitations, and seems to produce
> a wide variation on measured performance which makes it hard to
> evaluate any differences) or the Versatile Express, which really is
> not suitable for making IO performance measurements.
>
> So, without help from people with "good" hardware, I can't evaluate
> the performance impact of this change. Anyone willing to do some
> performance testing on this and feed back any numbers?

I applied your patch to the Marvell internal BSP tree, then ran a basic
eMMC performance test with dd; performance improved by 0.13%. The SDHC
host driver is drivers/mmc/host/sdhci-pxav3.c - I'm not sure whether it
makes heavy use of non-relaxed IO accessors.

> Obviously, drivers which make use of a lot of non-relaxed IO accessors
> will be most affected by this - so you really have to know your
> drivers when deciding how to evaluate this (sorry, I can't say "measure
> network bandwidth" or "measure SSD sata performance" is a good test.)
>
> Theoretically, this should help overall system performance, since the
> branch predictor should be able to predict this better, but it's entirely
> possible that trying to benchmark a single workload won't be measurably
> different.
>
> In terms of kernel size figures, this change alone saves almost 17K of
> 10MB of kernel text on my iMX6 kernels - which is bordering on
> insignificant since that's not quite a 0.2% saving.
>
> So... right now I can't justify this change, but I'm hoping someone can
> come up with some figures which show that it benefits their workload
> without causing a performance regression for others.
[...]
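For context on why "non-relaxed IO accessors" are the ones affected: on
ARM, every writel() issues a write barrier before the store, and with an
outer cache configured that barrier ends in outer_sync(). Below is a
simplified sketch of that path, loosely based on the arch/arm barrier and
IO headers of this era - the exact macro definitions depend on kernel
version and configuration (e.g. CONFIG_ARM_DMA_MEM_BUFFERABLE), so treat
it as illustrative rather than the literal headers.

/* Illustrative sketch only - simplified from the arch/arm barrier and
 * IO headers of this era; exact definitions vary with kernel version
 * and configuration. */
#define dsb(opt)	asm volatile("dsb " #opt : : : "memory")

/* write barrier: drain the CPU's store buffers, then the outer cache's */
#define wmb()		do { dsb(st); outer_sync(); } while (0)
#define __iowmb()	wmb()

/* Non-relaxed accessor: pays for the barrier - and, with outer_sync()
 * inline, for the load/test/branch on outer_cache.sync - at every
 * single callsite. */
#define writel(v, c)	({ __iowmb(); writel_relaxed(v, c); })

/* Relaxed accessor: no barrier, hence unaffected by this change. */
#define writel_relaxed(v, c)	__raw_writel((u32)cpu_to_le32(v), c)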
On Mon, Jan 12, 2015 at 04:36:48PM +0000, Russell King - ARM Linux wrote:
> The theory is that moving outer_cache_sync() out of line moves the
> conditional test from multiple sites to a single location, which
> in turn means that the branch predictor stands a better chance of
> making the correct decision.
[...]
> --- a/arch/arm/mm/l2c-common.c
> +++ b/arch/arm/mm/l2c-common.c
> @@ -7,9 +7,17 @@
>   * published by the Free Software Foundation.
>   */
>  #include <linux/bug.h>
> +#include <linux/export.h>
>  #include <linux/smp.h>
>  #include <asm/outercache.h>
>
> +void outer_sync(void)
> +{
> +	if (outer_cache.sync)
> +		outer_cache.sync();
> +}

But don't you end up with two branches now? With an outer cache, I guess
the additional branch won't be noticed, especially if the prediction on
the outer_cache.sync() call is better (but that's not a simple
conditional branch prediction - it relies on predicting the value of the
pointer, and I'm not entirely sure how that works).

However, on SoCs without an outer cache (and a single kernel image), we
end up with a permanent branch/return even though such a branch would
not be needed.
On Tue, Jan 13, 2015 at 03:45:52PM +0000, Catalin Marinas wrote:
> On Mon, Jan 12, 2015 at 04:36:48PM +0000, Russell King - ARM Linux wrote:
> > The theory is that moving outer_cache_sync() out of line moves the
> > conditional test from multiple sites to a single location, which
> > in turn means that the branch predictor stands a better chance of
> > making the correct decision.
> [...]
> > --- a/arch/arm/mm/l2c-common.c
> > +++ b/arch/arm/mm/l2c-common.c
> > @@ -7,9 +7,17 @@
> >   * published by the Free Software Foundation.
> >   */
> >  #include <linux/bug.h>
> > +#include <linux/export.h>
> >  #include <linux/smp.h>
> >  #include <asm/outercache.h>
> >
> > +void outer_sync(void)
> > +{
> > +	if (outer_cache.sync)
> > +		outer_cache.sync();
> > +}
>
> But don't you end up with two branches now? With an outer cache, I guess
> the additional branch won't be noticed, especially if the prediction on
> the outer_cache.sync() call is better (but that's not a simple
> conditional branch prediction, it relies on predicting the value of the
> pointer, not entirely sure how this works).

Yes, we end up with two branches. The branch to this function is
unconditional and always taken - if a branch predictor gets that wrong,
then the branch predictor is broken!

However, that is not the only side effect of this change: the other
change is a reduction in instruction cache usage.

> However, on SoCs without an outer cache (and single kernel image), we
> end up with a permanent branch/return even though such branch would not
> be needed.

Right, so we go from something like this:

 188:   e594301c        ldr     r3, [r4, #28]
 18c:   e3002356        movw    r2, #854        ; 0x356
 190:   e3402100        movt    r2, #256        ; 0x100
 194:   e583204c        str     r2, [r3, #76]   ; 0x4c
 198:   f57ff04e        dsb     st
 19c:   e5953014        ldr     r3, [r5, #20]
 1a0:   e3530000        cmp     r3, #0
 1a4:   0a000000        beq     1ac <ipu_dp_csc_init+0x1ac>
 1a8:   e12fff33        blx     r3

down to:

 1e0:   e594301c        ldr     r3, [r4, #28]
 1e4:   e3002356        movw    r2, #854        ; 0x356
 1e8:   e3402100        movt    r2, #256        ; 0x100
 1ec:   e583204c        str     r2, [r3, #76]   ; 0x4c
 1f0:   f57ff04e        dsb     st
 1f4:   ebfffffe        bl      0 <outer_sync>

which saves 12 bytes per writel() - in favour of having the contents
of outer_cache_sync() in one location in the kernel, where it will
therefore be hotter in the instruction cache.

(The addresses differ because the optimiser arranges the functions
differently.)

I'm willing to bet that the excessive use of "inline" in the kernel is
actually harming things: it made sense when taking a branch was expensive
(such as with older ARM CPUs), but this isn't the case anymore.

Now, if we wish to discuss this theoretically, we can do so, but it's
all pie in the sky. What really matters is the performance measurements,
which is what I'm asking people to provide. So far, the results are
either not measurable (due to a huge variance in measurements), or a
small increase in performance.

A small increase in performance coupled with a small decrease in code
size wins over theoretical arguments. :)
On Monday 12 January 2015 16:36:48 Russell King - ARM Linux wrote:
> Theoretically, this should help overall system performance, since the
> branch predictor should be able to predict this better, but it's entirely
> possible that trying to benchmark a single workload won't be measurably
> different.
>
> In terms of kernel size figures, this change alone saves almost 17K of
> 10MB of kernel text on my iMX6 kernels - which is bordering on
> insignificant since that's not quite a 0.2% saving.
>
> So... right now I can't justify this change, but I'm hoping someone can
> come up with some figures which show that it benefits their workload
> without causing a performance regression for others.

From the theory, I think it can only help to do this. I would guess that
the time spent inside the cache_sync function dwarfs both the extra
unconditional branch you introduce and the possible misprediction, so 17K
in space savings sounds like more than enough justification to just do it.

Acked-by: Arnd Bergmann <arnd@arndb.de>
On Tue, Jan 13, 2015 at 04:27:54PM +0000, Russell King - ARM Linux wrote:
> On Tue, Jan 13, 2015 at 03:45:52PM +0000, Catalin Marinas wrote:
[...]
> I'm willing to bet that the excessive use of "inline" in the kernel is
> actually harming things: it made sense when taking a branch was expensive
> (such as with older ARM CPUs), but this isn't the case anymore.

BTW, what we /could/ do is to be a little more clever at these (and
similar) callsites.

We could have the linker build a table of callsites and a pointer to
the indirect function pointer, and we initialise each of the callsites
to a small "dynamic patcher."

When one of these sites is hit, the dynamic patcher is called, which
looks up the site in the linker-built table and patches the callsite
with a direct call to the required function or, if the function pointer
is NULL, replaces it with a NOP instruction.

This eliminates the loads to get the function pointer, the test, and
the conditional branch, turning the whole thing into a _single_ well
known branch or a NOP instruction.

Since things like the cache ops, processor ops, etc. are all constant
at runtime, I think this would make sense.
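To make the idea concrete, here is a very rough sketch of what such a
dynamic patcher could look like - purely hypothetical, not code from
this thread: the section name, entry layout and trampoline wiring are
invented, and a real implementation would also have to handle Thumb-2
code, write-protected kernel text and synchronisation against other
CPUs.

#include <linux/types.h>
#include <asm/cacheflush.h>

/* One entry per patchable callsite, emitted into a dedicated section
 * and collected into a table by the linker script (hypothetical). */
struct patchsite {
	unsigned long	site;		/* address of the call instruction */
	void		(**target)(void); /* the indirect function pointer */
};

extern struct patchsite __patchsite_begin[], __patchsite_end[];

#define ARM_INSN_NOP	0xe320f000	/* ARMv6K+ NOP */
#define ARM_INSN_BL	0xeb000000	/* BL, 24-bit signed word offset */

/* Called (via a trampoline) the first time a site is executed: rewrite
 * the site as a direct BL, or as a NOP if no function is registered. */
static void patch_site(struct patchsite *p)
{
	u32 *insn = (u32 *)p->site;
	void (*fn)(void) = *p->target;

	if (!fn) {
		*insn = ARM_INSN_NOP;
	} else {
		/* On ARM, the PC reads as the site address plus 8 */
		long offset = (long)fn - (long)(p->site + 8);

		*insn = ARM_INSN_BL | ((offset >> 2) & 0x00ffffff);
	}
	flush_icache_range(p->site, p->site + 4);
}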
On Tue, Jan 13, 2015 at 05:09:11PM +0000, Russell King - ARM Linux wrote:
[...]
> We could have the linker build a table of callsites and a pointer to
> the indirect function pointer, and we initialise each of the callsites
> to a small "dynamic patcher."
>
> When one of these sites is hit, the dynamic patcher is called, which
> looks up the site in the linker-built table and patches the callsite
> with a direct call to the required function or, if the function pointer
> is NULL, replaces it with a NOP instruction.
>
> This eliminates the loads to get the function pointer, the test, and
> the conditional branch, turning the whole thing into a _single_ well
> known branch or a NOP instruction.
>
> Since things like the cache ops, processor ops, etc. are all constant
> at runtime, I think this would make sense.

Jump labels. Maybe we could get them to do a BL instead of a B (unless
they can do this already). This would be beneficial in all cases.

(BTW, I'm off tomorrow, so not able to follow up.)
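For reference, a minimal sketch of what the jump-label route could look
like for this particular case, using the static key API as it existed
around this time - the key name and the enable hook below are
hypothetical, not from the thread.

/* In the header - sketch only. With the key disabled (no outer cache),
 * each callsite collapses to a single NOP; enabling the key patches the
 * NOP into a branch to the out-of-line .sync call. */
#include <linux/jump_label.h>

extern struct static_key outer_sync_key;

static inline void outer_sync(void)
{
	if (static_key_false(&outer_sync_key))
		outer_cache.sync();
}

/* In l2c-common.c - an outer cache driver would bump the key once after
 * installing its .sync method (hypothetical hook). */
struct static_key outer_sync_key = STATIC_KEY_INIT_FALSE;

void outer_sync_enable(void)
{
	static_key_slow_inc(&outer_sync_key);
}

Note that this still leaves the indirect outer_cache.sync() call in the
enabled path; patching the site directly to a BL, as suggested above,
would eliminate the out-of-line test entirely.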
diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
index 891a56b35bcf..807e4e71c8e7 100644
--- a/arch/arm/include/asm/outercache.h
+++ b/arch/arm/include/asm/outercache.h
@@ -133,11 +133,7 @@ static inline void outer_resume(void) { }
  * Ensure that all outer cache operations are complete and any store
  * buffers are drained.
  */
-static inline void outer_sync(void)
-{
-	if (outer_cache.sync)
-		outer_cache.sync();
-}
+extern void outer_sync(void);
 #else
 static inline void outer_sync(void)
 { }
diff --git a/arch/arm/mm/l2c-common.c b/arch/arm/mm/l2c-common.c
index 10a3cf28c362..b1c24c8c1eb9 100644
--- a/arch/arm/mm/l2c-common.c
+++ b/arch/arm/mm/l2c-common.c
@@ -7,9 +7,17 @@
  * published by the Free Software Foundation.
  */
 #include <linux/bug.h>
+#include <linux/export.h>
 #include <linux/smp.h>
 #include <asm/outercache.h>
 
+void outer_sync(void)
+{
+	if (outer_cache.sync)
+		outer_cache.sync();
+}
+EXPORT_SYMBOL(outer_sync);
+
 void outer_disable(void)
 {
 	WARN_ON(!irqs_disabled());