
[2/2] crypto: xor - use ktime for template benchmarking

Message ID 20200923182230.22715-3-ardb@kernel.org (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Series crypto: xor - defer and optimize boot time benchmark

Commit Message

Ard Biesheuvel Sept. 23, 2020, 6:22 p.m. UTC
Currently, we use the jiffies counter as a time source, by staring at
it until a HZ period elapses, and then staring at it again while
performing as many XOR operations as we can until another HZ period
elapses, so that we can calculate the throughput. This takes longer
than necessary, and depends on HZ, which is undesirable, since HZ is
system dependent.

Let's use the ktime interface instead, and use it to time a fixed
number of XOR operations. This can be done much faster, and makes the
time spent depend on the performance level of the system itself, which
is much more reasonable.

On ThunderX2, I get the following results:

Before:

  [72625.956765] xor: measuring software checksum speed
  [72625.993104]    8regs     : 10169.000 MB/sec
  [72626.033099]    32regs    : 12050.000 MB/sec
  [72626.073095]    arm64_neon: 11100.000 MB/sec
  [72626.073097] xor: using function: 32regs (12050.000 MB/sec)

After:

  [ 2503.189696] xor: measuring software checksum speed
  [ 2503.189896]    8regs           : 10556 MB/sec
  [ 2503.190061]    32regs          : 12538 MB/sec
  [ 2503.190250]    arm64_neon      : 11470 MB/sec
  [ 2503.190252] xor: using function: 32regs (12538 MB/sec)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 36 ++++++++------------
 1 file changed, 15 insertions(+), 21 deletions(-)
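
As a quick restatement of the arithmetic the patch relies on (a sketch
only, mirroring the BENCH_SIZE/REPS values in crypto/xor.c below; the
helper name here is made up for illustration): with a fixed amount of
work per timed pass, the throughput follows directly from the smallest
elapsed time.

#define BENCH_SIZE	4096
#define REPS		100

/* bytes/ns == GB/s (decimal), so multiply by 1000 to get MB/s */
static unsigned int xor_speed_mb_per_s(unsigned long long min_ns)
{
	return (1000ULL * REPS * BENCH_SIZE) / min_ns;
}

For example, min_ns = 38802 yields 10556 MB/sec, matching the 8regs
line above.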

Comments

Doug Anderson Sept. 24, 2020, 12:36 a.m. UTC | #1
Hi,

On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Currently, we use the jiffies counter as a time source, by staring at
> it until a HZ period elapses, and then staring at it again and perform
> as many XOR operations as we can at the same time until another HZ
> period elapses, so that we can calculate the throughput. This takes
> longer than necessary, and depends on HZ, which is undesirable, since
> HZ is system dependent.
>
> Let's use the ktime interface instead, and use it to time a fixed
> number of XOR operations, which can be done much faster, and makes
> the time spent depend on the performance level of the system itself,
> which is much more reasonable.
>
> On ThunderX2, I get the following results:
>
> Before:
>
>   [72625.956765] xor: measuring software checksum speed
>   [72625.993104]    8regs     : 10169.000 MB/sec
>   [72626.033099]    32regs    : 12050.000 MB/sec
>   [72626.073095]    arm64_neon: 11100.000 MB/sec
>   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
>
> After:
>
>   [ 2503.189696] xor: measuring software checksum speed
>   [ 2503.189896]    8regs           : 10556 MB/sec
>   [ 2503.190061]    32regs          : 12538 MB/sec
>   [ 2503.190250]    arm64_neon      : 11470 MB/sec
>   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  crypto/xor.c | 36 ++++++++------------
>  1 file changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/crypto/xor.c b/crypto/xor.c
> index b42c38343733..23f98b451b69 100644
> --- a/crypto/xor.c
> +++ b/crypto/xor.c
> @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
>  }
>  #endif
>
> -#define BENCH_SIZE (PAGE_SIZE)
> +#define BENCH_SIZE     4096

I'm curious why the change away from PAGE_SIZE to using 4096.
Everything should work OK w/ using PAGE_SIZE, right?


> +#define REPS           100

Is this sufficient?  I'm not sure what the lower bound is on what's
expected of ktime.  If I'm doing the math right, on your system
running 100 loops took 38802 ns in one case, since:

(4096 * 1000 * 100) / 10556 = 38802

If you happen to have your timer backed by a 32 kHz clock, one tick of
ktime could be as much as 31250 ns, right?  Maybe on systems backed
with a 32kHz clock they'll take longer, but it still seems moderately
iffy?  I dunno, maybe I'm just being paranoid.
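
To put numbers on this concern, here is a standalone back-of-the-envelope
check (a sketch restating the arithmetic above, not part of the patch;
the 10556 MB/sec figure is the 8regs result from the commit message):

#include <stdio.h>

#define BENCH_SIZE	4096	/* bytes per do_2() call, as in the patch */
#define REPS		100	/* iterations per timed pass */

int main(void)
{
	unsigned long long mb_per_s = 10556;	/* measured 8regs throughput */
	unsigned long long tick_ns  = 31250;	/* one tick of a 32 kHz clock */
	/* pass duration in ns: bytes * 1000 / (MB/s) */
	unsigned long long pass_ns  =
		(unsigned long long)REPS * BENCH_SIZE * 1000 / mb_per_s;

	printf("one timed pass: ~%llu ns, ~%llu tick(s) of a 32 kHz clocksource\n",
	       pass_ns, pass_ns / tick_ns);
	return 0;
}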


>  static void __init
>  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
>  {
>         int speed;
> -       unsigned long now, j;
> -       int i, count, max;
> +       int i, j, count;
> +       ktime_t min, start, diff;
>
>         tmpl->next = template_list;
>         template_list = tmpl;
>
>         preempt_disable();
>
> -       /*
> -        * Count the number of XORs done during a whole jiffy, and use
> -        * this to calculate the speed of checksumming.  We use a 2-page
> -        * allocation to have guaranteed color L1-cache layout.
> -        */
> -       max = 0;
> +       min = (ktime_t)S64_MAX;
>         for (i = 0; i < 5; i++) {
> -               j = jiffies;
> -               count = 0;
> -               while ((now = jiffies) == j)
> -                       cpu_relax();
> -               while (time_before(jiffies, now + 1)) {
> +               start = ktime_get();
> +               for (j = 0; j < REPS; j++) {
>                         mb(); /* prevent loop optimzation */
>                         tmpl->do_2(BENCH_SIZE, b1, b2);
>                         mb();
>                         count++;
>                         mb();
>                 }
> -               if (count > max)
> -                       max = count;
> +               diff = ktime_sub(ktime_get(), start);
> +               if (diff < min)
> +                       min = diff;
>         }
>
>         preempt_enable();
>
> -       speed = max * (HZ * BENCH_SIZE / 1024);
> +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]

Comment is super helpful, thanks!  ...but are folks really OK with
"//" comments these days?


> +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;

nit: Just for prettiness, maybe call ktime_to_ns(min)?

optional nit: I always think of u32 as something for accessing
hardware.  Maybe "unsigned int"?


>         tmpl->speed = speed;
>
> -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> -              speed / 1000, speed % 1000);
> +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);

Since you're touching, switch to pr_info()?


>  }
>
>  static int __init
> @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
>                 if (f->speed > fastest->speed)
>                         fastest = f;
>
> -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> +              fastest->name, fastest->speed);

Since you're touching, switch to pr_info()?


-Doug
Ard Biesheuvel Sept. 24, 2020, 8:31 a.m. UTC | #2
On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > Currently, we use the jiffies counter as a time source, by staring at
> > it until a HZ period elapses, and then staring at it again and perform
> > as many XOR operations as we can at the same time until another HZ
> > period elapses, so that we can calculate the throughput. This takes
> > longer than necessary, and depends on HZ, which is undesirable, since
> > HZ is system dependent.
> >
> > Let's use the ktime interface instead, and use it to time a fixed
> > number of XOR operations, which can be done much faster, and makes
> > the time spent depend on the performance level of the system itself,
> > which is much more reasonable.
> >
> > On ThunderX2, I get the following results:
> >
> > Before:
> >
> >   [72625.956765] xor: measuring software checksum speed
> >   [72625.993104]    8regs     : 10169.000 MB/sec
> >   [72626.033099]    32regs    : 12050.000 MB/sec
> >   [72626.073095]    arm64_neon: 11100.000 MB/sec
> >   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> >
> > After:
> >
> >   [ 2503.189696] xor: measuring software checksum speed
> >   [ 2503.189896]    8regs           : 10556 MB/sec
> >   [ 2503.190061]    32regs          : 12538 MB/sec
> >   [ 2503.190250]    arm64_neon      : 11470 MB/sec
> >   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  crypto/xor.c | 36 ++++++++------------
> >  1 file changed, 15 insertions(+), 21 deletions(-)
> >
> > diff --git a/crypto/xor.c b/crypto/xor.c
> > index b42c38343733..23f98b451b69 100644
> > --- a/crypto/xor.c
> > +++ b/crypto/xor.c
> > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> >  }
> >  #endif
> >
> > -#define BENCH_SIZE (PAGE_SIZE)
> > +#define BENCH_SIZE     4096
>
> I'm curious why the change away from PAGE_SIZE to using 4096.
> Everything should work OK w/ using PAGE_SIZE, right?
>

Yes, but then the test will take 16x more time on a 64k page size
system for no reason whatsoever.

>
> > +#define REPS           100
>
> Is this sufficient?  I'm not sure what the lower bound on what's
> expected of ktime.  If I'm doing the math right, on your system
> running 100 loops took 38802 ns in one case, since:
>
> (4096 * 1000 * 100) / 10556 = 38802
>
> If you happen to have your timer backed by a 32 kHz clock, one tick of
> ktime could be as much as 31250 ns, right?  Maybe on systems backed
> with a 32kHz clock they'll take longer, but it still seems moderately
> iffy?  I dunno, maybe I'm just being paranoid.
>

No, that is a good point - I didn't really consider that ktime could
be that coarse.

OTOH, we don't really need the full 5 digits of precision either, as
long as we don't misidentify the fastest algorithm.

So I think it should be sufficient to bump this to 800. If my
calculations are correct, this would limit any potential
misidentification of algorithms performing below 10 GB/s to ones that
only deviate in performance up to 10%.

800 * 1000 * 4096 / (10 * 31250) = 10485
800 * 1000 * 4096 / (11 * 31250) = 9532

(10485 - 9532) / 9532 = 10%

>
> >  static void __init
> >  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
> >  {
> >         int speed;
> > -       unsigned long now, j;
> > -       int i, count, max;
> > +       int i, j, count;
> > +       ktime_t min, start, diff;
> >
> >         tmpl->next = template_list;
> >         template_list = tmpl;
> >
> >         preempt_disable();
> >
> > -       /*
> > -        * Count the number of XORs done during a whole jiffy, and use
> > -        * this to calculate the speed of checksumming.  We use a 2-page
> > -        * allocation to have guaranteed color L1-cache layout.
> > -        */
> > -       max = 0;
> > +       min = (ktime_t)S64_MAX;
> >         for (i = 0; i < 5; i++) {
> > -               j = jiffies;
> > -               count = 0;
> > -               while ((now = jiffies) == j)
> > -                       cpu_relax();
> > -               while (time_before(jiffies, now + 1)) {
> > +               start = ktime_get();
> > +               for (j = 0; j < REPS; j++) {
> >                         mb(); /* prevent loop optimzation */
> >                         tmpl->do_2(BENCH_SIZE, b1, b2);
> >                         mb();
> >                         count++;
> >                         mb();
> >                 }
> > -               if (count > max)
> > -                       max = count;
> > +               diff = ktime_sub(ktime_get(), start);
> > +               if (diff < min)
> > +                       min = diff;
> >         }
> >
> >         preempt_enable();
> >
> > -       speed = max * (HZ * BENCH_SIZE / 1024);
> > +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
>
> Comment is super helpful, thanks!  ...but are folks really OK with
> "//" comments these days?
>

Linus said he is fine with it, and even prefers it for single-line
comments, so I don't think it's a problem.

>
> > +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
>
> nit: Just for prettiness, maybe call ktime_to_ns(min)?
>
> optional nit: I always think of u32 as something for accessing
> hardware.  Maybe "unsigned int"?
>

Ack

>
> >         tmpl->speed = speed;
> >
> > -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> > -              speed / 1000, speed % 1000);
> > +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
>
> Since you're touching, switch to pr_info()?
>

Ack (x2)


>
> >  }
> >
> >  static int __init
> > @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
> >                 if (f->speed > fastest->speed)
> >                         fastest = f;
> >
> > -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> > -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> > +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> > +              fastest->name, fastest->speed);
>
> Since you're touching, switch to pr_info()?
>
>
> -Doug
Doug Anderson Sept. 24, 2020, 3:28 p.m. UTC | #3
Hi,

On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > Currently, we use the jiffies counter as a time source, by staring at
> > > it until a HZ period elapses, and then staring at it again and perform
> > > as many XOR operations as we can at the same time until another HZ
> > > period elapses, so that we can calculate the throughput. This takes
> > > longer than necessary, and depends on HZ, which is undesirable, since
> > > HZ is system dependent.
> > >
> > > Let's use the ktime interface instead, and use it to time a fixed
> > > number of XOR operations, which can be done much faster, and makes
> > > the time spent depend on the performance level of the system itself,
> > > which is much more reasonable.
> > >
> > > On ThunderX2, I get the following results:
> > >
> > > Before:
> > >
> > >   [72625.956765] xor: measuring software checksum speed
> > >   [72625.993104]    8regs     : 10169.000 MB/sec
> > >   [72626.033099]    32regs    : 12050.000 MB/sec
> > >   [72626.073095]    arm64_neon: 11100.000 MB/sec
> > >   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> > >
> > > After:
> > >
> > >   [ 2503.189696] xor: measuring software checksum speed
> > >   [ 2503.189896]    8regs           : 10556 MB/sec
> > >   [ 2503.190061]    32regs          : 12538 MB/sec
> > >   [ 2503.190250]    arm64_neon      : 11470 MB/sec
> > >   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > >  crypto/xor.c | 36 ++++++++------------
> > >  1 file changed, 15 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/crypto/xor.c b/crypto/xor.c
> > > index b42c38343733..23f98b451b69 100644
> > > --- a/crypto/xor.c
> > > +++ b/crypto/xor.c
> > > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> > >  }
> > >  #endif
> > >
> > > -#define BENCH_SIZE (PAGE_SIZE)
> > > +#define BENCH_SIZE     4096
> >
> > I'm curious why the change away from PAGE_SIZE to using 4096.
> > Everything should work OK w/ using PAGE_SIZE, right?
> >
>
> Yes, but then the test will take 16x more time on a 64k page size
> system for no reason whatsoever.

Ah.  I wasn't sure if using PAGE_SIZE was meant to help avoid
measurement inaccuracies or to make it work more like the real thing
(maybe the code that calls XOR often does it PAGE_SIZE at a time?).  I
definitely haven't played with it, but I could sorta imagine it making
some small differences.  Maybe the "prefetch" versions of the XOR ops
have some overhead but the overhead is mitigated with larger
operations?  Though both 4K and 64K are probably large enough and
probably it wouldn't affect the outcome of which algorithm is best...


> > > +#define REPS           100
> >
> > Is this sufficient?  I'm not sure what the lower bound on what's
> > expected of ktime.  If I'm doing the math right, on your system
> > running 100 loops took 38802 ns in one case, since:
> >
> > (4096 * 1000 * 100) / 10556 = 38802
> >
> > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > with a 32kHz clock they'll take longer, but it still seems moderately
> > iffy?  I dunno, maybe I'm just being paranoid.
> >
>
> No, that is a good point - I didn't really consider that ktime could
> be that coarse.
>
> OTOH, we don't really need the full 5 digits of precision either, as
> long as we don't misidentify the fastest algorithm.
>
> So I think it should be sufficient to bump this to 800. If my
> calculations are correct, this would limit any potential
> misidentification of algorithms performing below 10 GB/s to ones that
> only deviate in performance up to 10%.
>
> 800 * 1000 * 4096 / (10 * 31250) = 10485
> 800 * 1000 * 4096 / (11 * 31250) = 9532
>
> (10485/9532) / 10485 = 10%

Seems OK to me.  It seems unlikely that super fast machines are going
to have a 32 kHz-backed ktime, and the worst case is that we'll pick a
slightly sub-optimal xor, I guess.  I assume your goal is to keep
things fitting in a 32-bit unsigned integer?  Looks like if you use
1000 it also fits...

-Doug
Ard Biesheuvel Sept. 24, 2020, 3:36 p.m. UTC | #4
On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
>
> On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
...
> > > > +#define REPS           100
> > >
> > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > expected of ktime.  If I'm doing the math right, on your system
> > > running 100 loops took 38802 ns in one case, since:
> > >
> > > (4096 * 1000 * 100) / 10556 = 38802
> > >
> > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > iffy?  I dunno, maybe I'm just being paranoid.
> > >
> >
> > No, that is a good point - I didn't really consider that ktime could
> > be that coarse.
> >
> > OTOH, we don't really need the full 5 digits of precision either, as
> > long as we don't misidentify the fastest algorithm.
> >
> > So I think it should be sufficient to bump this to 800. If my
> > calculations are correct, this would limit any potential
> > misidentification of algorithms performing below 10 GB/s to ones that
> > only deviate in performance up to 10%.
> >
> > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > 800 * 1000 * 4096 / (11 * 31250) = 9532
> >
> > (10485/9532) / 10485 = 10%
>
> Seems OK to me.  Seems unlikely that super fast machine are going to
> have a 32 kHz backed k_time and the worst case is that we'll pick a
> slightly sub-optimal xor, I guess.  I assume your goal is to keep
> things fitting in a 32-bit unsigned integer?  Looks like if your use
> 1000 it also fits...
>

Yes, but the larger we make this number, the more time the test will
take on such slow machines. Doing 1000 iterations of 4k on a low-end
machine that only manages 500 MB/s (?) takes a couple of milliseconds,
which is more than it does today when HZ=1000 I think.

Not that 800 vs 1000 makes a great deal of difference in that regard,
just illustrating that there is an upper bound as well.
Doug Anderson Sept. 24, 2020, 6:22 p.m. UTC | #5
Hi,

On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> ...
> > > > > +#define REPS           100
> > > >
> > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > expected of ktime.  If I'm doing the math right, on your system
> > > > running 100 loops took 38802 ns in one case, since:
> > > >
> > > > (4096 * 1000 * 100) / 10556 = 38802
> > > >
> > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > >
> > >
> > > No, that is a good point - I didn't really consider that ktime could
> > > be that coarse.
> > >
> > > OTOH, we don't really need the full 5 digits of precision either, as
> > > long as we don't misidentify the fastest algorithm.
> > >
> > > So I think it should be sufficient to bump this to 800. If my
> > > calculations are correct, this would limit any potential
> > > misidentification of algorithms performing below 10 GB/s to ones that
> > > only deviate in performance up to 10%.
> > >
> > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > >
> > > (10485/9532) / 10485 = 10%
> >
> > Seems OK to me.  Seems unlikely that super fast machine are going to
> > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > 1000 it also fits...
> >
>
> Yes, but the larger we make this number, the more time the test will
> take on such slow machines. Doing 1000 iterations of 4k on a low-end
> machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> which is more than it does today when HZ=1000 I think.
>
> Not that 800 vs 1000 makes a great deal of difference in that regard,
> just illustrating that there is an upper bound as well.

Would it make sense to use some type of hybrid approach?  I know
getting ktime itself has some overhead so you don't want to do it in a
tight loop, but maybe calling it every once in a while would be
acceptable and if it's been more than 500 us then stop early?

-Doug
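
A minimal sketch of the hybrid loop being suggested here (illustrative
only, and not what was ultimately merged, as the follow-up below
explains): reuse the patch's BENCH_SIZE/REPS and xor_block_template, run
up to REPS iterations, but peek at ktime every 16 reps and stop early
once ~500 us have elapsed. The function name is made up.

static int __init xor_speed_pass_hybrid(struct xor_block_template *tmpl,
					void *b1, void *b2)
{
	ktime_t start = ktime_get();
	int j;

	for (j = 0; j < REPS; j++) {
		mb(); /* prevent loop optimization */
		tmpl->do_2(BENCH_SIZE, b1, b2);
		mb();
		/* check the clock only occasionally to limit ktime overhead */
		if ((j & 15) == 15 &&
		    ktime_sub(ktime_get(), start) > 500 * NSEC_PER_USEC)
			break;
	}

	/* number of reps actually timed, for the MB/s calculation */
	return j < REPS ? j + 1 : REPS;
}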
Ard Biesheuvel Sept. 24, 2020, 6:40 p.m. UTC | #6
On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > ...
> > > > > > +#define REPS           100
> > > > >
> > > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > > expected of ktime.  If I'm doing the math right, on your system
> > > > > running 100 loops took 38802 ns in one case, since:
> > > > >
> > > > > (4096 * 1000 * 100) / 10556 = 38802
> > > > >
> > > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > > >
> > > >
> > > > No, that is a good point - I didn't really consider that ktime could
> > > > be that coarse.
> > > >
> > > > OTOH, we don't really need the full 5 digits of precision either, as
> > > > long as we don't misidentify the fastest algorithm.
> > > >
> > > > So I think it should be sufficient to bump this to 800. If my
> > > > calculations are correct, this would limit any potential
> > > > misidentification of algorithms performing below 10 GB/s to ones that
> > > > only deviate in performance up to 10%.
> > > >
> > > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > > >
> > > > (10485/9532) / 10485 = 10%
> > >
> > > Seems OK to me.  Seems unlikely that super fast machine are going to
> > > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > > 1000 it also fits...
> > >
> >
> > Yes, but the larger we make this number, the more time the test will
> > take on such slow machines. Doing 1000 iterations of 4k on a low-end
> > machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> > which is more than it does today when HZ=1000 I think.
> >
> > Not that 800 vs 1000 makes a great deal of difference in that regard,
> > just illustrating that there is an upper bound as well.
>
> Would it make sense to use some type of hybrid approach?  I know
> getting ktime itself has some overhead so you don't want to do it in a
> tight loop, but maybe calling it every once in a while would be
> acceptable and if it's been more than 500 us then stop early?
>

To be honest, I don't think we need complexity like this - if boot
time is critical on such a slow system, you probably won't have XOR
built in, assuming it even makes sense to do software XOR on such a
system.

It is indeed preferable to have a numerator that fits in a U32, and so
1000 would be equally suitable in that regard, but I think I will
stick with 800 if you don't mind.
Doug Anderson Sept. 24, 2020, 6:52 p.m. UTC | #7
Hi,

On Thu, Sep 24, 2020 at 11:40 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > ...
> > > > > > > +#define REPS           100
> > > > > >
> > > > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > > > expected of ktime.  If I'm doing the math right, on your system
> > > > > > running 100 loops took 38802 ns in one case, since:
> > > > > >
> > > > > > (4096 * 1000 * 100) / 10556 = 38802
> > > > > >
> > > > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > > > >
> > > > >
> > > > > No, that is a good point - I didn't really consider that ktime could
> > > > > be that coarse.
> > > > >
> > > > > OTOH, we don't really need the full 5 digits of precision either, as
> > > > > long as we don't misidentify the fastest algorithm.
> > > > >
> > > > > So I think it should be sufficient to bump this to 800. If my
> > > > > calculations are correct, this would limit any potential
> > > > > misidentification of algorithms performing below 10 GB/s to ones that
> > > > > only deviate in performance up to 10%.
> > > > >
> > > > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > > > >
> > > > > (10485/9532) / 10485 = 10%
> > > >
> > > > Seems OK to me.  Seems unlikely that super fast machine are going to
> > > > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > > > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > > > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > > > 1000 it also fits...
> > > >
> > >
> > > Yes, but the larger we make this number, the more time the test will
> > > take on such slow machines. Doing 1000 iterations of 4k on a low-end
> > > machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> > > which is more than it does today when HZ=1000 I think.
> > >
> > > Not that 800 vs 1000 makes a great deal of difference in that regard,
> > > just illustrating that there is an upper bound as well.
> >
> > Would it make sense to use some type of hybrid approach?  I know
> > getting ktime itself has some overhead so you don't want to do it in a
> > tight loop, but maybe calling it every once in a while would be
> > acceptable and if it's been more than 500 us then stop early?
> >
>
> To be honest, I don't think we don't need complexity like this - if
> boot time is critical on such a slow system, you probable won't have
> XOR built in, assuming it even makes sense to do software XOR on such
> a system.
>
> It is indeed preferable to have a numerator that fits in a U32, and so
> 1000 would be equally suitable in that regard, but I think I will
> stick with 800 if you don't mind.

OK, fair enough.

-Doug

Patch

diff --git a/crypto/xor.c b/crypto/xor.c
index b42c38343733..23f98b451b69 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -76,49 +76,43 @@  static int __init register_xor_blocks(void)
 }
 #endif
 
-#define BENCH_SIZE (PAGE_SIZE)
+#define BENCH_SIZE	4096
+#define REPS		100
 
 static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
 	int speed;
-	unsigned long now, j;
-	int i, count, max;
+	int i, j, count;
+	ktime_t min, start, diff;
 
 	tmpl->next = template_list;
 	template_list = tmpl;
 
 	preempt_disable();
 
-	/*
-	 * Count the number of XORs done during a whole jiffy, and use
-	 * this to calculate the speed of checksumming.  We use a 2-page
-	 * allocation to have guaranteed color L1-cache layout.
-	 */
-	max = 0;
+	min = (ktime_t)S64_MAX;
 	for (i = 0; i < 5; i++) {
-		j = jiffies;
-		count = 0;
-		while ((now = jiffies) == j)
-			cpu_relax();
-		while (time_before(jiffies, now + 1)) {
+		start = ktime_get();
+		for (j = 0; j < REPS; j++) {
 			mb(); /* prevent loop optimzation */
 			tmpl->do_2(BENCH_SIZE, b1, b2);
 			mb();
 			count++;
 			mb();
 		}
-		if (count > max)
-			max = count;
+		diff = ktime_sub(ktime_get(), start);
+		if (diff < min)
+			min = diff;
 	}
 
 	preempt_enable();
 
-	speed = max * (HZ * BENCH_SIZE / 1024);
+	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
+	speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
 	tmpl->speed = speed;
 
-	printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
-	       speed / 1000, speed % 1000);
+	printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
 
 static int __init
@@ -158,8 +152,8 @@  calibrate_xor_blocks(void)
 		if (f->speed > fastest->speed)
 			fastest = f;
 
-	printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
-	       fastest->name, fastest->speed / 1000, fastest->speed % 1000);
+	printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
+	       fastest->name, fastest->speed);
 
 #undef xor_speed