Message ID | 20200923182230.22715-3-ardb@kernel.org (mailing list archive)
---|---
State | Superseded
Delegated to: | Herbert Xu
Series | crypto: xor - defer and optimize boot time benchmark
Hi,

On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Currently, we use the jiffies counter as a time source, by staring at
> it until a HZ period elapses, and then staring at it again and
> performing as many XOR operations as we can until another HZ period
> elapses, so that we can calculate the throughput. This takes longer
> than necessary, and depends on HZ, which is undesirable, since HZ is
> system dependent.
>
> Let's use the ktime interface instead, and use it to time a fixed
> number of XOR operations, which can be done much faster, and makes
> the time spent depend on the performance level of the system itself,
> which is much more reasonable.
>
> On ThunderX2, I get the following results:
>
> Before:
>
>   [72625.956765] xor: measuring software checksum speed
>   [72625.993104]    8regs     : 10169.000 MB/sec
>   [72626.033099]    32regs    : 12050.000 MB/sec
>   [72626.073095]    arm64_neon: 11100.000 MB/sec
>   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
>
> After:
>
>   [ 2503.189696] xor: measuring software checksum speed
>   [ 2503.189896]    8regs           : 10556 MB/sec
>   [ 2503.190061]    32regs          : 12538 MB/sec
>   [ 2503.190250]    arm64_neon      : 11470 MB/sec
>   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  crypto/xor.c | 36 ++++++++------------
>  1 file changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/crypto/xor.c b/crypto/xor.c
> index b42c38343733..23f98b451b69 100644
> --- a/crypto/xor.c
> +++ b/crypto/xor.c
> @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
>  }
>  #endif
>
> -#define BENCH_SIZE (PAGE_SIZE)
> +#define BENCH_SIZE 4096

I'm curious why the change away from PAGE_SIZE to using 4096.
Everything should work OK w/ using PAGE_SIZE, right?

> +#define REPS 100

Is this sufficient?  I'm not sure what the lower bound on what's
expected of ktime is.  If I'm doing the math right, on your system
running 100 loops took 38802 ns in one case, since:

(4096 * 1000 * 100) / 10556 = 38802

If you happen to have your timer backed by a 32 kHz clock, one tick of
ktime could be as much as 31250 ns, right?  Maybe on systems backed
with a 32 kHz clock they'll take longer, but it still seems moderately
iffy?  I dunno, maybe I'm just being paranoid.

>  static void __init
>  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
>  {
>         int speed;
> -       unsigned long now, j;
> -       int i, count, max;
> +       int i, j, count;
> +       ktime_t min, start, diff;
>
>         tmpl->next = template_list;
>         template_list = tmpl;
>
>         preempt_disable();
>
> -       /*
> -        * Count the number of XORs done during a whole jiffy, and use
> -        * this to calculate the speed of checksumming.  We use a 2-page
> -        * allocation to have guaranteed color L1-cache layout.
> -        */
> -       max = 0;
> +       min = (ktime_t)S64_MAX;
>         for (i = 0; i < 5; i++) {
> -               j = jiffies;
> -               count = 0;
> -               while ((now = jiffies) == j)
> -                       cpu_relax();
> -               while (time_before(jiffies, now + 1)) {
> +               start = ktime_get();
> +               for (j = 0; j < REPS; j++) {
>                         mb(); /* prevent loop optimzation */
>                         tmpl->do_2(BENCH_SIZE, b1, b2);
>                         mb();
>                         count++;
>                         mb();
>                 }
> -               if (count > max)
> -                       max = count;
> +               diff = ktime_sub(ktime_get(), start);
> +               if (diff < min)
> +                       min = diff;
>         }
>
>         preempt_enable();
>
> -       speed = max * (HZ * BENCH_SIZE / 1024);
> +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]

Comment is super helpful, thanks!  ...but are folks really OK with
"//" comments these days?

> +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;

nit: Just for prettiness, maybe call ktime_to_ns(min)?

optional nit: I always think of u32 as something for accessing
hardware.  Maybe "unsigned int"?

>         tmpl->speed = speed;
>
> -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> -              speed / 1000, speed % 1000);
> +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);

Since you're touching, switch to pr_info()?

>  }
>
>  static int __init
> @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
>                 if (f->speed > fastest->speed)
>                         fastest = f;
>
> -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> +              fastest->name, fastest->speed);

Since you're touching, switch to pr_info()?

-Doug
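Doug's arithmetic can be spot-checked with a few lines of user-space C.
This is a standalone sketch of the math only, not kernel code: the 10556
MB/sec input is Ard's ThunderX2 number from the patch, and the 31250 ns
tick is an assumed worst-case 32 kHz clocksource, as in Doug's scenario.

/*
 * How many 32 kHz clock ticks does one timing pass span at ~10.5 GB/s
 * with REPS = 100?
 */
#include <stdio.h>

int main(void)
{
        const long long bench_size = 4096;      /* bytes per do_2() call */
        const long long reps = 100;
        const long long speed_mbs = 10556;      /* 1 MB/sec == 1 byte/usec */
        const long long tick_ns = 31250;        /* one 32 kHz clock tick */

        /* ns = bytes * 1000 / (MB/sec) */
        long long elapsed_ns = bench_size * reps * 1000 / speed_mbs;

        printf("one pass: %lld ns = %.2f ticks of a 32 kHz clock\n",
               elapsed_ns, (double)elapsed_ns / tick_ns);
        return 0;
}

This prints roughly "38802 ns = 1.24 ticks": the whole pass spans barely
more than one tick, so a single tick of quantization error would swamp
the measurement, which is the iffiness Doug is pointing at.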
On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> ...
>
> > -#define BENCH_SIZE (PAGE_SIZE)
> > +#define BENCH_SIZE 4096
>
> I'm curious why the change away from PAGE_SIZE to using 4096.
> Everything should work OK w/ using PAGE_SIZE, right?
>

Yes, but then the test will take 16x more time on a 64k page size
system for no reason whatsoever.

> > +#define REPS 100
>
> Is this sufficient?  I'm not sure what the lower bound on what's
> expected of ktime is.  If I'm doing the math right, on your system
> running 100 loops took 38802 ns in one case, since:
>
> (4096 * 1000 * 100) / 10556 = 38802
>
> If you happen to have your timer backed by a 32 kHz clock, one tick of
> ktime could be as much as 31250 ns, right?  Maybe on systems backed
> with a 32 kHz clock they'll take longer, but it still seems moderately
> iffy?  I dunno, maybe I'm just being paranoid.
>

No, that is a good point - I didn't really consider that ktime could
be that coarse.

OTOH, we don't really need the full 5 digits of precision either, as
long as we don't misidentify the fastest algorithm.

So I think it should be sufficient to bump this to 800. If my
calculations are correct, this would limit any potential
misidentification of algorithms performing below 10 GB/s to ones that
only deviate in performance up to 10%:

800 * 1000 * 4096 / (10 * 31250) = 10485
800 * 1000 * 4096 / (11 * 31250) = 9532

(10485 - 9532) / 10485 = ~10%

> > + // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
>
> Comment is super helpful, thanks!  ...but are folks really OK with
> "//" comments these days?
>

Linus said he is fine with it, and even prefers it for single line
comments, so I don't think it's a problem.

> > + speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
>
> nit: Just for prettiness, maybe call ktime_to_ns(min)?
>
> optional nit: I always think of u32 as something for accessing
> hardware.  Maybe "unsigned int"?
>

Ack

> > + printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
>
> Since you're touching, switch to pr_info()?
>

Ack (x2)

> ...
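Ard's bound can be sanity-checked the same way. This user-space sketch
quantizes the elapsed time to whole 31250 ns ticks (again the assumed
worst-case 32 kHz clocksource) and prints the speed the patch's formula
would report:

/*
 * With REPS = 800, a ~10 GB/s algorithm lands on 10 or 11 whole ticks,
 * and adjacent tick counts differ by about 10%, so only algorithms
 * within ~10% of each other can be misranked.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long long reps = 800, bench_size = 4096;
        const unsigned long long tick_ns = 31250;

        for (unsigned long long ticks = 10; ticks <= 11; ticks++)
                /* same formula as the patch: MB/sec = 1000 * bytes / ns */
                printf("%llu ticks -> %llu MB/sec\n", ticks,
                       1000 * reps * bench_size / (ticks * tick_ns));
        return 0;
}

This prints 10485 and 9532 MB/sec, matching the figures in the message.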
Hi,

On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
>
> ...
>
> > > -#define BENCH_SIZE (PAGE_SIZE)
> > > +#define BENCH_SIZE 4096
> >
> > I'm curious why the change away from PAGE_SIZE to using 4096.
> > Everything should work OK w/ using PAGE_SIZE, right?
>
> Yes, but then the test will take 16x more time on a 64k page size
> system for no reason whatsoever.

Ah.  I wasn't sure if using PAGE_SIZE was meant to help avoid
measurement inaccuracies or to make it work more like the real thing
(maybe the code that calls XOR often does it PAGE_SIZE at a time?).  I
definitely haven't played with it, but I could sorta imagine it making
some small differences.  Maybe the "prefetch" versions of the XOR ops
have some overhead, but the overhead is mitigated with larger
operations?  Though both 4K and 64K are probably large enough, and
probably it wouldn't affect the outcome of which algorithm is best...

> > > +#define REPS 100
> >
> > Is this sufficient? ...
>
> No, that is a good point - I didn't really consider that ktime could
> be that coarse.
>
> OTOH, we don't really need the full 5 digits of precision either, as
> long as we don't misidentify the fastest algorithm.
>
> So I think it should be sufficient to bump this to 800. If my
> calculations are correct, this would limit any potential
> misidentification of algorithms performing below 10 GB/s to ones that
> only deviate in performance up to 10%:
>
> 800 * 1000 * 4096 / (10 * 31250) = 10485
> 800 * 1000 * 4096 / (11 * 31250) = 9532
>
> (10485 - 9532) / 10485 = ~10%

Seems OK to me.  It seems unlikely that super fast machines are going
to have a 32 kHz backed ktime, and the worst case is that we'll pick a
slightly sub-optimal xor, I guess.  I assume your goal is to keep
things fitting in a 32-bit unsigned integer?  Looks like if you use
1000 it also fits...

-Doug
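The "fits in a 32-bit unsigned integer" point can be pinned down with
compile-time asserts.  The REPS = 1048 ceiling below is derived here
from the formula, not taken from the thread:

/* The numerator 1000 * REPS * BENCH_SIZE must stay below 2^32. */
#include <stdint.h>

#define BENCH_SIZE 4096ULL

_Static_assert(1000 * 800ULL  * BENCH_SIZE <= UINT32_MAX, "REPS=800 fits");
_Static_assert(1000 * 1000ULL * BENCH_SIZE <= UINT32_MAX, "REPS=1000 fits");
_Static_assert(1000 * 1048ULL * BENCH_SIZE <= UINT32_MAX, "REPS=1048 fits");
_Static_assert(1000 * 1049ULL * BENCH_SIZE >  UINT32_MAX, "REPS=1049 overflows");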
On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
>
> On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> ...
>
> > So I think it should be sufficient to bump this to 800. ...
> >
> > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > 800 * 1000 * 4096 / (11 * 31250) = 9532
> >
> > (10485 - 9532) / 10485 = ~10%
>
> Seems OK to me.  It seems unlikely that super fast machines are going
> to have a 32 kHz backed ktime, and the worst case is that we'll pick
> a slightly sub-optimal xor, I guess.  I assume your goal is to keep
> things fitting in a 32-bit unsigned integer?  Looks like if you use
> 1000 it also fits...
>

Yes, but the larger we make this number, the more time the test will
take on such slow machines. Doing 1000 iterations of 4k on a low-end
machine that only manages 500 MB/s (?) takes a couple of milliseconds,
which is more than it does today when HZ=1000, I think.

Not that 800 vs 1000 makes a great deal of difference in that regard,
just illustrating that there is an upper bound as well.
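Checking that estimate: the 500 MB/s figure is the hypothetical low-end
machine from the message above, and the numbers come out per timing
pass, of which do_xor_speed() runs five per template.

/* User-space arithmetic only: one pass at REPS = 1000 on 500 MB/s. */
#include <stdio.h>

int main(void)
{
        const long long reps = 1000, bench_size = 4096, speed_mbs = 500;

        /* ns = bytes * 1000 / (MB/sec) */
        long long pass_ns = reps * bench_size * 1000 / speed_mbs;

        printf("one pass: %.1f ms, five passes: %.1f ms per template\n",
               pass_ns / 1e6, 5 * pass_ns / 1e6);
        return 0;
}

That gives about 8.2 ms per pass and 41 ms per template, which supports
the point that there is an upper bound on how large REPS should grow.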
Hi,

On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
>
> ...
>
> Yes, but the larger we make this number, the more time the test will
> take on such slow machines. Doing 1000 iterations of 4k on a low-end
> machine that only manages 500 MB/s (?) takes a couple of
> milliseconds, which is more than it does today when HZ=1000, I think.
>
> Not that 800 vs 1000 makes a great deal of difference in that regard,
> just illustrating that there is an upper bound as well.

Would it make sense to use some type of hybrid approach?  I know
getting ktime itself has some overhead, so you don't want to do it in
a tight loop, but maybe calling it every once in a while would be
acceptable, and if it's been more than 500 us then stop early?

-Doug
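For concreteness, the hybrid idea might look something like the
fragment below, written against the do_xor_speed() loop from the patch
(it reuses tmpl, b1, b2, start, diff, BENCH_SIZE and REPS from that
context).  This is a hypothetical sketch only; CHUNK and the 500 us
cutoff are illustrative values, and nothing like this was ever posted
as a patch.

/* Check ktime every CHUNK reps rather than every rep, and stop a
 * pass early once it has run for 500 us.
 */
#define CHUNK           25
#define MAX_PASS_NS     (500 * NSEC_PER_USEC)

        int done = 0;

        start = ktime_get();
        while (done < REPS) {
                int k;

                for (k = 0; k < CHUNK; k++) {
                        mb();   /* prevent loop optimization */
                        tmpl->do_2(BENCH_SIZE, b1, b2);
                        mb();
                }
                done += CHUNK;
                /* bail out so a slow machine doesn't stall boot */
                if (ktime_sub(ktime_get(), start) > MAX_PASS_NS)
                        break;
        }
        diff = ktime_sub(ktime_get(), start);
        /* the speed formula must then scale by 'done', not REPS */

As the reply below notes, the thread settled on a plain fixed REPS
instead of this extra complexity.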
On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> ...
>
> Would it make sense to use some type of hybrid approach?  I know
> getting ktime itself has some overhead, so you don't want to do it in
> a tight loop, but maybe calling it every once in a while would be
> acceptable, and if it's been more than 500 us then stop early?
>

To be honest, I don't think we need complexity like this - if boot
time is critical on such a slow system, you probably won't have XOR
built in, assuming it even makes sense to do software XOR on such a
system.

It is indeed preferable to have a numerator that fits in a u32, and so
1000 would be equally suitable in that regard, but I think I will
stick with 800 if you don't mind.
Hi,

On Thu, Sep 24, 2020 at 11:40 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
>
> ...
>
> To be honest, I don't think we need complexity like this - if boot
> time is critical on such a slow system, you probably won't have XOR
> built in, assuming it even makes sense to do software XOR on such a
> system.
>
> It is indeed preferable to have a numerator that fits in a u32, and
> so 1000 would be equally suitable in that regard, but I think I will
> stick with 800 if you don't mind.

OK, fair enough.

-Doug
diff --git a/crypto/xor.c b/crypto/xor.c
index b42c38343733..23f98b451b69 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
 }
 #endif
 
-#define BENCH_SIZE (PAGE_SIZE)
+#define BENCH_SIZE 4096
+#define REPS 100
 
 static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
        int speed;
-       unsigned long now, j;
-       int i, count, max;
+       int i, j, count;
+       ktime_t min, start, diff;
 
        tmpl->next = template_list;
        template_list = tmpl;
 
        preempt_disable();
 
-       /*
-        * Count the number of XORs done during a whole jiffy, and use
-        * this to calculate the speed of checksumming.  We use a 2-page
-        * allocation to have guaranteed color L1-cache layout.
-        */
-       max = 0;
+       min = (ktime_t)S64_MAX;
        for (i = 0; i < 5; i++) {
-               j = jiffies;
-               count = 0;
-               while ((now = jiffies) == j)
-                       cpu_relax();
-               while (time_before(jiffies, now + 1)) {
+               start = ktime_get();
+               for (j = 0; j < REPS; j++) {
                        mb(); /* prevent loop optimzation */
                        tmpl->do_2(BENCH_SIZE, b1, b2);
                        mb();
                        count++;
                        mb();
                }
-               if (count > max)
-                       max = count;
+               diff = ktime_sub(ktime_get(), start);
+               if (diff < min)
+                       min = diff;
        }
 
        preempt_enable();
 
-       speed = max * (HZ * BENCH_SIZE / 1024);
+       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
+       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
        tmpl->speed = speed;
 
-       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
-              speed / 1000, speed % 1000);
+       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
 
 static int __init
@@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
                if (f->speed > fastest->speed)
                        fastest = f;
 
-       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
-              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
+       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
+              fastest->name, fastest->speed);
 
 #undef xor_speed
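Folding in the feedback that was acked in the thread (REPS bumped to
800, ktime_to_ns() with an unsigned int cast instead of u32, pr_info(),
and dropping the now-unused count variable), a follow-up revision of
do_xor_speed() would plausibly look like the sketch below.  This is a
reconstruction, not the posted v2; the patch above is marked Superseded.

#define BENCH_SIZE      4096
#define REPS            800     /* bumped from 100 per the 32 kHz discussion */

static void __init
do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
{
        int speed;
        int i, j;
        ktime_t min, start, diff;

        tmpl->next = template_list;
        template_list = tmpl;

        preempt_disable();

        min = (ktime_t)S64_MAX;
        for (i = 0; i < 5; i++) {
                start = ktime_get();
                for (j = 0; j < REPS; j++) {
                        mb(); /* prevent loop optimization */
                        tmpl->do_2(BENCH_SIZE, b1, b2);
                        mb();
                }
                diff = ktime_sub(ktime_get(), start);
                if (diff < min)
                        min = diff;
        }

        preempt_enable();

        // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
        speed = (1000 * REPS * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
        tmpl->speed = speed;

        pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);
}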
Currently, we use the jiffies counter as a time source, by staring at
it until a HZ period elapses, and then staring at it again and
performing as many XOR operations as we can until another HZ period
elapses, so that we can calculate the throughput. This takes longer
than necessary, and depends on HZ, which is undesirable, since HZ is
system dependent.

Let's use the ktime interface instead, and use it to time a fixed
number of XOR operations, which can be done much faster, and makes
the time spent depend on the performance level of the system itself,
which is much more reasonable.

On ThunderX2, I get the following results:

Before:

  [72625.956765] xor: measuring software checksum speed
  [72625.993104]    8regs     : 10169.000 MB/sec
  [72626.033099]    32regs    : 12050.000 MB/sec
  [72626.073095]    arm64_neon: 11100.000 MB/sec
  [72626.073097] xor: using function: 32regs (12050.000 MB/sec)

After:

  [ 2503.189696] xor: measuring software checksum speed
  [ 2503.189896]    8regs           : 10556 MB/sec
  [ 2503.190061]    32regs          : 12538 MB/sec
  [ 2503.190250]    arm64_neon      : 11470 MB/sec
  [ 2503.190252] xor: using function: 32regs (12538 MB/sec)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 36 ++++++++------------
 1 file changed, 15 insertions(+), 21 deletions(-)