use strpbrk(3) to search for characters from a given set

Message ID	4140dade-d999-a74a-1f8e-06eedb84ed20@web.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=hy0J=4K=vger.kernel.org=git-owner@kernel.org> To: Git Mailing List <git@vger.kernel.org> From: =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de> Subject: [PATCH] use strpbrk(3) to search for characters from a given set Cc: Junio C Hamano <gitster@pobox.com> Message-ID: <4140dade-d999-a74a-1f8e-06eedb84ed20@web.de> Date: Sat, 22 Feb 2020 19:51:19 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.2 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk
Series	use strpbrk(3) to search for characters from a given set \| expand use strpbrk(3) to search for characters from a given set

René Scharfe Feb. 22, 2020, 6:51 p.m. UTC

We can check if certain characters are present in a string by calling
strchr(3) on each of them, or we can pass them all to a single
strpbrk(3) call.  The latter is shorter, less repetitive and slightly
more efficient, so let's do that instead.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 builtin/show-branch.c              | 2 +-
 compat/mingw.c                     | 2 +-
 mailinfo.c                         | 3 +--
 t/helper/test-windows-named-pipe.c | 2 +-
 4 files changed, 4 insertions(+), 5 deletions(-)

--
2.25.1

Martin Ågren Feb. 23, 2020, 9:01 a.m. UTC | #1

On Sat, 22 Feb 2020 at 19:53, René Scharfe <l.s.r@web.de> wrote:
>
> We can check if certain characters are present in a string by calling
> strchr(3) on each of them, or we can pass them all to a single
> strpbrk(3) call.  The latter is shorter, less repetitive and slightly
> more efficient, so let's do that instead.
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  builtin/show-branch.c              | 2 +-
>  compat/mingw.c                     | 2 +-
>  mailinfo.c                         | 3 +--
>  t/helper/test-windows-named-pipe.c | 2 +-
>  4 files changed, 4 insertions(+), 5 deletions(-)

I failed to identify any other spots, except for some hits in test data
(in t/t4256/1/) which are probably better left alone.

> -       if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) {
> +       if (strpbrk(av, "*?[")) {

Indeed, the `strchr()` calls use the same haystack `av`, as opposed to
`av`/`aw` or `av++`/`av++` or so. All conversions in this patch look
good to me.

Martin

Jeff King Feb. 24, 2020, 6:38 a.m. UTC | #2

On Sat, Feb 22, 2020 at 07:51:19PM +0100, René Scharfe wrote:

> We can check if certain characters are present in a string by calling
> strchr(3) on each of them, or we can pass them all to a single
> strpbrk(3) call.  The latter is shorter, less repetitive and slightly
> more efficient, so let's do that instead.

I think the resulting code is more readable, and this is worth doing on
those grounds alone.

I was curious whether it's always more efficient (tldr, it is; read on
only if you're also curious). The default glibc implementation of
strpbrk() is built on strcspn(). That in turn creates a 256-byte table
with a boolean flag set for each char in the accept string, and then
walks the haystack string doing O(1) lookups in the table for a hit. So
we have to memset() the table to zeroes at the start. Depending on the
length of the input string, that could be more expensive than just
walking the haystack string three times (one for each char in the accept
string).

But of course in practice there's all kinds of weird SSE assembly going
on behind the scenes. So in most of my tests, strpbrk() beats strchr()
handily. The one exception is when the first strchr() matches early in
the string and can short-circuit the rest of them.

The test I used was:

  $ cat foo.c
  #include <string.h>

  int check(const char *s)
  {
  #if USE_STRCHR
  	return strchr(s, '<') || strchr(s, '>') || strchr(s, '@');
  #else
  	return !!strpbrk(s, "@<>");
  #endif
  }

  $ cat bar.c 
  int check(const char *s);

  int main(int argc, const char **argv)
  {
  	int ret;
  	int i;

  	for (i = 0; i < 100000000; i++)
  		ret += check(argv[1]);
  	return ret & 0x7f;
  }

I put it into two separate files to avoid check() being hoisted out of
the loop. There's something a bit interesting, though. If you combine
those two files, gcc _does_ hoist the strpbrk() version, generating
this as the complete code:

	  subq    $8, %rsp
          .cfi_def_cfa_offset 16
          movq    8(%rsi), %rdi
          leaq    .LC0(%rip), %rsi
          call    strpbrk@PLT
          testq   %rax, %rax
          setne   %al
          addq    $8, %rsp
          .cfi_def_cfa_offset 8
          movzbl  %al, %eax
          movl    %eax, %edx
          imull   $99999999, %eax, %eax
          addl    %edx, %eax
          andl    $127, %eax
          ret

So it calls strpbrk() exactly once, then recognizes that the loop is
just adding up the same variable over and over and turns it into a
multiply. No loop at all; very cool (though I am puzzled why it doesn't
multiply by the full loop count; instead if multiplies by one less and
then adds back in one iteration).

But with strchr() it was less willing to do that (curiously, it will if
the function _isn't_ static, but won't if it is).

Anyway, that's all compiler arcana subject to change between versions,
and it's a dumb toy example anyway. But it does seem like the single
call is more likely to get inlined or hoisted by the compiler, which is
a point in its favor.

-Peff

Junio C Hamano Feb. 24, 2020, 5:10 p.m. UTC | #3

René Scharfe <l.s.r@web.de> writes:

> We can check if certain characters are present in a string by calling
> strchr(3) on each of them, or we can pass them all to a single
> strpbrk(3) call.  The latter is shorter, less repetitive and slightly
> more efficient, so let's do that instead.
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  builtin/show-branch.c              | 2 +-
>  compat/mingw.c                     | 2 +-
>  mailinfo.c                         | 3 +--
>  t/helper/test-windows-named-pipe.c | 2 +-
>  4 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/show-branch.c b/builtin/show-branch.c
> index 35d7f51c23..8c90cbb18f 100644
> --- a/builtin/show-branch.c
> +++ b/builtin/show-branch.c
> @@ -536,7 +536,7 @@ static void append_one_rev(const char *av)
>  		append_ref(av, &revkey, 0);
>  		return;
>  	}
> -	if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) {
> +	if (strpbrk(av, "*?[")) {


The changes in the patch obviously look all correct.

I wonder how we can exploit Coccinelle to do this kind of
transformations, though.  Would it be possible to say

 * if we see "strchr(S, C1) || strchr(S, C2)", transform it to
   "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and
 * if we see "strpbrk(S, N) || strchr(S, C)", transform it to
   "strpbrk(S, concat(N, stringify(C))";

and let the tool apply these two rules repeatedly, to catch the
pattern to find any number of needle character in the same haystack?

René Scharfe Feb. 24, 2020, 7:01 p.m. UTC | #4

Am 24.02.20 um 18:10 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> We can check if certain characters are present in a string by calling
>> strchr(3) on each of them, or we can pass them all to a single
>> strpbrk(3) call.  The latter is shorter, less repetitive and slightly
>> more efficient, so let's do that instead.
>>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>> ---
>>  builtin/show-branch.c              | 2 +-
>>  compat/mingw.c                     | 2 +-
>>  mailinfo.c                         | 3 +--
>>  t/helper/test-windows-named-pipe.c | 2 +-
>>  4 files changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/builtin/show-branch.c b/builtin/show-branch.c
>> index 35d7f51c23..8c90cbb18f 100644
>> --- a/builtin/show-branch.c
>> +++ b/builtin/show-branch.c
>> @@ -536,7 +536,7 @@ static void append_one_rev(const char *av)
>>  		append_ref(av, &revkey, 0);
>>  		return;
>>  	}
>> -	if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) {
>> +	if (strpbrk(av, "*?[")) {
>
>
> The changes in the patch obviously look all correct.
>
> I wonder how we can exploit Coccinelle to do this kind of
> transformations, though.  Would it be possible to say
>
>  * if we see "strchr(S, C1) || strchr(S, C2)", transform it to
>    "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and
>  * if we see "strpbrk(S, N) || strchr(S, C)", transform it to
>    "strpbrk(S, concat(N, stringify(C))";
>
> and let the tool apply these two rules repeatedly, to catch the
> pattern to find any number of needle character in the same haystack?

That would be nice.  I briefly considered it, but I only can think of a
silly way to convert char literals to strings (by using one rule for
each possible character value), I don't know how to concatenate strings
in Coccinelle (simply putting them next to each other as in C doesn't
seem to work), and I don't know how to apply a rule recursively to allow
transforming an arbitrarily long chain of strchr() calls. :-/

René

Jeff King Feb. 24, 2020, 7:10 p.m. UTC | #5

On Mon, Feb 24, 2020 at 08:01:36PM +0100, René Scharfe wrote:

> > The changes in the patch obviously look all correct.
> >
> > I wonder how we can exploit Coccinelle to do this kind of
> > transformations, though.  Would it be possible to say
> >
> >  * if we see "strchr(S, C1) || strchr(S, C2)", transform it to
> >    "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and
> >  * if we see "strpbrk(S, N) || strchr(S, C)", transform it to
> >    "strpbrk(S, concat(N, stringify(C))";
> >
> > and let the tool apply these two rules repeatedly, to catch the
> > pattern to find any number of needle character in the same haystack?
> 
> That would be nice.  I briefly considered it, but I only can think of a
> silly way to convert char literals to strings (by using one rule for
> each possible character value), I don't know how to concatenate strings
> in Coccinelle (simply putting them next to each other as in C doesn't
> seem to work), and I don't know how to apply a rule recursively to allow
> transforming an arbitrarily long chain of strchr() calls. :-/

I suspect you could do it with the python scripting interface of
coccinelle. We haven't relied on that so far (since it's technically
optional), but I think it would be worth doing if it makes impossible
things possible (it also would make some existing very slow things much
faster).

As far as the recursive rules, I thought coccinelle would always
re-apply rules to the result. So the two that Junio gave above would
work (each application of the second one would suck up another strchr).
But I also won't be surprised if I'm totally wrong; it wouldn't be the
first time I've been puzzled by coccinelle. :)

All that said, I don't plan to spend time on this. And I wouldn't really
ask you or anyone to unless they find it fun or interesting. It seems
like a lot of poking around for probably not that much benefit.

-Peff

use strpbrk(3) to search for characters from a given set

Commit Message

Comments

Patch