Message ID | 4140dade-d999-a74a-1f8e-06eedb84ed20@web.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | use strpbrk(3) to search for characters from a given set | expand |
On Sat, 22 Feb 2020 at 19:53, René Scharfe <l.s.r@web.de> wrote: > > We can check if certain characters are present in a string by calling > strchr(3) on each of them, or we can pass them all to a single > strpbrk(3) call. The latter is shorter, less repetitive and slightly > more efficient, so let's do that instead. > > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > builtin/show-branch.c | 2 +- > compat/mingw.c | 2 +- > mailinfo.c | 3 +-- > t/helper/test-windows-named-pipe.c | 2 +- > 4 files changed, 4 insertions(+), 5 deletions(-) I failed to identify any other spots, except for some hits in test data (in t/t4256/1/) which are probably better left alone. > - if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) { > + if (strpbrk(av, "*?[")) { Indeed, the `strchr()` calls use the same haystack `av`, as opposed to `av`/`aw` or `av++`/`av++` or so. All conversions in this patch look good to me. Martin
On Sat, Feb 22, 2020 at 07:51:19PM +0100, René Scharfe wrote: > We can check if certain characters are present in a string by calling > strchr(3) on each of them, or we can pass them all to a single > strpbrk(3) call. The latter is shorter, less repetitive and slightly > more efficient, so let's do that instead. I think the resulting code is more readable, and this is worth doing on those grounds alone. I was curious whether it's always more efficient (tldr, it is; read on only if you're also curious). The default glibc implementation of strpbrk() is built on strcspn(). That in turn creates a 256-byte table with a boolean flag set for each char in the accept string, and then walks the haystack string doing O(1) lookups in the table for a hit. So we have to memset() the table to zeroes at the start. Depending on the length of the input string, that could be more expensive than just walking the haystack string three times (one for each char in the accept string). But of course in practice there's all kinds of weird SSE assembly going on behind the scenes. So in most of my tests, strpbrk() beats strchr() handily. The one exception is when the first strchr() matches early in the string and can short-circuit the rest of them. The test I used was: $ cat foo.c #include <string.h> int check(const char *s) { #if USE_STRCHR return strchr(s, '<') || strchr(s, '>') || strchr(s, '@'); #else return !!strpbrk(s, "@<>"); #endif } $ cat bar.c int check(const char *s); int main(int argc, const char **argv) { int ret; int i; for (i = 0; i < 100000000; i++) ret += check(argv[1]); return ret & 0x7f; } I put it into two separate files to avoid check() being hoisted out of the loop. There's something a bit interesting, though. If you combine those two files, gcc _does_ hoist the strpbrk() version, generating this as the complete code: subq $8, %rsp .cfi_def_cfa_offset 16 movq 8(%rsi), %rdi leaq .LC0(%rip), %rsi call strpbrk@PLT testq %rax, %rax setne %al addq $8, %rsp .cfi_def_cfa_offset 8 movzbl %al, %eax movl %eax, %edx imull $99999999, %eax, %eax addl %edx, %eax andl $127, %eax ret So it calls strpbrk() exactly once, then recognizes that the loop is just adding up the same variable over and over and turns it into a multiply. No loop at all; very cool (though I am puzzled why it doesn't multiply by the full loop count; instead if multiplies by one less and then adds back in one iteration). But with strchr() it was less willing to do that (curiously, it will if the function _isn't_ static, but won't if it is). Anyway, that's all compiler arcana subject to change between versions, and it's a dumb toy example anyway. But it does seem like the single call is more likely to get inlined or hoisted by the compiler, which is a point in its favor. -Peff
René Scharfe <l.s.r@web.de> writes: > We can check if certain characters are present in a string by calling > strchr(3) on each of them, or we can pass them all to a single > strpbrk(3) call. The latter is shorter, less repetitive and slightly > more efficient, so let's do that instead. > > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > builtin/show-branch.c | 2 +- > compat/mingw.c | 2 +- > mailinfo.c | 3 +-- > t/helper/test-windows-named-pipe.c | 2 +- > 4 files changed, 4 insertions(+), 5 deletions(-) > > diff --git a/builtin/show-branch.c b/builtin/show-branch.c > index 35d7f51c23..8c90cbb18f 100644 > --- a/builtin/show-branch.c > +++ b/builtin/show-branch.c > @@ -536,7 +536,7 @@ static void append_one_rev(const char *av) > append_ref(av, &revkey, 0); > return; > } > - if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) { > + if (strpbrk(av, "*?[")) { The changes in the patch obviously look all correct. I wonder how we can exploit Coccinelle to do this kind of transformations, though. Would it be possible to say * if we see "strchr(S, C1) || strchr(S, C2)", transform it to "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and * if we see "strpbrk(S, N) || strchr(S, C)", transform it to "strpbrk(S, concat(N, stringify(C))"; and let the tool apply these two rules repeatedly, to catch the pattern to find any number of needle character in the same haystack?
Am 24.02.20 um 18:10 schrieb Junio C Hamano: > René Scharfe <l.s.r@web.de> writes: > >> We can check if certain characters are present in a string by calling >> strchr(3) on each of them, or we can pass them all to a single >> strpbrk(3) call. The latter is shorter, less repetitive and slightly >> more efficient, so let's do that instead. >> >> Signed-off-by: René Scharfe <l.s.r@web.de> >> --- >> builtin/show-branch.c | 2 +- >> compat/mingw.c | 2 +- >> mailinfo.c | 3 +-- >> t/helper/test-windows-named-pipe.c | 2 +- >> 4 files changed, 4 insertions(+), 5 deletions(-) >> >> diff --git a/builtin/show-branch.c b/builtin/show-branch.c >> index 35d7f51c23..8c90cbb18f 100644 >> --- a/builtin/show-branch.c >> +++ b/builtin/show-branch.c >> @@ -536,7 +536,7 @@ static void append_one_rev(const char *av) >> append_ref(av, &revkey, 0); >> return; >> } >> - if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) { >> + if (strpbrk(av, "*?[")) { > > > The changes in the patch obviously look all correct. > > I wonder how we can exploit Coccinelle to do this kind of > transformations, though. Would it be possible to say > > * if we see "strchr(S, C1) || strchr(S, C2)", transform it to > "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and > * if we see "strpbrk(S, N) || strchr(S, C)", transform it to > "strpbrk(S, concat(N, stringify(C))"; > > and let the tool apply these two rules repeatedly, to catch the > pattern to find any number of needle character in the same haystack? That would be nice. I briefly considered it, but I only can think of a silly way to convert char literals to strings (by using one rule for each possible character value), I don't know how to concatenate strings in Coccinelle (simply putting them next to each other as in C doesn't seem to work), and I don't know how to apply a rule recursively to allow transforming an arbitrarily long chain of strchr() calls. :-/ René
On Mon, Feb 24, 2020 at 08:01:36PM +0100, René Scharfe wrote: > > The changes in the patch obviously look all correct. > > > > I wonder how we can exploit Coccinelle to do this kind of > > transformations, though. Would it be possible to say > > > > * if we see "strchr(S, C1) || strchr(S, C2)", transform it to > > "strpbrk(S, concat(stringify(C1),stringify(C2)))"; and > > * if we see "strpbrk(S, N) || strchr(S, C)", transform it to > > "strpbrk(S, concat(N, stringify(C))"; > > > > and let the tool apply these two rules repeatedly, to catch the > > pattern to find any number of needle character in the same haystack? > > That would be nice. I briefly considered it, but I only can think of a > silly way to convert char literals to strings (by using one rule for > each possible character value), I don't know how to concatenate strings > in Coccinelle (simply putting them next to each other as in C doesn't > seem to work), and I don't know how to apply a rule recursively to allow > transforming an arbitrarily long chain of strchr() calls. :-/ I suspect you could do it with the python scripting interface of coccinelle. We haven't relied on that so far (since it's technically optional), but I think it would be worth doing if it makes impossible things possible (it also would make some existing very slow things much faster). As far as the recursive rules, I thought coccinelle would always re-apply rules to the result. So the two that Junio gave above would work (each application of the second one would suck up another strchr). But I also won't be surprised if I'm totally wrong; it wouldn't be the first time I've been puzzled by coccinelle. :) All that said, I don't plan to spend time on this. And I wouldn't really ask you or anyone to unless they find it fun or interesting. It seems like a lot of poking around for probably not that much benefit. -Peff
diff --git a/builtin/show-branch.c b/builtin/show-branch.c index 35d7f51c23..8c90cbb18f 100644 --- a/builtin/show-branch.c +++ b/builtin/show-branch.c @@ -536,7 +536,7 @@ static void append_one_rev(const char *av) append_ref(av, &revkey, 0); return; } - if (strchr(av, '*') || strchr(av, '?') || strchr(av, '[')) { + if (strpbrk(av, "*?[")) { /* glob style match */ int saved_matches = ref_name_cnt; diff --git a/compat/mingw.c b/compat/mingw.c index b5230149db..d14065d60e 100644 --- a/compat/mingw.c +++ b/compat/mingw.c @@ -1245,7 +1245,7 @@ static char *path_lookup(const char *cmd, int exe_only) int len = strlen(cmd); int isexe = len >= 4 && !strcasecmp(cmd+len-4, ".exe"); - if (strchr(cmd, '/') || strchr(cmd, '\\')) + if (strpbrk(cmd, "/\\")) return xstrdup(cmd); path = mingw_getenv("PATH"); diff --git a/mailinfo.c b/mailinfo.c index cf92255515..742fa376ab 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -19,8 +19,7 @@ static void cleanup_space(struct strbuf *sb) static void get_sane_name(struct strbuf *out, struct strbuf *name, struct strbuf *email) { struct strbuf *src = name; - if (name->len < 3 || 60 < name->len || strchr(name->buf, '@') || - strchr(name->buf, '<') || strchr(name->buf, '>')) + if (name->len < 3 || 60 < name->len || strpbrk(name->buf, "@<>")) src = email; else if (name == out) return; diff --git a/t/helper/test-windows-named-pipe.c b/t/helper/test-windows-named-pipe.c index b4b752b01a..ae52183e63 100644 --- a/t/helper/test-windows-named-pipe.c +++ b/t/helper/test-windows-named-pipe.c @@ -19,7 +19,7 @@ int cmd__windows_named_pipe(int argc, const char **argv) if (argc < 2) goto print_usage; filename = argv[1]; - if (strchr(filename, '/') || strchr(filename, '\\')) + if (strpbrk(filename, "/\\")) goto print_usage; strbuf_addf(&pathname, "//./pipe/%s", filename);
We can check if certain characters are present in a string by calling strchr(3) on each of them, or we can pass them all to a single strpbrk(3) call. The latter is shorter, less repetitive and slightly more efficient, so let's do that instead. Signed-off-by: René Scharfe <l.s.r@web.de> --- builtin/show-branch.c | 2 +- compat/mingw.c | 2 +- mailinfo.c | 3 +-- t/helper/test-windows-named-pipe.c | 2 +- 4 files changed, 4 insertions(+), 5 deletions(-) -- 2.25.1