Message ID | pull.1221.v2.git.git.1646708063480.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v2] cat-file: skip expanding default format | expand |
"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index 7b3f42950ec..ab9a49e13a4 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -351,6 +351,14 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d > } > } > > +static int print_default_format(char *buf, int len, struct expand_data *data) > +{ > + return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid), > + data->info.type_name->buf, > + (uintmax_t)*data->info.sizep); > + > +} OK. We want size and type if we were to show the default output out of the object-info API. > /* > * If "pack" is non-NULL, then "offset" is the byte offset within the pack from > * which the object may be accessed (though note that we may also rely on > @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name, > struct packed_git *pack, > off_t offset) > { > + struct strbuf type_name = STRBUF_INIT; > + > + if (!opt->format) > + data->info.type_name = &type_name; And at this point, !opt->format means we would use the default format, so we cannot leave .type_name member NULL. That is OK but puzzling. Why didn't we need this before? If the caller is batch_objects(), there is the "mark_query" call to strbuf_expand() to learn which field in data->info are needed, so it seems that this new code should NOT be necessary. Side note. I briefly wondered if this expand is something you would want to optimize when the default format is used, but this is just "probe just once to ensure various members of data->info are populated, to prepare for showing hundreds of objects in the batch request", so it probably is not worth it. I am guessing that this is for callers that do not come via batch_objects() where the "mark_query" strbuf_expand() is not made? If so, * why is it sufficient to fill .type_name and not .sizep for the default format (i.e. when opt->format is NULL)? * why is it OK not to do anything for non-default format? If no "mark_query" call has been made, we wouldn't be preparing the .type_name field even if the user-supplied format calls for %(objecttype), would we? Looking at the call graph: - batch_object_write() is called by - batch_one_object() - batch_object_cb() - batch_unordered_object() - batch_one_object() is called only by batch_objects() - batch_object_cb() is used only by batch_objects() - batch_unordered_object() is called by - batch_unordered_loose() - batch_unordered_packed() and these two are called only by batch_objects() And the "mark_query" strbuf_expand() to probe which members in expand_data are are necessary is done very early, before any of the calls batch_objects() makes that reach batch_object_write(). OK, so my initial guess that the new "we need .type_name member to point at a strbuf" is because there are some code that bypasses the "mark_query" strbuf_expand() in batch_objects() is totally wrong. Everybody uses the "mark_query" thing. Then why do we need to ask type_name? Going back to the new special case print_default_format() gives us the answer to the question. It expects that data->info already knows the stringified typename in the type_name member. The original slow code path in expand_atom() uses this, instead: } else if (is_atom("objecttype", atom, len)) { if (data->mark_query) data->info.typep = &data->type; else strbuf_addstr(sb, type_name(data->type)); Which makes me wonder: * Is calling type_name(data->type) for many objects a lot less efficient than asking the stringified type_name from the object-info layer? If that is the case, would you gain performance for all cases if you did this instead } else if (is_atom("objecttype", atom, len)) { - if (data->mark_query) - data->info.typep = &data->type; - else - strbuf_addstr(sb, type_name(data->type)); + if (data->mark_query) { + data->info.typep = &data->type; + data->info.type_name = &data->type_name; + } else { + strbuf_addstr(sb, data->type_name); + } in expand_atom()? Side note: I am keeping data->info.typep because a lot of existing code switches on data->type, which is an enum. We may have to keep the strbuf_release() at the end of this function this patch added, to release data->info.type_name, if we go that route, but we wouldn't be dealing with an on-stack type_name in this function. * If it does not make any difference between calling type_name() on our side in expand_atom() or asking object-info API to do so, then would it make more sense to lose the local type_name strbuf and print type_name(data->type) in print_default_format() instead? Other than that, this looks good to me. Thanks.
Hi Junio, On 8 Mar 2022, at 11:59, Junio C Hamano wrote: > "John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> diff --git a/builtin/cat-file.c b/builtin/cat-file.c >> index 7b3f42950ec..ab9a49e13a4 100644 >> --- a/builtin/cat-file.c >> +++ b/builtin/cat-file.c >> @@ -351,6 +351,14 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d >> } >> } >> >> +static int print_default_format(char *buf, int len, struct expand_data *data) >> +{ >> + return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid), >> + data->info.type_name->buf, >> + (uintmax_t)*data->info.sizep); >> + >> +} > > OK. We want size and type if we were to show the default output out > of the object-info API. > >> /* >> * If "pack" is non-NULL, then "offset" is the byte offset within the pack from >> * which the object may be accessed (though note that we may also rely on >> @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name, >> struct packed_git *pack, >> off_t offset) >> { >> + struct strbuf type_name = STRBUF_INIT; >> + >> + if (!opt->format) >> + data->info.type_name = &type_name; > > And at this point, !opt->format means we would use the default > format, so we cannot leave .type_name member NULL. That is OK > but puzzling. Why didn't we need this before? > > If the caller is batch_objects(), there is the "mark_query" call to > strbuf_expand() to learn which field in data->info are needed, so > it seems that this new code should NOT be necessary. > > Side note. I briefly wondered if this expand is something you > would want to optimize when the default format is used, but this > is just "probe just once to ensure various members of data->info > are populated, to prepare for showing hundreds of objects in the > batch request", so it probably is not worth it. > > I am guessing that this is for callers that do not come via > batch_objects() where the "mark_query" strbuf_expand() is not made? > If so, > > * why is it sufficient to fill .type_name and not .sizep for the > default format (i.e. when opt->format is NULL)? > > * why is it OK not to do anything for non-default format? If no > "mark_query" call has been made, we wouldn't be preparing the > .type_name field even if the user-supplied format calls for > %(objecttype), would we? > > Looking at the call graph: > > - batch_object_write() is called by > - batch_one_object() > - batch_object_cb() > - batch_unordered_object() > > - batch_one_object() is called only by batch_objects() > - batch_object_cb() is used only by batch_objects() > > - batch_unordered_object() is called by > - batch_unordered_loose() > - batch_unordered_packed() > and these two are called only by batch_objects() > > And the "mark_query" strbuf_expand() to probe which members in > expand_data are are necessary is done very early, before any of the > calls batch_objects() makes that reach batch_object_write(). > > OK, so my initial guess that the new "we need .type_name member to > point at a strbuf" is because there are some code that bypasses the > "mark_query" strbuf_expand() in batch_objects() is totally wrong. > Everybody uses the "mark_query" thing. Then why do we need to ask > type_name? > > Going back to the new special case print_default_format() gives us > the answer to the question. It expects that data->info already > knows the stringified typename in the type_name member. The > original slow code path in expand_atom() uses this, instead: > > } else if (is_atom("objecttype", atom, len)) { > if (data->mark_query) > data->info.typep = &data->type; > else > strbuf_addstr(sb, type_name(data->type)); Thanks for going through this analysis! so looks like I am relying on oid_object_info_extended() which calls do_oid_object_info_extended(), which calls type_name(co->type) if oi->type_name is not NULL. This is a bit roundabout, so I like what you suggest below of just calling type_name() in print_default_format() directly. > > Which makes me wonder: > > * Is calling type_name(data->type) for many objects a lot less > efficient than asking the stringified type_name from the > object-info layer? I'm not sure, but I imagine that if the # of calls to type_name remain the same, eg: once per object that it wouldn't really matter much where in the stack it happens. Also, I took a look at type_name() in object.c and it's just a lookup in a constant array so that should be pretty fast. > If that is the case, would you gain > performance for all cases if you did this instead > > } else if (is_atom("objecttype", atom, len)) { > - if (data->mark_query) > - data->info.typep = &data->type; > - else > - strbuf_addstr(sb, type_name(data->type)); > + if (data->mark_query) { > + data->info.typep = &data->type; > + data->info.type_name = &data->type_name; > + } else { > + strbuf_addstr(sb, data->type_name); > + } > > in expand_atom()? I don't quite follow here. Would we add a member type_name to expand_data? Also where would the call to type_name() be to get the stringified type_name? Also I'm thinking this approach may not work well with the default format optimization as we would be skipping the strbuf_expand() call altogether when default format is used. > > Side note: I am keeping data->info.typep because a lot of > existing code switches on data->type, which is an enum. > > We may have to keep the strbuf_release() at the end of this > function this patch added, to release data->info.type_name, if we > go that route, but we wouldn't be dealing with an on-stack > type_name in this function. > > * If it does not make any difference between calling type_name() on > our side in expand_atom() or asking object-info API to do so, > then would it make more sense to lose the local type_name strbuf > and print type_name(data->type) in print_default_format() instead? I think this is the most intuitive solution. > > Other than that, this looks good to me. > > Thanks. thanks! John
On Tue, Mar 08, 2022 at 02:54:23AM +0000, John Cai via GitGitGadget wrote: > /* > * If "pack" is non-NULL, then "offset" is the byte offset within the pack from > * which the object may be accessed (though note that we may also rely on > @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name, > struct packed_git *pack, > off_t offset) > { > + struct strbuf type_name = STRBUF_INIT; > + > + if (!opt->format) > + data->info.type_name = &type_name; Hmmm, I'm not quite sure I understand why the extra string buffer is necessary here. When we do the first expansion with the DEFAULT_FORMAT, we set data->info.typep to point at data->type. So by the time we do our actual query (either with `packed_object_info()` or just `oid_object_info_extended()`), we will get the type filled into data->type. And we should be able to use that in print_default_format(), which saves us in a couple of ways: - We don't have to (stack) allocate a string buffer, which then needs to be cleaned up after printing each object. - (More importantly) we can avoid the extra string copy through the intermediate buffer by using type_name() directly. Here's a small patch on top that you could apply, if you're interested: --- >8 --- diff --git a/builtin/cat-file.c b/builtin/cat-file.c index ab9a49e13a..9f55afa75b 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -355,5 +355,5 @@ static int print_default_format(char *buf, int len, struct expand_data *data) { return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid), - data->info.type_name->buf, + type_name(data->type), (uintmax_t)*data->info.sizep); @@ -372,9 +372,4 @@ static void batch_object_write(const char *obj_name, off_t offset) { - struct strbuf type_name = STRBUF_INIT; - - if (!opt->format) - data->info.type_name = &type_name; - if (!data->skip_object_info) { int ret; @@ -391,5 +386,5 @@ static void batch_object_write(const char *obj_name, obj_name ? obj_name : oid_to_hex(&data->oid)); fflush(stdout); - goto cleanup; + return; } } @@ -410,7 +405,4 @@ static void batch_object_write(const char *obj_name, batch_write(opt, "\n", 1); } - -cleanup: - strbuf_release(&type_name); } --- 8< --- On my copy of git.git., it shaves off around ~7ms that we're spending just copying type names back and forth. (without the above) Benchmark 1: git.compile cat-file --batch-check --batch-all-objects Time (mean ± σ): 578.7 ms ± 12.7 ms [User: 553.4 ms, System: 25.2 ms] Range (min … max): 568.1 ms … 600.1 ms 10 runs (with the above) Benchmark 1: git.compile cat-file --batch-check --batch-all-objects Time (mean ± σ): 571.5 ms ± 7.9 ms [User: 544.0 ms, System: 27.3 ms] Range (min … max): 558.8 ms … 583.2 ms 10 runs Thanks, Taylor
Hi Taylor On 8 Mar 2022, at 17:00, Taylor Blau wrote: > On Tue, Mar 08, 2022 at 02:54:23AM +0000, John Cai via GitGitGadget wrote: >> /* >> * If "pack" is non-NULL, then "offset" is the byte offset within the pack from >> * which the object may be accessed (though note that we may also rely on >> @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name, >> struct packed_git *pack, >> off_t offset) >> { >> + struct strbuf type_name = STRBUF_INIT; >> + >> + if (!opt->format) >> + data->info.type_name = &type_name; > > Hmmm, I'm not quite sure I understand why the extra string buffer is > necessary here. When we do the first expansion with the DEFAULT_FORMAT, > we set data->info.typep to point at data->type. > > So by the time we do our actual query (either with > `packed_object_info()` or just `oid_object_info_extended()`), we will > get the type filled into data->type. > > And we should be able to use that in print_default_format(), which saves > us in a couple of ways: > > - We don't have to (stack) allocate a string buffer, which then needs > to be cleaned up after printing each object. > > - (More importantly) we can avoid the extra string copy through the > intermediate buffer by using type_name() directly. Agree with your reasoning here. > > Here's a small patch on top that you could apply, if you're interested: > > --- >8 --- > > > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index ab9a49e13a..9f55afa75b 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -355,5 +355,5 @@ static int print_default_format(char *buf, int len, struct expand_data *data) > { > return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid), > - data->info.type_name->buf, > + type_name(data->type), > (uintmax_t)*data->info.sizep); > > @@ -372,9 +372,4 @@ static void batch_object_write(const char *obj_name, > off_t offset) > { > - struct strbuf type_name = STRBUF_INIT; > - > - if (!opt->format) > - data->info.type_name = &type_name; > - > if (!data->skip_object_info) { > int ret; > @@ -391,5 +386,5 @@ static void batch_object_write(const char *obj_name, > obj_name ? obj_name : oid_to_hex(&data->oid)); > fflush(stdout); > - goto cleanup; > + return; > } > } > @@ -410,7 +405,4 @@ static void batch_object_write(const char *obj_name, > batch_write(opt, "\n", 1); > } > - > -cleanup: > - strbuf_release(&type_name); > } > > --- 8< --- > > On my copy of git.git., it shaves off around ~7ms that we're spending > just copying type names back and forth. > > (without the above) > Benchmark 1: git.compile cat-file --batch-check --batch-all-objects > Time (mean ± σ): 578.7 ms ± 12.7 ms [User: 553.4 ms, System: 25.2 ms] > Range (min … max): 568.1 ms … 600.1 ms 10 runs > > (with the above) > Benchmark 1: git.compile cat-file --batch-check --batch-all-objects > Time (mean ± σ): 571.5 ms ± 7.9 ms [User: 544.0 ms, System: 27.3 ms] > Range (min … max): 558.8 ms … 583.2 ms 10 runs Thanks for this suggestion and the benchmark! This is similar to what Junio suggested, and I do think this is the right thing to do here. > > Thanks, > Taylor Thanks, John
On Tue, Mar 08, 2022 at 05:00:53PM -0500, Taylor Blau wrote: > On my copy of git.git., it shaves off around ~7ms that we're spending > just copying type names back and forth. ...while we're at it, I think we could go a little further and avoid doing the mark_query phase altogether, by doing something like: --- 8< --- diff --git a/builtin/cat-file.c b/builtin/cat-file.c index ab9a49e13a..4b3cfb9e68 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -542,24 +542,30 @@ static int batch_objects(struct batch_options *opt) int save_warning; int retval = 0; - /* - * Expand once with our special mark_query flag, which will prime the - * object_info to be handed to oid_object_info_extended for each - * object. - */ - memset(&data, 0, sizeof(data)); - data.mark_query = 1; - strbuf_expand(&output, - opt->format ? opt->format : DEFAULT_FORMAT, - expand_format, - &data); - data.mark_query = 0; - strbuf_release(&output); if (opt->cmdmode) data.split_on_whitespace = 1; - if (opt->format && !strcmp(opt->format, DEFAULT_FORMAT)) + memset(&data, 0, sizeof(data)); + if (!opt->format || !strcmp(opt->format, DEFAULT_FORMAT)) { + data.info.sizep = &data.size; + data.info.typep = &data.type; + opt->format = NULL; + } else { + /* + * Expand once with our special mark_query flag, which will prime the + * object_info to be handed to oid_object_info_extended for each + * object. + */ + data.mark_query = 1; + strbuf_expand(&output, + opt->format ? opt->format : DEFAULT_FORMAT, + expand_format, + &data); + data.mark_query = 0; + strbuf_release(&output); + } + /* * If we are printing out the object, then always fill in the type, * since we will want to decide whether or not to stream. --- >8 --- ...but in my experiments it doesn't seem to help much. Or, at least, it doesn't obviously help, there's too much noise from run to run for me to see a worthwhile speed-up here. Thanks, Taylor
Hi Taylor, On 8 Mar 2022, at 17:24, Taylor Blau wrote: > On Tue, Mar 08, 2022 at 05:00:53PM -0500, Taylor Blau wrote: >> On my copy of git.git., it shaves off around ~7ms that we're spending >> just copying type names back and forth. > > ...while we're at it, I think we could go a little further and avoid > doing the mark_query phase altogether, by doing something like: > > --- 8< --- > > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index ab9a49e13a..4b3cfb9e68 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -542,24 +542,30 @@ static int batch_objects(struct batch_options *opt) > int save_warning; > int retval = 0; > > - /* > - * Expand once with our special mark_query flag, which will prime the > - * object_info to be handed to oid_object_info_extended for each > - * object. > - */ > - memset(&data, 0, sizeof(data)); > - data.mark_query = 1; > - strbuf_expand(&output, > - opt->format ? opt->format : DEFAULT_FORMAT, > - expand_format, > - &data); > - data.mark_query = 0; > - strbuf_release(&output); > if (opt->cmdmode) > data.split_on_whitespace = 1; > > - if (opt->format && !strcmp(opt->format, DEFAULT_FORMAT)) > + memset(&data, 0, sizeof(data)); > + if (!opt->format || !strcmp(opt->format, DEFAULT_FORMAT)) { > + data.info.sizep = &data.size; > + data.info.typep = &data.type; > + > opt->format = NULL; > + } else { > + /* > + * Expand once with our special mark_query flag, which will prime the > + * object_info to be handed to oid_object_info_extended for each > + * object. > + */ > + data.mark_query = 1; > + strbuf_expand(&output, > + opt->format ? opt->format : DEFAULT_FORMAT, > + expand_format, > + &data); > + data.mark_query = 0; > + strbuf_release(&output); > + } > + > /* > * If we are printing out the object, then always fill in the type, > * since we will want to decide whether or not to stream. > > --- >8 --- > > ...but in my experiments it doesn't seem to help much. Or, at least, it > doesn't obviously help, there's too much noise from run to run for me to > see a worthwhile speed-up here. Yeah I had the same thought. I also didn't see a noticeable difference so I'm on the fence regarding whether or not it's worth it. I'm kind of leaning towards no, since it adds some one-off logic without a clear performance gain. > > Thanks, > Taylor
diff --git a/builtin/cat-file.c b/builtin/cat-file.c index 7b3f42950ec..ab9a49e13a4 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -351,6 +351,14 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d } } +static int print_default_format(char *buf, int len, struct expand_data *data) +{ + return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid), + data->info.type_name->buf, + (uintmax_t)*data->info.sizep); + +} + /* * If "pack" is non-NULL, then "offset" is the byte offset within the pack from * which the object may be accessed (though note that we may also rely on @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name, struct packed_git *pack, off_t offset) { + struct strbuf type_name = STRBUF_INIT; + + if (!opt->format) + data->info.type_name = &type_name; + if (!data->skip_object_info) { int ret; @@ -377,21 +390,31 @@ static void batch_object_write(const char *obj_name, printf("%s missing\n", obj_name ? obj_name : oid_to_hex(&data->oid)); fflush(stdout); - return; + goto cleanup; } } - strbuf_reset(scratch); - strbuf_expand(scratch, opt->format, expand_format, data); - strbuf_addch(scratch, '\n'); - batch_write(opt, scratch->buf, scratch->len); + if (!opt->format) { + char buf[1024]; + int len = print_default_format(buf, 1024, data); + batch_write(opt, buf, len); + } else { + strbuf_reset(scratch); + strbuf_expand(scratch, opt->format, expand_format, data); + strbuf_addch(scratch, '\n'); + batch_write(opt, scratch->buf, scratch->len); + } if (opt->print_contents) { print_object_or_die(opt, data); batch_write(opt, "\n", 1); } + +cleanup: + strbuf_release(&type_name); } + static void batch_one_object(const char *obj_name, struct strbuf *scratch, struct batch_options *opt, @@ -508,6 +531,9 @@ static int batch_unordered_packed(const struct object_id *oid, data); } + +#define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)" + static int batch_objects(struct batch_options *opt) { struct strbuf input = STRBUF_INIT; @@ -516,9 +542,6 @@ static int batch_objects(struct batch_options *opt) int save_warning; int retval = 0; - if (!opt->format) - opt->format = "%(objectname) %(objecttype) %(objectsize)"; - /* * Expand once with our special mark_query flag, which will prime the * object_info to be handed to oid_object_info_extended for each @@ -526,12 +549,17 @@ static int batch_objects(struct batch_options *opt) */ memset(&data, 0, sizeof(data)); data.mark_query = 1; - strbuf_expand(&output, opt->format, expand_format, &data); + strbuf_expand(&output, + opt->format ? opt->format : DEFAULT_FORMAT, + expand_format, + &data); data.mark_query = 0; strbuf_release(&output); if (opt->cmdmode) data.split_on_whitespace = 1; + if (opt->format && !strcmp(opt->format, DEFAULT_FORMAT)) + opt->format = NULL; /* * If we are printing out the object, then always fill in the type, * since we will want to decide whether or not to stream. diff --git a/t/perf/p1006-cat-file.sh b/t/perf/p1006-cat-file.sh new file mode 100755 index 00000000000..e463623f5a3 --- /dev/null +++ b/t/perf/p1006-cat-file.sh @@ -0,0 +1,16 @@ +#!/bin/sh + +test_description='Basic sort performance tests' +. ./perf-lib.sh + +test_perf_large_repo + +test_expect_success 'setup' ' + git rev-list --all >rla +' + +test_perf 'cat-file --batch-check' ' + git cat-file --batch-check <rla +' + +test_done