Message ID | 2a16f11790b79ab452233b6f28acac607c0abd28.1631331139.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: introduce `--write-midx` | expand |
On Fri, Sep 10 2021, Taylor Blau wrote: > + struct strbuf buf = STRBUF_INIT; > + while (strbuf_getline(&buf, stdin) != EOF) { > + string_list_append(to, strbuf_detach(&buf, NULL)); If you strbuf_detach() it... > + struct string_list packs = STRING_LIST_INIT_NODUP; ...init the list with NODUP... > + string_list_clear(&packs, 0); ...and call string_list_clear() you'll leak memory, since the string_list will think that someone else is taking care of this memory. I had some recent musings about how this API is bad, and I've got local WIP patches to fix it, but in the meantime you need to: packs.strdup_strings = 1; Before calling string_list_clear(). I.e. we didn't strdup(), but during free() we pretend that we did, because we did, just not in string_list_append().
On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Sep 10 2021, Taylor Blau wrote: > > > + struct strbuf buf = STRBUF_INIT; > > + while (strbuf_getline(&buf, stdin) != EOF) { > > + string_list_append(to, strbuf_detach(&buf, NULL)); > > If you strbuf_detach() it... > > > > + struct string_list packs = STRING_LIST_INIT_NODUP; > > ...init the list with NODUP... > > > + string_list_clear(&packs, 0); > > ...and call string_list_clear() you'll leak memory, since the > string_list will think that someone else is taking care of this memory. > > I had some recent musings about how this API is bad, and I've got local > WIP patches to fix it, but in the meantime you need to: > > packs.strdup_strings = 1; > > Before calling string_list_clear(). I.e. we didn't strdup(), but during > free() we pretend that we did, because we did, just not in > string_list_append(). Good catch. It's kind of gross, but the result is: --- 8< --- diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 77488b6b7b..e6cab975e3 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev) static void read_packs_from_stdin(struct string_list *to) { - struct strbuf buf = STRBUF_INIT; + struct strbuf buf = STRBUF_INIT_NODUP; while (strbuf_getline(&buf, stdin) != EOF) { string_list_append(to, strbuf_detach(&buf, NULL)); } @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) ret = write_midx_file_only(opts.object_dir, &packs, opts.preferred_pack, opts.flags); + /* + * pretend strings are duplicated to free the memory allocated + * by read_packs_from_stdin() + */ + packs.strdup_strings = 1; string_list_clear(&packs, 0); return ret;
On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote: > Good catch. It's kind of gross, but the result is: > > --- 8< --- > > diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c > index 77488b6b7b..e6cab975e3 100644 > --- a/builtin/multi-pack-index.c > +++ b/builtin/multi-pack-index.c > @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev) > > static void read_packs_from_stdin(struct string_list *to) > { > - struct strbuf buf = STRBUF_INIT; > + struct strbuf buf = STRBUF_INIT_NODUP; Thanks, Taylor
On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote: > On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Before calling string_list_clear(). I.e. we didn't strdup(), but during > > free() we pretend that we did, because we did, just not in > > string_list_append(). > > Good catch. It's kind of gross, but the result is: > > static void read_packs_from_stdin(struct string_list *to) > { > - struct strbuf buf = STRBUF_INIT; > + struct strbuf buf = STRBUF_INIT_NODUP; > while (strbuf_getline(&buf, stdin) != EOF) { > string_list_append(to, strbuf_detach(&buf, NULL)); > } > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) > ret = write_midx_file_only(opts.object_dir, &packs, > opts.preferred_pack, opts.flags); > > + /* > + * pretend strings are duplicated to free the memory allocated > + * by read_packs_from_stdin() > + */ > + packs.strdup_strings = 1; > string_list_clear(&packs, 0); An alternative is to do this: struct strbuf buf = STRBUF_INIT; ... while (...) { string_list_append_nodup(to, strbuf_detach(&buf, NULL)); ... } ... string_list_clear(&packs, 0); That is, use string_list_append_nodup(), which is documented as existing precisely for this sort of use-case: Like string_list_append(), except string is never copied. When list->strdup_strings is set, this function can be used to hand ownership of a malloc()ed string to list without making an extra copy. (I mention this only for completeness but don't care strongly which approach is used.)
On Sat, Sep 11, 2021 at 10:08:44PM -0400, Eric Sunshine wrote: > On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote: > > On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote: > > > Before calling string_list_clear(). I.e. we didn't strdup(), but during > > > free() we pretend that we did, because we did, just not in > > > string_list_append(). > > > > Good catch. It's kind of gross, but the result is: > > > > static void read_packs_from_stdin(struct string_list *to) > > { > > - struct strbuf buf = STRBUF_INIT; > > + struct strbuf buf = STRBUF_INIT_NODUP; > > while (strbuf_getline(&buf, stdin) != EOF) { > > string_list_append(to, strbuf_detach(&buf, NULL)); > > } > > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) > > ret = write_midx_file_only(opts.object_dir, &packs, > > opts.preferred_pack, opts.flags); > > > > + /* > > + * pretend strings are duplicated to free the memory allocated > > + * by read_packs_from_stdin() > > + */ > > + packs.strdup_strings = 1; > > string_list_clear(&packs, 0); > > An alternative is to do this: > > struct strbuf buf = STRBUF_INIT; > ... > while (...) { > string_list_append_nodup(to, strbuf_detach(&buf, NULL)); > ... > } > ... > string_list_clear(&packs, 0); > > That is, use string_list_append_nodup(), which is documented as > existing precisely for this sort of use-case: > > Like string_list_append(), except string is never copied. When > list->strdup_strings is set, this function can be used to hand > ownership of a malloc()ed string to list without making an extra > copy. > > (I mention this only for completeness but don't care strongly which > approach is used.) Thanks for a great suggestion; I do prefer using the existing APIs where possible, and it seems like string_list_append_nodup() was designed exactly for this case. Thanks, Taylor
On Sat, Sep 11 2021, Taylor Blau wrote: > On Sat, Sep 11, 2021 at 10:08:44PM -0400, Eric Sunshine wrote: >> On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote: >> > On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote: >> > > Before calling string_list_clear(). I.e. we didn't strdup(), but during >> > > free() we pretend that we did, because we did, just not in >> > > string_list_append(). >> > >> > Good catch. It's kind of gross, but the result is: >> > >> > static void read_packs_from_stdin(struct string_list *to) >> > { >> > - struct strbuf buf = STRBUF_INIT; >> > + struct strbuf buf = STRBUF_INIT_NODUP; >> > while (strbuf_getline(&buf, stdin) != EOF) { >> > string_list_append(to, strbuf_detach(&buf, NULL)); >> > } >> > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) >> > ret = write_midx_file_only(opts.object_dir, &packs, >> > opts.preferred_pack, opts.flags); >> > >> > + /* >> > + * pretend strings are duplicated to free the memory allocated >> > + * by read_packs_from_stdin() >> > + */ >> > + packs.strdup_strings = 1; >> > string_list_clear(&packs, 0); >> >> An alternative is to do this: >> >> struct strbuf buf = STRBUF_INIT; >> ... >> while (...) { >> string_list_append_nodup(to, strbuf_detach(&buf, NULL)); >> ... >> } >> ... >> string_list_clear(&packs, 0); >> >> That is, use string_list_append_nodup(), which is documented as >> existing precisely for this sort of use-case: >> >> Like string_list_append(), except string is never copied. When >> list->strdup_strings is set, this function can be used to hand >> ownership of a malloc()ed string to list without making an extra >> copy. >> >> (I mention this only for completeness but don't care strongly which >> approach is used.) > > Thanks for a great suggestion; I do prefer using the existing APIs where > possible, and it seems like string_list_append_nodup() was designed > exactly for this case. I think Eric's suggestion is better in this case, maybe all cases. For what it's worth the alternate WIP approach I was hinting at is some version of this: https://lore.kernel.org/git/87bl6kq631.fsf@evledraar.gmail.com/ I might change my mind, but I think ultimately running with move of the string_list_append_nodup() approach Eric points out is probably the right thing to do. I.e. it's the difference between declaring that a string list is "dup'd" upfront, and whenever we insert into it we make sure that's the case one way or another. Then when we clear we don't need to pass any options. It does mean that instead of extra clear functions we need to have *_nodup() versions of all the utility functions for this to work, e.g. in trying to convert this now there's no string_list_insert_nodup(), which would be needed. Or maybe the whole approch of the string_list API is just a dead-end in API design, i.e. it shouldn't have any "dup" mode, just nodup, if you need something dup'd you xstrdup()-it.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Or maybe the whole approch of the string_list API is just a dead-end in > API design, i.e. it shouldn't have any "dup" mode, just nodup, if you > need something dup'd you xstrdup()-it. ;-)
On Sun, Sep 12 2021, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> Or maybe the whole approch of the string_list API is just a dead-end in >> API design, i.e. it shouldn't have any "dup" mode, just nodup, if you >> need something dup'd you xstrdup()-it. > > ;-) I realize in practice that's a bit too much, i.e. it's very useful to be able to use a string_list to account for a bunch of already-allocated memory without strduping everything. It's just unfortunate that you can't look at any code that works with the API and understand whether it does an implicit xstrdup() or not without tracking down its current state / where it got allocated at.
On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote: > > Before calling string_list_clear(). I.e. we didn't strdup(), but during > > free() we pretend that we did, because we did, just not in > > string_list_append(). > > Good catch. It's kind of gross, but the result is: > > --- 8< --- > > diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c > index 77488b6b7b..e6cab975e3 100644 > --- a/builtin/multi-pack-index.c > +++ b/builtin/multi-pack-index.c > @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev) > > static void read_packs_from_stdin(struct string_list *to) > { > - struct strbuf buf = STRBUF_INIT; > + struct strbuf buf = STRBUF_INIT_NODUP; Did you mean to use STRING_LIST_INIT_NODUP on the string-list declaration? > while (strbuf_getline(&buf, stdin) != EOF) { > string_list_append(to, strbuf_detach(&buf, NULL)); > } > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) > ret = write_midx_file_only(opts.object_dir, &packs, > opts.preferred_pack, opts.flags); > > + /* > + * pretend strings are duplicated to free the memory allocated > + * by read_packs_from_stdin() > + */ > + packs.strdup_strings = 1; > string_list_clear(&packs, 0); > > return ret; I think the root of the problem here is the non-idiomatic use of strbuf_getline(). The usual thing (and in fact the thing done by the quite-similar code in read_packs_list_from_stdin() in pack-objects.c) is not to detach, because strbuf_getline() will reset the buffer each time. I.e.: struct string_list to = STRING_LIST_INIT_DUP; ... struct strbuf buf = STRBUF while (strbuf_getline(&buf, stdin) != EOF) string_list_append(to, &buf); strbuf_release(&buf); That avoids any clever string-list allocation games. The number of heap allocations is about the same (one strbuf and N list items, versus N strbufs and 0 list items). There's a little extra copying (from the strbuf into the list items), but the strings themselves are more efficiently allocated (strbuf may over-allocate, and we lock in that choice forever by handing over the string). Not that efficiency probably matters either way for this spot. I'd do it this way because it's simple and idiomatic for our code base. -Peff
On Tue, Sep 14, 2021 at 03:02:38PM -0400, Jeff King wrote: > On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote: > > > > Before calling string_list_clear(). I.e. we didn't strdup(), but during > > > free() we pretend that we did, because we did, just not in > > > string_list_append(). > > > > Good catch. It's kind of gross, but the result is: > > > > --- 8< --- > > > > diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c > > index 77488b6b7b..e6cab975e3 100644 > > --- a/builtin/multi-pack-index.c > > +++ b/builtin/multi-pack-index.c > > @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev) > > > > static void read_packs_from_stdin(struct string_list *to) > > { > > - struct strbuf buf = STRBUF_INIT; > > + struct strbuf buf = STRBUF_INIT_NODUP; > > Did you mean to use STRING_LIST_INIT_NODUP on the string-list > declaration? Yeah, and I thought that I even mentioned it in [1], but I apparently sent the contents of my buffer before I saved it. So, I did notice it at the time, but failed to tell anybody about it! [1]: https://lore.kernel.org/git/YTzZPti%2FasQwZC%2FD@nand.local/ > > while (strbuf_getline(&buf, stdin) != EOF) { > > string_list_append(to, strbuf_detach(&buf, NULL)); > > } > > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) > > ret = write_midx_file_only(opts.object_dir, &packs, > > opts.preferred_pack, opts.flags); > > > > + /* > > + * pretend strings are duplicated to free the memory allocated > > + * by read_packs_from_stdin() > > + */ > > + packs.strdup_strings = 1; > > string_list_clear(&packs, 0); > > > > return ret; > > I think the root of the problem here is the non-idiomatic use of > strbuf_getline(). The usual thing (and in fact the thing done by the > quite-similar code in read_packs_list_from_stdin() in pack-objects.c) > is not to detach, because strbuf_getline() will reset the buffer each > time. I.e.: > > struct string_list to = STRING_LIST_INIT_DUP; > ... > struct strbuf buf = STRBUF > while (strbuf_getline(&buf, stdin) != EOF) > string_list_append(to, &buf); (Assuming that you meant `s/&buf/buf.buf/`, but otherwise this makes sense). There is no need to call strbuf_reset() inside the body of the loop, since strbuf_getline() calls it via strbuf_getwholeline(). > strbuf_release(&buf); ...But we should still be careful to release the memory used by the strbuf at the end (which is safe to do, since the whole point is that we copy each buffer linewise into the string_list). > That avoids any clever string-list allocation games. The number of heap > allocations is about the same (one strbuf and N list items, versus N > strbufs and 0 list items). There's a little extra copying (from the > strbuf into the list items), but the strings themselves are more > efficiently allocated (strbuf may over-allocate, and we lock in > that choice forever by handing over the string). > > Not that efficiency probably matters either way for this spot. I'd do it > this way because it's simple and idiomatic for our code base. More than anything, I'm persuaded by that (though I was quite partial to Eric's suggestion, which I found very clever). Thanks, Taylor
On Tue, Sep 14, 2021 at 7:48 PM Taylor Blau <me@ttaylorr.com> wrote: > On Tue, Sep 14, 2021 at 03:02:38PM -0400, Jeff King wrote: > > I think the root of the problem here is the non-idiomatic use of > > strbuf_getline(). The usual thing (and in fact the thing done by the > > quite-similar code in read_packs_list_from_stdin() in pack-objects.c) > > is not to detach, because strbuf_getline() will reset the buffer each > > time. I.e.: > > > > struct string_list to = STRING_LIST_INIT_DUP; > > ... > > struct strbuf buf = STRBUF > > while (strbuf_getline(&buf, stdin) != EOF) > > string_list_append(to, &buf); > > > That avoids any clever string-list allocation games. The number of heap > > allocations is about the same (one strbuf and N list items, versus N > > strbufs and 0 list items). There's a little extra copying (from the > > strbuf into the list items), but the strings themselves are more > > efficiently allocated (strbuf may over-allocate, and we lock in > > that choice forever by handing over the string). > > > > Not that efficiency probably matters either way for this spot. I'd do it > > this way because it's simple and idiomatic for our code base. > > More than anything, I'm persuaded by that (though I was quite partial to > Eric's suggestion, which I found very clever). For what it's worth, Peff's suggestion seems best of all those presented thus far.
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index a9df3dbd32..009c989ef8 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -45,6 +45,10 @@ write:: --[no-]bitmap:: Control whether or not a multi-pack bitmap is written. + + --stdin-packs:: + Write a multi-pack index containing only the set of + line-delimited pack index basenames provided over stdin. -- verify:: diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 73c0113b48..77488b6b7b 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -47,6 +47,7 @@ static struct opts_multi_pack_index { const char *preferred_pack; unsigned long batch_size; unsigned flags; + int stdin_packs; } opts; static struct option common_opts[] = { @@ -61,6 +62,15 @@ static struct option *add_common_options(struct option *prev) return parse_options_concat(common_opts, prev); } +static void read_packs_from_stdin(struct string_list *to) +{ + struct strbuf buf = STRBUF_INIT; + while (strbuf_getline(&buf, stdin) != EOF) { + string_list_append(to, strbuf_detach(&buf, NULL)); + } + string_list_sort(to); +} + static int cmd_multi_pack_index_write(int argc, const char **argv) { struct option *options; @@ -70,6 +80,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) N_("pack for reuse when computing a multi-pack bitmap")), OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"), MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX), + OPT_BOOL(0, "stdin-packs", &opts.stdin_packs, + N_("write multi-pack index containing only given indexes")), OPT_END(), }; @@ -86,6 +98,20 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) FREE_AND_NULL(options); + if (opts.stdin_packs) { + struct string_list packs = STRING_LIST_INIT_NODUP; + int ret; + + read_packs_from_stdin(&packs); + + ret = write_midx_file_only(opts.object_dir, &packs, + opts.preferred_pack, opts.flags); + + string_list_clear(&packs, 0); + + return ret; + + } return write_midx_file(opts.object_dir, opts.preferred_pack, opts.flags); } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index bb04f0f23b..385f0a3efd 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -168,6 +168,21 @@ test_expect_success 'write midx with two packs' ' compare_results_with_midx "two packs" +test_expect_success 'write midx with --stdin-packs' ' + rm -fr $objdir/pack/multi-pack-index && + + idx="$(find $objdir/pack -name "test-2-*.idx")" && + basename "$idx" >in && + + git multi-pack-index write --stdin-packs <in && + + test-tool read-midx $objdir | grep "\.idx$" >packs && + + test_cmp packs in +' + +compare_results_with_midx "mixed mode (one pack + extra)" + test_expect_success 'write progress off for redirected stderr' ' git multi-pack-index --object-dir=$objdir write 2>err && test_line_count = 0 err
To power a new `--write-midx` mode, `git repack` will want to write a multi-pack index containing a certain set of packs in the repository. This new option will be used by `git repack` to write a MIDX which contains only the packs which will survive after the repack (that is, it will exclude any packs which are about to be deleted). This patch effectively exposes the function implemented in the previous commit via the `git multi-pack-index` builtin. An alternative approach would have been to call that function from the `git repack` builtin directly, but this introduces awkward problems around closing and reopening the object store, so the MIDX will be written out-of-process. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/git-multi-pack-index.txt | 4 ++++ builtin/multi-pack-index.c | 26 ++++++++++++++++++++++++++ t/t5319-multi-pack-index.sh | 15 +++++++++++++++ 3 files changed, 45 insertions(+)