diff mbox series

[1/3] pack-objects: fix handling of multiple --filter options

Message ID 53bffbf4-8308-0dd7-bca5-7c85b8334e05@web.de (mailing list archive)
State Superseded
Headers show
Series pack-objects: fix and simplify --filter handling | expand

Commit Message

René Scharfe Nov. 12, 2022, 10:44 a.m. UTC
Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
leak, 2022-03-28) --filter options given to git pack-objects overrule
earlier ones, letting only the leftmost win and leaking the memory
allocated for earlier ones.  Fix that by only initializing the rev_info
struct once.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 builtin/pack-objects.c                 |  3 ++-
 t/t5317-pack-objects-filter-objects.sh | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

--
2.38.1

Comments

Ævar Arnfjörð Bjarmason Nov. 12, 2022, 11:41 a.m. UTC | #1
On Sat, Nov 12 2022, René Scharfe wrote:

> Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
> leak, 2022-03-28) --filter options given to git pack-objects overrule
> earlier ones, letting only the leftmost win and leaking the memory
> allocated for earlier ones.  Fix that by only initializing the rev_info
> struct once.

If I do e.g. this with SANITIZE=leak:

	echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects --revs --filter=blob:limit=1001 --filter=object:type=blob --stdout  >/dev/null

I see one leak that wasn't there, but two that are gone now. I haven't
looked into it, but I think the commit message should discuss what leaks
are fixed, which remain/are new etc.

> @@ -4158,7 +4158,8 @@ static struct list_objects_filter_options *po_filter_revs_init(void *value)
>  {
>  	struct po_filter_data *data = value;
>
> -	repo_init_revisions(the_repository, &data->revs, NULL);
> +	if (!data->have_revs)
> +		repo_init_revisions(the_repository, &data->revs, NULL);
>  	data->have_revs = 1;
>
>  	return &data->revs.filter;

FWIW as this goes away in your 2/3 I think just squashing the two with a
leak fix would be nice, if...

> diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
> index bb633c9b09..bd8983bb56 100755
> --- a/t/t5317-pack-objects-filter-objects.sh
> +++ b/t/t5317-pack-objects-filter-objects.sh
> @@ -178,6 +178,25 @@ test_expect_success 'verify blob:limit=1001' '
>  	test_cmp expected observed
>  '
>
> +test_expect_success 'verify blob:limit=1001+object:type=blob' '
> +	git -C r2 ls-files -s large.1000 |

Aside: Should do "git >tmp && test_parse... <tmp", we lose the exit code
of "ls-files" here.

> +	test_parse_ls_files_stage_oids |
> +	sort >expected &&
> +
> +	git -C r2 pack-objects --revs --stdout --filter=blob:limit=1001 \
> +		--filter=object:type=blob >filter.pack <<-EOF &&
> +	HEAD
> +	EOF
> +	git -C r2 index-pack ../filter.pack &&
> +
> +	git -C r2 verify-pack -v ../filter.pack >verify_result &&
> +	grep blob verify_result |
> +	parse_verify_pack_blob_oid |

Whereas this one's not a problem, no "git".

> +	sort >observed &&
> +
> +	test_cmp expected observed

Aside: It would be nice if we had a "test_cmp_sort", but some other day...

> +'
> +
>  test_expect_success 'verify blob:limit=10001' '
>  	git -C r2 ls-files -s large.1000 large.10000 |
>  	test_parse_ls_files_stage_oids |

...we can test it, but this test is in a file that's not part of "linux-leaks".

If that one leak I mentioned above can be fixed (or maybe it's not new?)
this could be tested if we put it in a file of its own with
TEST_PASSES_SANITIZE_LEAK=true.
Jeff King Nov. 12, 2022, 4:58 p.m. UTC | #2
On Sat, Nov 12, 2022 at 11:44:00AM +0100, René Scharfe wrote:

> Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
> leak, 2022-03-28) --filter options given to git pack-objects overrule
> earlier ones, letting only the leftmost win and leaking the memory
> allocated for earlier ones.  Fix that by only initializing the rev_info
> struct once.

Yikes. I wondered how we managed to miss this in the tests. Surely we
have some examples that use multiple filters?

I think the answer is that we do in t5616, but in practice they didn't
hit this bug because of the way filter-specs are treated server-sid3.
Even if we say "clone --filter=foo --filter=bar" on the client, the
server upload-pack coalesces that into a single "combine:foo+bar"
string, and that's what its pack-objects sees.

Your test here triggers it by invoking pack-objects directly, which
makes sense. That's a little different from how pack-objects is invoked
by the rest of Git in practice, but it's definitely something that
_should_ work. It's conceivable that somebody outside of Git could be
driving pack-objects directly, but also it's just a trap waiting to
spring for future developers.

-Peff
Taylor Blau Nov. 13, 2022, 5:01 a.m. UTC | #3
On Sat, Nov 12, 2022 at 11:58:44AM -0500, Jeff King wrote:
> On Sat, Nov 12, 2022 at 11:44:00AM +0100, René Scharfe wrote:
>
> > Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
> > leak, 2022-03-28) --filter options given to git pack-objects overrule
> > earlier ones, letting only the leftmost win and leaking the memory
> > allocated for earlier ones.  Fix that by only initializing the rev_info
> > struct once.
>
> Yikes. I wondered how we managed to miss this in the tests. Surely we
> have some examples that use multiple filters?
>
> I think the answer is that we do in t5616, but in practice they didn't
> hit this bug because of the way filter-specs are treated server-sid3.
> Even if we say "clone --filter=foo --filter=bar" on the client, the
> server upload-pack coalesces that into a single "combine:foo+bar"
> string, and that's what its pack-objects sees.

...Or in other words, clients aren't broken here because we always send
"combine" filters when multiple `--filter` arguments are passed to `git
clone`, `git fetch`, etc.

Is that always the case? I *think* so, but it would be nice to know if
there was a way that clients would not send the `combine` filter with >1
filter.

Thanks,
Taylor
Jeff King Nov. 13, 2022, 4:44 p.m. UTC | #4
On Sun, Nov 13, 2022 at 12:01:51AM -0500, Taylor Blau wrote:

> > Yikes. I wondered how we managed to miss this in the tests. Surely we
> > have some examples that use multiple filters?
> >
> > I think the answer is that we do in t5616, but in practice they didn't
> > hit this bug because of the way filter-specs are treated server-sid3.
> > Even if we say "clone --filter=foo --filter=bar" on the client, the
> > server upload-pack coalesces that into a single "combine:foo+bar"
> > string, and that's what its pack-objects sees.
> 
> ...Or in other words, clients aren't broken here because we always send
> "combine" filters when multiple `--filter` arguments are passed to `git
> clone`, `git fetch`, etc.

Right.

> Is that always the case? I *think* so, but it would be nice to know if
> there was a way that clients would not send the `combine` filter with >1
> filter.

I think it's always the case at least with Git's code, because the
client-side send_filter() sends from list_objects_filter_spec(), which
takes the single-string spec of the top-level combine filter. It would
have to explicitly iterate over the sub-filters to come up with anything
else, and I don't think we even have a function for that (we do when
evaluating them, of course, but not for generically iterating or
producing a multi-string set of specs).

And on the server side, even if a client did send multiple "filter"
lines, I think we'd hit list_objects_filter_die_if_populated(), so we'd
complain.

I don't know of any other code paths that could hit this (can you filter
on an outgoing push?). But in the end I'm not sure it even matters. We
should be fixing the bug anyway, and this is all just understanding how
far-reaching its effects are (and the answer seems to be: not very).

-Peff
René Scharfe Nov. 13, 2022, 5:31 p.m. UTC | #5
Am 13.11.22 um 06:01 schrieb Taylor Blau:
> On Sat, Nov 12, 2022 at 11:58:44AM -0500, Jeff King wrote:
>> On Sat, Nov 12, 2022 at 11:44:00AM +0100, René Scharfe wrote:
>>
>>> Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
>>> leak, 2022-03-28) --filter options given to git pack-objects overrule
>>> earlier ones, letting only the leftmost win and leaking the memory
>>> allocated for earlier ones.  Fix that by only initializing the rev_info
>>> struct once.
>>
>> Yikes. I wondered how we managed to miss this in the tests. Surely we
>> have some examples that use multiple filters?
>>
>> I think the answer is that we do in t5616, but in practice they didn't
>> hit this bug because of the way filter-specs are treated server-sid3.
>> Even if we say "clone --filter=foo --filter=bar" on the client, the
>> server upload-pack coalesces that into a single "combine:foo+bar"
>> string, and that's what its pack-objects sees.
>
> ...Or in other words, clients aren't broken here because we always send
> "combine" filters when multiple `--filter` arguments are passed to `git
> clone`, `git fetch`, etc.
>
> Is that always the case? I *think* so, but it would be nice to know if
> there was a way that clients would not send the `combine` filter with >1
> filter.

Searching for pack-objects invocations like this should at least find
all internal cases, right?

   $ git grep -e '"pack-objects"' -e --filter --all-match --name-only
   builtin/pack-objects.c
   bundle.c
   upload-pack.c

git pack-objects handles --filter and doesn't pass it on.  bundle.c and
upload-pack.c only add at most a single --filter option to a strvec.
bundle.c uses list_objects_filter_spec() to get the filter
specification, upload-pack.c uses expand_list_objects_filter_spec().

René
René Scharfe Nov. 13, 2022, 5:31 p.m. UTC | #6
Am 12.11.22 um 12:41 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, Nov 12 2022, René Scharfe wrote:
>
>> Since 5cb28270a1 (pack-objects: lazily set up "struct rev_info", don't
>> leak, 2022-03-28) --filter options given to git pack-objects overrule
>> earlier ones, letting only the leftmost win and leaking the memory
>> allocated for earlier ones.  Fix that by only initializing the rev_info
>> struct once.
>
> If I do e.g. this with SANITIZE=leak:
>
> 	echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects --revs --filter=blob:limit=1001 --filter=object:type=blob --stdout  >/dev/null
>
> I see one leak that wasn't there, but two that are gone now. I haven't
> looked into it, but I think the commit message should discuss what leaks
> are fixed, which remain/are new etc.

The leak is insubstantial; I mentioned it just because of the irony.  It
is caused by initializing an already initialized struct rev_info without
releasing it in between, as mentioned in the commit message.

.filter_data allocated by filter_combine__init() is not released by
filter_combine__init() in list-objects-filter.c.  Plugging that leak
allows your example command to run with SANITIZE=leak.  That is a matter
for a separate patch, though.

>
>> @@ -4158,7 +4158,8 @@ static struct list_objects_filter_options *po_filter_revs_init(void *value)
>>  {
>>  	struct po_filter_data *data = value;
>>
>> -	repo_init_revisions(the_repository, &data->revs, NULL);
>> +	if (!data->have_revs)
>> +		repo_init_revisions(the_repository, &data->revs, NULL);
>>  	data->have_revs = 1;
>>
>>  	return &data->revs.filter;
>
> FWIW as this goes away in your 2/3 I think just squashing the two with a
> leak fix would be nice, if...
>
>> diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
>> index bb633c9b09..bd8983bb56 100755
>> --- a/t/t5317-pack-objects-filter-objects.sh
>> +++ b/t/t5317-pack-objects-filter-objects.sh
>> @@ -178,6 +178,25 @@ test_expect_success 'verify blob:limit=1001' '
>>  	test_cmp expected observed
>>  '
>>
>> +test_expect_success 'verify blob:limit=1001+object:type=blob' '
>> +	git -C r2 ls-files -s large.1000 |
>
> Aside: Should do "git >tmp && test_parse... <tmp", we lose the exit code
> of "ls-files" here.

OK.  Copied that line from the surrounding tests.  They used temporary
files before fb2d0db502 (test-lib-functions: add parsing helpers for
ls-files and ls-tree, 2022-04-04).

>
>> +	test_parse_ls_files_stage_oids |
>> +	sort >expected &&
>> +
>> +	git -C r2 pack-objects --revs --stdout --filter=blob:limit=1001 \
>> +		--filter=object:type=blob >filter.pack <<-EOF &&
>> +	HEAD
>> +	EOF
>> +	git -C r2 index-pack ../filter.pack &&
>> +
>> +	git -C r2 verify-pack -v ../filter.pack >verify_result &&
>> +	grep blob verify_result |
>> +	parse_verify_pack_blob_oid |
>
> Whereas this one's not a problem, no "git".
>
>> +	sort >observed &&
>> +
>> +	test_cmp expected observed
>
> Aside: It would be nice if we had a "test_cmp_sort", but some other day...
>
>> +'
>> +
>>  test_expect_success 'verify blob:limit=10001' '
>>  	git -C r2 ls-files -s large.1000 large.10000 |
>>  	test_parse_ls_files_stage_oids |
>
> ...we can test it, but this test is in a file that's not part of "linux-leaks".
>
> If that one leak I mentioned above can be fixed (or maybe it's not new?)
> this could be tested if we put it in a file of its own with
> TEST_PASSES_SANITIZE_LEAK=true.

Plugging the leak in your example command is not enough to make t5317
leak free.

René
diff mbox series

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 573d0b20b7..c702c09dd4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4158,7 +4158,8 @@  static struct list_objects_filter_options *po_filter_revs_init(void *value)
 {
 	struct po_filter_data *data = value;

-	repo_init_revisions(the_repository, &data->revs, NULL);
+	if (!data->have_revs)
+		repo_init_revisions(the_repository, &data->revs, NULL);
 	data->have_revs = 1;

 	return &data->revs.filter;
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index bb633c9b09..bd8983bb56 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -178,6 +178,25 @@  test_expect_success 'verify blob:limit=1001' '
 	test_cmp expected observed
 '

+test_expect_success 'verify blob:limit=1001+object:type=blob' '
+	git -C r2 ls-files -s large.1000 |
+	test_parse_ls_files_stage_oids |
+	sort >expected &&
+
+	git -C r2 pack-objects --revs --stdout --filter=blob:limit=1001 \
+		--filter=object:type=blob >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+
+	git -C r2 verify-pack -v ../filter.pack >verify_result &&
+	grep blob verify_result |
+	parse_verify_pack_blob_oid |
+	sort >observed &&
+
+	test_cmp expected observed
+'
+
 test_expect_success 'verify blob:limit=10001' '
 	git -C r2 ls-files -s large.1000 large.10000 |
 	test_parse_ls_files_stage_oids |