Message ID | d9e75b24-c351-e226-011d-5a5cc2e1c858@web.de (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Avoid spawning gzip in git archive | expand |
René Scharfe <l.s.r@web.de> writes: > -test_expect_success GZIP 'git archive --format=tar.gz' ' > +test_expect_success 'git archive --format=tar.gz' ' > git archive --format=tar.gz HEAD >j1.tar.gz && > test_cmp_bin j.tgz j1.tar.gz > ' Curiously, this breaks for me. It is understandable if we are not producing byte-for-byte identical output with internal gzip. With the following hack I can force the step pass, so it seems that the two invocations of internal gzip are not emitting identical result for the tar stream taken out of HEAD^{tree} object? diff --git c/t/t5000-tar-tree.sh w/t/t5000-tar-tree.sh index 1a68e89a55..c0a2cb92d4 100755 --- c/t/t5000-tar-tree.sh +++ w/t/t5000-tar-tree.sh @@ -340,14 +340,16 @@ test_expect_success 'only enabled filters are available remotely' ' ' test_expect_success 'git archive --format=tgz' ' - git archive --format=tgz HEAD >j.tgz + git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD >j.tgz ' test_expect_success 'git archive --format=tar.gz' ' - git archive --format=tar.gz HEAD >j1.tar.gz && + git -c tar.tar.gz.command="gzip -cn" archive --format=tar.gz HEAD >j1.tar.gz && test_cmp_bin j.tgz j1.tar.gz ' +exit + test_expect_success 'infer tgz from .tgz filename' ' git archive --output=j2.tgz HEAD && test_cmp_bin j.tgz j2.tgz
Hi Junio, On Mon, 13 Jun 2022, Junio C Hamano wrote: > René Scharfe <l.s.r@web.de> writes: > > > -test_expect_success GZIP 'git archive --format=tar.gz' ' > > +test_expect_success 'git archive --format=tar.gz' ' > > git archive --format=tar.gz HEAD >j1.tar.gz && > > test_cmp_bin j.tgz j1.tar.gz > > ' > > Curiously, this breaks for me. It is understandable if we are not > producing byte-for-byte identical output with internal gzip. Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz` differ like this in my test run: -00000000 1f 8b 08 1a 00 2e ca 09 00 03 04 00 89 45 fc 83 |.............E..| +00000000 1f 8b 08 1a 00 35 2a 10 00 03 04 00 89 45 fc 83 |.....5*......E..| and -00000010 7d fc 00 f1 d0 ec b7 63 8c 30 cc 9b e6 db b6 6d |}......c.0.....m| +00000010 7d fc 00 54 ff ec b7 63 8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m| According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the difference in the first line is the mtime. For reference, this is the version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD`: 00000000 1f 8b 08 00 00 00 00 00 00 03 ec b7 63 8c 30 cc |............c.0.| In other words, `gzip` forces the `mtim` member to all zeros, which makes sense. The recorded mtimes are a bit funny, according to https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17 00:36:32 and 1978-08-05 22:45:36, respectively... And the mtime actually changes all the time. What's even more funny: if I comment out the `deflateSetHeader()`, the mtime header field is left at all-zeros. This is on Ubuntu 18.04 with zlib1g 1:1.2.11.dfsg-0ubuntu2. So I dug in a bit deeper and what do you know, the `deflateHeader()` function is implemented like this (https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565): int ZEXPORT deflateSetHeader (strm, head) z_streamp strm; gz_headerp head; { if (deflateStateCheck(strm) || strm->state->wrap != 2) return Z_STREAM_ERROR; strm->state->gzhead = head; return Z_OK; } Now, the caller is implemented like this: static void tgz_set_os(git_zstream *strm, int os) { #if ZLIB_VERNUM >= 0x1221 struct gz_header_s gzhead = { .os = os }; deflateSetHeader(&strm->z, &gzhead); #endif } The biggest problem is not that the return value of `deflateSetHeader()` is ignored. The biggest problem is that it passes the address of a heap variable to the `deflateSetHeader()` function, which then stores it away in another struct that lives beyond the point when we return from `tgz_set_os()`. In other words, this is the very issue I pointed out as GCC not catching: https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/ The solution is to move the heap variable back into a scope that matches the lifetime of the compression: -- snip -- diff --git a/archive-tar.c b/archive-tar.c index 60669eb7b9c..3d77e0f7509 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data) static const char internal_gzip_command[] = "git archive gzip"; -static void tgz_set_os(git_zstream *strm, int os) -{ -#if ZLIB_VERNUM >= 0x1221 - struct gz_header_s gzhead = { .os = os }; - deflateSetHeader(&strm->z, &gzhead); -#endif -} - static int write_tar_filter_archive(const struct archiver *ar, struct archiver_args *args) { +#if ZLIB_VERNUM >= 0x1221 + struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */ +#endif struct strbuf cmd = STRBUF_INIT; struct child_process filter = CHILD_PROCESS_INIT; int r; @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar, if (!strcmp(ar->filter_command, internal_gzip_command)) { write_block = tgz_write_block; git_deflate_init_gzip(&gzstream, args->compression_level); - tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */ +#if ZLIB_VERNUM >= 0x1221 + if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) + BUG("deflateSetHeader() called too late"); +#endif gzstream.next_out = outbuf; gzstream.avail_out = sizeof(outbuf); -- snap -- With this, the test passes for me. René, would you mind squashing this into your patch series? Thank you, Dscho
Am 14.06.22 um 13:27 schrieb Johannes Schindelin: > Hi Junio, > > On Mon, 13 Jun 2022, Junio C Hamano wrote: > >> René Scharfe <l.s.r@web.de> writes: >> >>> -test_expect_success GZIP 'git archive --format=tar.gz' ' >>> +test_expect_success 'git archive --format=tar.gz' ' >>> git archive --format=tar.gz HEAD >j1.tar.gz && >>> test_cmp_bin j.tgz j1.tar.gz >>> ' >> >> Curiously, this breaks for me. It is understandable if we are not >> producing byte-for-byte identical output with internal gzip. Makes sense in retrospect, there's no reason the output of gzip(1) and zlib would have to be the same exactly. It just happened to be so on my platform, so the tests deceptively passed for me. I think we simply have to drop those that try to compare compressed files made by different tools -- we can still check if their content can be extracted and matches. > Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz` > differ like this in my test run: > > -00000000 1f 8b 08 1a 00 2e ca 09 00 03 04 00 89 45 fc 83 |.............E..| > +00000000 1f 8b 08 1a 00 35 2a 10 00 03 04 00 89 45 fc 83 |.....5*......E..| > > and > > -00000010 7d fc 00 f1 d0 ec b7 63 8c 30 cc 9b e6 db b6 6d |}......c.0.....m| > +00000010 7d fc 00 54 ff ec b7 63 8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m| > > According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the > difference in the first line is the mtime. For reference, this is the > version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz > HEAD`: > > 00000000 1f 8b 08 00 00 00 00 00 00 03 ec b7 63 8c 30 cc |............c.0.| > > In other words, `gzip` forces the `mtim` member to all zeros, which makes > sense. And that's what zlib does as well for me: $ ./git-archive --format=tgz HEAD | hexdump -C | head -1 00000000 1f 8b 08 00 00 00 00 00 00 03 ec bd 59 73 1c 49 |..........?Ys.I| > > The recorded mtimes are a bit funny, according to > https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17 > 00:36:32 and 1978-08-05 22:45:36, respectively... > > And the mtime actually changes all the time. > > What's even more funny: if I comment out the `deflateSetHeader()`, the > mtime header field is left at all-zeros. This is on Ubuntu 18.04 with > zlib1g 1:1.2.11.dfsg-0ubuntu2. > > So I dug in a bit deeper and what do you know, the `deflateHeader()` > function is implemented like this > (https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565): > > int ZEXPORT deflateSetHeader (strm, head) > z_streamp strm; > gz_headerp head; > { > if (deflateStateCheck(strm) || strm->state->wrap != 2) > return Z_STREAM_ERROR; > strm->state->gzhead = head; > return Z_OK; > } > > Now, the caller is implemented like this: > > static void tgz_set_os(git_zstream *strm, int os) > { > #if ZLIB_VERNUM >= 0x1221 > struct gz_header_s gzhead = { .os = os }; > deflateSetHeader(&strm->z, &gzhead); > #endif > } > > The biggest problem is not that the return value of `deflateSetHeader()` > is ignored. The biggest problem is that it passes the address of a heap > variable to the `deflateSetHeader()` function, which then stores it away > in another struct that lives beyond the point when we return from > `tgz_set_os()`. Ah, you mean the address of an automatic variable on the stack, but I get it. D'oh! > > In other words, this is the very issue I pointed out as GCC not catching: > https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/ > > The solution is to move the heap variable back into a scope that matches > the lifetime of the compression: > > -- snip -- > diff --git a/archive-tar.c b/archive-tar.c > index 60669eb7b9c..3d77e0f7509 100644 > --- a/archive-tar.c > +++ b/archive-tar.c > @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data) > > static const char internal_gzip_command[] = "git archive gzip"; > > -static void tgz_set_os(git_zstream *strm, int os) > -{ > -#if ZLIB_VERNUM >= 0x1221 > - struct gz_header_s gzhead = { .os = os }; > - deflateSetHeader(&strm->z, &gzhead); > -#endif > -} > - > static int write_tar_filter_archive(const struct archiver *ar, > struct archiver_args *args) > { > +#if ZLIB_VERNUM >= 0x1221 > + struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */ > +#endif > struct strbuf cmd = STRBUF_INIT; > struct child_process filter = CHILD_PROCESS_INIT; > int r; > @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar, > if (!strcmp(ar->filter_command, internal_gzip_command)) { > write_block = tgz_write_block; > git_deflate_init_gzip(&gzstream, args->compression_level); > - tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */ > +#if ZLIB_VERNUM >= 0x1221 > + if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) > + BUG("deflateSetHeader() called too late"); > +#endif > gzstream.next_out = outbuf; > gzstream.avail_out = sizeof(outbuf); > > -- snap -- Good find, thank you! A shorter solution would be to make gzhead static. > > With this, the test passes for me. > > René, would you mind squashing this into your patch series? > > Thank you, > Dscho
Am 14.06.22 um 17:47 schrieb René Scharfe: > Am 14.06.22 um 13:27 schrieb Johannes Schindelin: >> Hi Junio, >> >> On Mon, 13 Jun 2022, Junio C Hamano wrote: >> >>> René Scharfe <l.s.r@web.de> writes: >>> >>>> -test_expect_success GZIP 'git archive --format=tar.gz' ' >>>> +test_expect_success 'git archive --format=tar.gz' ' >>>> git archive --format=tar.gz HEAD >j1.tar.gz && >>>> test_cmp_bin j.tgz j1.tar.gz >>>> ' >>> >>> Curiously, this breaks for me. It is understandable if we are not >>> producing byte-for-byte identical output with internal gzip. > > Makes sense in retrospect, there's no reason the output of gzip(1) and > zlib would have to be the same exactly. It just happened to be so on my > platform, so the tests deceptively passed for me. I think we simply > have to drop those that try to compare compressed files made by > different tools -- we can still check if their content can be extracted > and matches. I have to take that back, I was confused -- the tests are fine. There are no comparisons between gzip-generated and zlib-generated files. There's just the gzhead use-after-return error that Dscho discovered. René
Hi René, On Tue, 14 Jun 2022, René Scharfe wrote: > Am 14.06.22 um 13:27 schrieb Johannes Schindelin: > > > The solution is to move the heap variable back into a scope that matches > > the lifetime of the compression: > > > > -- snip -- > > diff --git a/archive-tar.c b/archive-tar.c > > index 60669eb7b9c..3d77e0f7509 100644 > > --- a/archive-tar.c > > +++ b/archive-tar.c > > @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data) > > > > static const char internal_gzip_command[] = "git archive gzip"; > > > > -static void tgz_set_os(git_zstream *strm, int os) > > -{ > > -#if ZLIB_VERNUM >= 0x1221 > > - struct gz_header_s gzhead = { .os = os }; > > - deflateSetHeader(&strm->z, &gzhead); > > -#endif > > -} > > - > > static int write_tar_filter_archive(const struct archiver *ar, > > struct archiver_args *args) > > { > > +#if ZLIB_VERNUM >= 0x1221 > > + struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */ > > +#endif > > struct strbuf cmd = STRBUF_INIT; > > struct child_process filter = CHILD_PROCESS_INIT; > > int r; > > @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar, > > if (!strcmp(ar->filter_command, internal_gzip_command)) { > > write_block = tgz_write_block; > > git_deflate_init_gzip(&gzstream, args->compression_level); > > - tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */ > > +#if ZLIB_VERNUM >= 0x1221 > > + if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) > > + BUG("deflateSetHeader() called too late"); > > +#endif > > gzstream.next_out = outbuf; > > gzstream.avail_out = sizeof(outbuf); > > > > -- snap -- > > Good find, thank you! A shorter solution would be to make gzhead static. I should have said that I had considered this, but decided against it because it would introduce yet another issue: it would render the code needlessly un-multi-threadable. And that can be avoided _really_ easily. Ciao, Dscho
Am 14.06.22 um 18:29 schrieb Johannes Schindelin: > Hi René, > > On Tue, 14 Jun 2022, René Scharfe wrote: > >> Am 14.06.22 um 13:27 schrieb Johannes Schindelin: >> >>> The solution is to move the heap variable back into a scope that matches >>> the lifetime of the compression: >>> >>> -- snip -- >>> diff --git a/archive-tar.c b/archive-tar.c >>> index 60669eb7b9c..3d77e0f7509 100644 >>> --- a/archive-tar.c >>> +++ b/archive-tar.c >>> @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data) >>> >>> static const char internal_gzip_command[] = "git archive gzip"; >>> >>> -static void tgz_set_os(git_zstream *strm, int os) >>> -{ >>> -#if ZLIB_VERNUM >= 0x1221 >>> - struct gz_header_s gzhead = { .os = os }; >>> - deflateSetHeader(&strm->z, &gzhead); >>> -#endif >>> -} >>> - >>> static int write_tar_filter_archive(const struct archiver *ar, >>> struct archiver_args *args) >>> { >>> +#if ZLIB_VERNUM >= 0x1221 >>> + struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */ >>> +#endif >>> struct strbuf cmd = STRBUF_INIT; >>> struct child_process filter = CHILD_PROCESS_INIT; >>> int r; >>> @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar, >>> if (!strcmp(ar->filter_command, internal_gzip_command)) { >>> write_block = tgz_write_block; >>> git_deflate_init_gzip(&gzstream, args->compression_level); >>> - tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */ >>> +#if ZLIB_VERNUM >= 0x1221 >>> + if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) >>> + BUG("deflateSetHeader() called too late"); >>> +#endif >>> gzstream.next_out = outbuf; >>> gzstream.avail_out = sizeof(outbuf); >>> >>> -- snap -- >> >> Good find, thank you! A shorter solution would be to make gzhead static. > > I should have said that I had considered this, but decided against it > because it would introduce yet another issue: it would render the code > needlessly un-multi-threadable. And that can be avoided _really_ easily. archive-tar.c (and archive-zip.c) use other static variables, so a static gzhead won't break or block anything in this regard. There was no interest in running it in parallel threads so far AFAIK, and it's hard for me to imagine the usefulness of creating multiple .tgz files at the same time. The doubled ZLIB_VERNUM is unsightly and I'm not sure the BUG check is useful -- I omitted error checking because there is no recurse for us if deflateSetHeader() doesn't work, and on ancient zlib versions we silently continue anyway. But that's all just minor quibbling -- I'll include your changes in the next version, they look fine overall. René
René Scharfe <l.s.r@web.de> writes: > archive-tar.c (and archive-zip.c) use other static variables, so a > static gzhead won't break or block anything in this regard. There was > no interest in running it in parallel threads so far AFAIK, and it's > hard for me to imagine the usefulness of creating multiple .tgz files at > the same time. ;-) FWIW I had exactly the same reaction. If this were a lot isolated piece of helper function, perhaps, but no, reentrancy for this helper function specifically written for this code path is not a very good argument. The code structure the (not very good) argument tried to suport, however, is a good thing to have regardless, and if the code is written in such a way from the beginning, there is no reason to reject it. If it opens the door to a unified way to deal with all the other static global variables (e.g. have a "archiver_state" structure that collects all of them, with this one included, and pass it around, or something), that would be great. Short of that, I do not care too much either way. As this topic is not even in 'next', "once written in one way, it is not worth the code churn to rewrite it in the other way, when these two ways are both reasonable" does not yet apply here, so ... > But that's all just minor quibbling -- I'll include your changes in the > next version, they look fine overall. ... that's fine, too. Thanks, both.
diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt index 5b017c2bdc..9de12896fc 100644 --- a/Documentation/git-archive.txt +++ b/Documentation/git-archive.txt @@ -148,8 +148,8 @@ tar.<format>.command:: format is given. + The "tar.gz" and "tgz" formats are defined automatically and default to -`gzip -cn`. You may override them with custom commands. An internal gzip -implementation can be used by specifying the value `git archive gzip`. +the magic value `git archive gzip`, which invokes an internal +implementation of gzip. You may override them with custom commands. tar.<format>.remote:: If true, enable `<format>` for use by remote clients via diff --git a/archive-tar.c b/archive-tar.c index bf7e321e0e..60669eb7b9 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -528,9 +528,9 @@ void init_tar_archiver(void) int i; register_archiver(&tar_archiver); - tar_filter_config("tar.tgz.command", "gzip -cn", NULL); + tar_filter_config("tar.tgz.command", internal_gzip_command, NULL); tar_filter_config("tar.tgz.remote", "true", NULL); - tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL); + tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL); tar_filter_config("tar.tar.gz.remote", "true", NULL); git_config(git_tar_config, NULL); for (i = 0; i < nr_tar_filters; i++) { diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index 9ac0ec67fe..1a68e89a55 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' ' test_cmp_bin remote.bar config.bar ' -test_expect_success GZIP 'git archive --format=tgz' ' +test_expect_success 'git archive --format=tgz' ' git archive --format=tgz HEAD >j.tgz ' -test_expect_success GZIP 'git archive --format=tar.gz' ' +test_expect_success 'git archive --format=tar.gz' ' git archive --format=tar.gz HEAD >j1.tar.gz && test_cmp_bin j.tgz j1.tar.gz ' -test_expect_success GZIP 'infer tgz from .tgz filename' ' +test_expect_success 'infer tgz from .tgz filename' ' git archive --output=j2.tgz HEAD && test_cmp_bin j.tgz j2.tgz ' -test_expect_success GZIP 'infer tgz from .tar.gz filename' ' +test_expect_success 'infer tgz from .tar.gz filename' ' git archive --output=j3.tar.gz HEAD && test_cmp_bin j.tgz j3.tar.gz ' @@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' ' test_cmp_bin b.tar j.tar ' -test_expect_success GZIP 'remote tar.gz is allowed by default' ' +test_expect_success 'remote tar.gz is allowed by default' ' git archive --remote=. --format=tar.gz HEAD >remote.tar.gz && test_cmp_bin j.tgz remote.tar.gz ' -test_expect_success GZIP 'remote tar.gz can be disabled' ' +test_expect_success 'remote tar.gz can be disabled' ' git config tar.tar.gz.remote false && test_must_fail git archive --remote=. --format=tar.gz HEAD \ >remote.tar.gz ' -test_expect_success 'git archive --format=tgz (internal gzip)' ' - test_config tar.tgz.command "git archive gzip" && - git archive --format=tgz HEAD >internal_gzip.tgz +test_expect_success GZIP 'git archive --format=tgz (external gzip)' ' + test_config tar.tgz.command "gzip -cn" && + git archive --format=tgz HEAD >external_gzip.tgz ' -test_expect_success 'git archive --format=tar.gz (internal gzip)' ' - test_config tar.tar.gz.command "git archive gzip" && - git archive --format=tar.gz HEAD >internal_gzip.tar.gz && - test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz +test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' ' + test_config tar.tar.gz.command "gzip -cn" && + git archive --format=tar.gz HEAD >external_gzip.tar.gz && + test_cmp_bin external_gzip.tgz external_gzip.tar.gz ' -test_expect_success GZIP 'extract tgz file (internal gzip)' ' - gzip -d -c <internal_gzip.tgz >internal_gzip.tar && - test_cmp_bin b.tar internal_gzip.tar +test_expect_success GZIP 'extract tgz file (external gzip)' ' + gzip -d -c <external_gzip.tgz >external_gzip.tar && + test_cmp_bin b.tar external_gzip.tar ' test_expect_success 'archive and :(glob)' '
Drop the dependency on gzip(1) and use our internal implementation to create tar.gz and tgz files. Signed-off-by: René Scharfe <l.s.r@web.de> --- Documentation/git-archive.txt | 4 ++-- archive-tar.c | 4 ++-- t/t5000-tar-tree.sh | 32 ++++++++++++++++---------------- 3 files changed, 20 insertions(+), 20 deletions(-) -- 2.36.1