Message ID | 030d76f52af654470026b0c4b1dfba2b6c996885.1667846164.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | extensions.refFormat and packed-refs v2 file format | expand |
On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <derrickstolee@github.com> > > The previous change allowed skipping the hashing portion of the > hashwrite API, using it instead as a buffered write API. Disabling the > hashwrite can be particularly helpful when the write operation is in a > critical path. > > One such critical path is the writing of the index. This operation is so > critical that the sparse index was created specifically to reduce the > size of the index to make these writes (and reads) faster. > > Following a similar approach to one used in the microsoft/git fork [1], > add a new config option that allows disabling this hashing during the > index write. The cost is that we can no longer validate the contents for > corruption-at-rest using the trailing hash. > > [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 > > While older Git versions will not recognize the null hash as a special > case, the file format itself is still being met in terms of its > structure. Using this null hash will still allow Git operations to > function across older versions. > > The one exception is 'git fsck' which checks the hash of the index file. > Here, we disable this check if the trailing hash is all zeroes. We add a > warning to the config option that this may cause undesirable behavior > with older Git versions. > > As a quick comparison, I tested 'git update-index --force-write' with > and without index.computHash=false on a copy of the Linux kernel > repository. > > Benchmark 1: with hash > Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] > Range (min … max): 34.3 ms … 79.1 ms 82 runs > > Benchmark 2: without hash > Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] > Range (min … max): 16.3 ms … 42.0 ms 69 runs > > Summary > 'without hash' ran > 1.78 ± 0.76 times faster than 'with hash' > > These performance benefits are substantial enough to allow users the > ability to opt-in to this feature, even with the potential confusion > with older 'git fsck' versions. This is impressive and interesting...but an improvement unrelated to this series other than the fact that it builds on some of it. Perhaps pull this patch out? Also, would it make sense to integrate index.computeHash with feature.manyFiles? > > Signed-off-by: Derrick Stolee <derrickstolee@github.com> > --- > Documentation/config/index.txt | 8 ++++++++ > read-cache.c | 22 +++++++++++++++++++++- > t/t1600-index.sh | 8 ++++++++ > 3 files changed, 37 insertions(+), 1 deletion(-) > > diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt > index 75f3a2d1054..709ba72f622 100644 > --- a/Documentation/config/index.txt > +++ b/Documentation/config/index.txt > @@ -30,3 +30,11 @@ index.version:: > Specify the version with which new index files should be > initialized. This does not affect existing repositories. > If `feature.manyFiles` is enabled, then the default is 4. > + > +index.computeHash:: > + When enabled, compute the hash of the index file as it is written > + and store the hash at the end of the content. This is enabled by > + default. > ++ > +If you disable `index.computHash`, then older Git clients may report that > +your index is corrupt during `git fsck`. > diff --git a/read-cache.c b/read-cache.c > index 32024029274..f24d96de4d3 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > git_hash_ctx c; > unsigned char hash[GIT_MAX_RAWSZ]; > int hdr_version; > + int all_zeroes = 1; > + unsigned char *start, *end; > > if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) > return error(_("bad signature 0x%08x"), hdr->hdr_signature); > @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > if (!verify_index_checksum) > return 0; > > + end = (unsigned char *)hdr + size; > + start = end - the_hash_algo->rawsz; > + while (start < end) { > + if (*start != 0) { > + all_zeroes = 0; > + break; > + } > + start++; > + } > + > + if (all_zeroes) > + return 0; > + > the_hash_algo->init_fn(&c); > the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); > the_hash_algo->final_fn(hash, &c); > - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) > + if (!hasheq(hash, end - the_hash_algo->rawsz)) > return error(_("bad index file sha1 signature")); > return 0; > } > @@ -2917,9 +2932,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, > int ieot_entries = 1; > struct index_entry_offset_table *ieot = NULL; > int nr, nr_threads; > + int compute_hash; > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > + if (!git_config_get_maybe_bool("index.computehash", &compute_hash) && > + !compute_hash) > + f->skip_hash = 1; > + > for (i = removed = extended = 0; i < entries; i++) { > if (cache[i]->ce_flags & CE_REMOVE) > removed++; > diff --git a/t/t1600-index.sh b/t/t1600-index.sh > index 010989f90e6..24ab90ca047 100755 > --- a/t/t1600-index.sh > +++ b/t/t1600-index.sh > @@ -103,4 +103,12 @@ test_expect_success 'index version config precedence' ' > test_index_version 0 true 2 2 > ' > > +test_expect_success 'index.computeHash config option' ' > + ( > + rm -f .git/index && > + git -c index.computeHash=false add a && > + git fsck > + ) > +' > + > test_done > -- > gitgitgadget Pretty simple change, though. Very nice. :-)
On 11/11/2022 6:31 PM, Elijah Newren wrote: > On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> From: Derrick Stolee <derrickstolee@github.com> >> >> The previous change allowed skipping the hashing portion of the >> hashwrite API, using it instead as a buffered write API. Disabling the >> hashwrite can be particularly helpful when the write operation is in a >> critical path. >> >> One such critical path is the writing of the index. This operation is so >> critical that the sparse index was created specifically to reduce the >> size of the index to make these writes (and reads) faster. >> >> Following a similar approach to one used in the microsoft/git fork [1], >> add a new config option that allows disabling this hashing during the >> index write. The cost is that we can no longer validate the contents for >> corruption-at-rest using the trailing hash. >> >> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 >> >> While older Git versions will not recognize the null hash as a special >> case, the file format itself is still being met in terms of its >> structure. Using this null hash will still allow Git operations to >> function across older versions. >> >> The one exception is 'git fsck' which checks the hash of the index file. >> Here, we disable this check if the trailing hash is all zeroes. We add a >> warning to the config option that this may cause undesirable behavior >> with older Git versions. >> >> As a quick comparison, I tested 'git update-index --force-write' with >> and without index.computHash=false on a copy of the Linux kernel >> repository. >> >> Benchmark 1: with hash >> Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] >> Range (min … max): 34.3 ms … 79.1 ms 82 runs >> >> Benchmark 2: without hash >> Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] >> Range (min … max): 16.3 ms … 42.0 ms 69 runs >> >> Summary >> 'without hash' ran >> 1.78 ± 0.76 times faster than 'with hash' >> >> These performance benefits are substantial enough to allow users the >> ability to opt-in to this feature, even with the potential confusion >> with older 'git fsck' versions. > > This is impressive and interesting...but an improvement unrelated to > this series other than the fact that it builds on some of it. Perhaps > pull this patch out? While patch 1 is required for the packed-refs work, this one is an easy way to take advantage of it. I'll submit these two patches soon on their own as the rest of the RFC is discussed. > Also, would it make sense to integrate index.computeHash with feature.manyFiles? It would make sense to include in feature.manyFiles and Scalar's recommended config. I expect that it would be good to have the config available in a Git release before updating those configs to include it. Perhaps that is too conservative, though. Thanks, -Stolee
On Mon, Nov 07 2022, Derrick Stolee via GitGitGadget wrote: > Summary > 'without hash' ran > 1.78 ± 0.76 times faster than 'with hash' > > These performance benefits are substantial enough to allow users the > ability to opt-in to this feature, even with the potential confusion > with older 'git fsck' versions. The 0.76 part of that is probably just fs caches etc. screwing things up. I tried it on a ramdisk with CFLAGS=-O3: $ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' -w 1 -r 10 Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write Time (mean ± σ): 13.3 ms ± 0.3 ms [User: 7.1 ms, System: 6.1 ms] Range (min … max): 12.7 ms … 13.6 ms 10 runs Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write Time (mean ± σ): 34.8 ms ± 0.4 ms [User: 28.9 ms, System: 5.8 ms] Range (min … max): 34.2 ms … 35.1 ms 10 runs Summary './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran 2.62 ± 0.07 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write' I also see that if I compile with OPENSSL_SHA1=Y, then: $ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write Time (mean ± σ): 14.0 ms ± 1.3 ms [User: 7.7 ms, System: 6.2 ms] Range (min … max): 13.1 ms … 21.7 ms 206 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write Time (mean ± σ): 21.0 ms ± 1.0 ms [User: 15.0 ms, System: 6.0 ms] Range (min … max): 20.1 ms … 28.4 ms 138 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran 1.50 ± 0.15 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write' Which, FWIW is something worth considering. I.e. when we introduced sha1dc we did so with the "big hammer" of the existing hashing API, which is all or nothing, and we pick the hash when we compile git. But that left a lot of things slower for no good reason, e.g. when we do this hashing of the trailers. So if we could just compile with two implementations, and give users the choice of "use the faster hash when you're not communicating with other git repos" we could make things faster in some cases, without the potential format interop issues. > From: Derrick Stolee <derrickstolee@github.com> > [...] > +index.computeHash:: > + When enabled, compute the hash of the index file as it is written > + and store the hash at the end of the content. This is enabled by > + default. > ++ If we have a boolean option it makes sense to make its name reflect the opt-in nature. So "index.skipHash". Then just say "If enabled", and skip the "this is enabled by default, and then later this code: > + int compute_hash; > [...] > + if (!git_config_get_maybe_bool("index.computehash", &compute_hash) && > + !compute_hash) > + f->skip_hash = 1; Can just become: git_config_get_maybe_bool("index.skipHash", &f->skip_hash); I.e. git_config_get_maybe_bool() leaves the passed-in dest value alone if it doesn't have it in the config, and you only use this "compute_hash" as an inverted version of "skip_hash". > +If you disable `index.computHash`, then older Git clients may report that > +your index is corrupt during `git fsck`. > diff --git a/read-cache.c b/read-cache.c > index 32024029274..f24d96de4d3 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > git_hash_ctx c; > unsigned char hash[GIT_MAX_RAWSZ]; > int hdr_version; > + int all_zeroes = 1; > + unsigned char *start, *end; > > if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) > return error(_("bad signature 0x%08x"), hdr->hdr_signature); > @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > if (!verify_index_checksum) > return 0; > > + end = (unsigned char *)hdr + size; > + start = end - the_hash_algo->rawsz; > + while (start < end) { > + if (*start != 0) { > + all_zeroes = 0; > + break; > + } > + start++; > + } Didn't you just re-invent oidread()? :) Just to narrate my way through this. Before we called verify_hdr() we did: hdr = (const struct cache_header *)mmap; if (verify_hdr(hdr, mmap_size) < 0) So, we mmap()'d the index on disk, and whe "hdr" is the struct version of this data, we then cast that back to an "unsigned char *" here, because we're interested in just the raw bytes. Then we "jump to the end" here, and start iterating over the rawsz at the end, because we're just reading if we have a null_oid(). Then, right after that verify_hdr() call, the veriy next thing we'll do is: oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz); So, maybe I'm missing some subtlety still, and some of this is existing baggage in the pre-image (we used to have the sha1 in the struct, a *long* time ago). But isn't this equivalent?: diff --git a/read-cache.c b/read-cache.c index f24d96de4d3..39b5b8419f5 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1812,13 +1812,14 @@ int verify_index_checksum; /* Allow fsck to force verification of the cache entry order. */ int verify_ce_order; -static int verify_hdr(const struct cache_header *hdr, unsigned long size) +static int verify_hdr(const char *const mmap, const size_t size, + const struct cache_header **hdrp, struct object_id *oid) { + const struct cache_header *hdr = (const struct cache_header *)mmap; git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; - int all_zeroes = 1; - unsigned char *start, *end; + const unsigned char *end = (unsigned char *)mmap + size; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1826,20 +1827,12 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version) return error(_("bad index version %d"), hdr_version); + *hdrp = hdr; + oidread(oid, end - the_hash_algo->rawsz); + if (!verify_index_checksum) return 0; - - end = (unsigned char *)hdr + size; - start = end - the_hash_algo->rawsz; - while (start < end) { - if (*start != 0) { - all_zeroes = 0; - break; - } - start++; - } - - if (all_zeroes) + if (is_null_oid(oid)) return 0; the_hash_algo->init_fn(&c); @@ -2358,11 +2351,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) mmap_os_err()); close(fd); - hdr = (const struct cache_header *)mmap; - if (verify_hdr(hdr, mmap_size) < 0) + if (verify_hdr(mmap, mmap_size, &hdr, &istate->oid) < 0) goto unmap; - - oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz); istate->version = ntohl(hdr->hdr_version); istate->cache_nr = ntohl(hdr->hdr_entries); istate->cache_alloc = alloc_nr(istate->cache_nr); I.e. we just make the verify function be in charge of populating our "oid", which we can do that early, as we'd error out later in the function if it doesn't match. We could avoid the "hdrp" there, but if we're doing the cast it's probably good for readability to just do it once. > +test_expect_success 'index.computeHash config option' ' > + ( > + rm -f .git/index && > + git -c index.computeHash=false add a && > + git fsck > + ) > +' You can skip the subshell here, but for a non-RFC let's leave the test in a nice state for the next test someone adds, so maybe: test_when_finished "rm -rf repo" && git clone . repo && [...] Lastly, on this again: > These performance benefits are substantial enough to allow users the > ability to opt-in to this feature, even with the potential confusion > with older 'git fsck' versions. Isn't an unstated major caveat here that it's not "an older verison", but if you on *your version* set the config to "true" your index doesn't have a hash, so it's persisted until you wipe the index?
diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..709ba72f622 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,11 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.computeHash:: + When enabled, compute the hash of the index file as it is written + and store the hash at the end of the content. This is enabled by + default. ++ +If you disable `index.computHash`, then older Git clients may report that +your index is corrupt during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 32024029274..f24d96de4d3 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + int all_zeroes = 1; + unsigned char *start, *end; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + while (start < end) { + if (*start != 0) { + all_zeroes = 0; + break; + } + start++; + } + + if (all_zeroes) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, end - the_hash_algo->rawsz)) return error(_("bad index file sha1 signature")); return 0; } @@ -2917,9 +2932,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, int ieot_entries = 1; struct index_entry_offset_table *ieot = NULL; int nr, nr_threads; + int compute_hash; f = hashfd(tempfile->fd, tempfile->filename.buf); + if (!git_config_get_maybe_bool("index.computehash", &compute_hash) && + !compute_hash) + f->skip_hash = 1; + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..24ab90ca047 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -103,4 +103,12 @@ test_expect_success 'index version config precedence' ' test_index_version 0 true 2 2 ' +test_expect_success 'index.computeHash config option' ' + ( + rm -f .git/index && + git -c index.computeHash=false add a && + git fsck + ) +' + test_done