diff mbox series

[02/30] read-cache: add index.computeHash config option

Message ID 030d76f52af654470026b0c4b1dfba2b6c996885.1667846164.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series extensions.refFormat and packed-refs v2 file format | expand

Commit Message

Derrick Stolee Nov. 7, 2022, 6:35 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option that allows disabling this hashing during the
index write. The cost is that we can no longer validate the contents for
corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.computHash=false on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt |  8 ++++++++
 read-cache.c                   | 22 +++++++++++++++++++++-
 t/t1600-index.sh               |  8 ++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

Comments

Elijah Newren Nov. 11, 2022, 11:31 p.m. UTC | #1
On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <derrickstolee@github.com>
>
> The previous change allowed skipping the hashing portion of the
> hashwrite API, using it instead as a buffered write API. Disabling the
> hashwrite can be particularly helpful when the write operation is in a
> critical path.
>
> One such critical path is the writing of the index. This operation is so
> critical that the sparse index was created specifically to reduce the
> size of the index to make these writes (and reads) faster.
>
> Following a similar approach to one used in the microsoft/git fork [1],
> add a new config option that allows disabling this hashing during the
> index write. The cost is that we can no longer validate the contents for
> corruption-at-rest using the trailing hash.
>
> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>
> While older Git versions will not recognize the null hash as a special
> case, the file format itself is still being met in terms of its
> structure. Using this null hash will still allow Git operations to
> function across older versions.
>
> The one exception is 'git fsck' which checks the hash of the index file.
> Here, we disable this check if the trailing hash is all zeroes. We add a
> warning to the config option that this may cause undesirable behavior
> with older Git versions.
>
> As a quick comparison, I tested 'git update-index --force-write' with
> and without index.computHash=false on a copy of the Linux kernel
> repository.
>
> Benchmark 1: with hash
>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>
> Benchmark 2: without hash
>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>
> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

This is impressive and interesting...but an improvement unrelated to
this series other than the fact that it builds on some of it.  Perhaps
pull this patch out?

Also, would it make sense to integrate index.computeHash with feature.manyFiles?

>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/config/index.txt |  8 ++++++++
>  read-cache.c                   | 22 +++++++++++++++++++++-
>  t/t1600-index.sh               |  8 ++++++++
>  3 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
> index 75f3a2d1054..709ba72f622 100644
> --- a/Documentation/config/index.txt
> +++ b/Documentation/config/index.txt
> @@ -30,3 +30,11 @@ index.version::
>         Specify the version with which new index files should be
>         initialized.  This does not affect existing repositories.
>         If `feature.manyFiles` is enabled, then the default is 4.
> +
> +index.computeHash::
> +       When enabled, compute the hash of the index file as it is written
> +       and store the hash at the end of the content. This is enabled by
> +       default.
> ++
> +If you disable `index.computHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>         git_hash_ctx c;
>         unsigned char hash[GIT_MAX_RAWSZ];
>         int hdr_version;
> +       int all_zeroes = 1;
> +       unsigned char *start, *end;
>
>         if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>                 return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>         if (!verify_index_checksum)
>                 return 0;
>
> +       end = (unsigned char *)hdr + size;
> +       start = end - the_hash_algo->rawsz;
> +       while (start < end) {
> +               if (*start != 0) {
> +                       all_zeroes = 0;
> +                       break;
> +               }
> +               start++;
> +       }
> +
> +       if (all_zeroes)
> +               return 0;
> +
>         the_hash_algo->init_fn(&c);
>         the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
>         the_hash_algo->final_fn(hash, &c);
> -       if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
> +       if (!hasheq(hash, end - the_hash_algo->rawsz))
>                 return error(_("bad index file sha1 signature"));
>         return 0;
>  }
> @@ -2917,9 +2932,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>         int ieot_entries = 1;
>         struct index_entry_offset_table *ieot = NULL;
>         int nr, nr_threads;
> +       int compute_hash;
>
>         f = hashfd(tempfile->fd, tempfile->filename.buf);
>
> +       if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> +           !compute_hash)
> +               f->skip_hash = 1;
> +
>         for (i = removed = extended = 0; i < entries; i++) {
>                 if (cache[i]->ce_flags & CE_REMOVE)
>                         removed++;
> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
> index 010989f90e6..24ab90ca047 100755
> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh
> @@ -103,4 +103,12 @@ test_expect_success 'index version config precedence' '
>         test_index_version 0 true 2 2
>  '
>
> +test_expect_success 'index.computeHash config option' '
> +       (
> +               rm -f .git/index &&
> +               git -c index.computeHash=false add a &&
> +               git fsck
> +       )
> +'
> +
>  test_done
> --
> gitgitgadget

Pretty simple change, though.  Very nice.  :-)
Derrick Stolee Nov. 14, 2022, 4:30 p.m. UTC | #2
On 11/11/2022 6:31 PM, Elijah Newren wrote:
> On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> The previous change allowed skipping the hashing portion of the
>> hashwrite API, using it instead as a buffered write API. Disabling the
>> hashwrite can be particularly helpful when the write operation is in a
>> critical path.
>>
>> One such critical path is the writing of the index. This operation is so
>> critical that the sparse index was created specifically to reduce the
>> size of the index to make these writes (and reads) faster.
>>
>> Following a similar approach to one used in the microsoft/git fork [1],
>> add a new config option that allows disabling this hashing during the
>> index write. The cost is that we can no longer validate the contents for
>> corruption-at-rest using the trailing hash.
>>
>> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>>
>> While older Git versions will not recognize the null hash as a special
>> case, the file format itself is still being met in terms of its
>> structure. Using this null hash will still allow Git operations to
>> function across older versions.
>>
>> The one exception is 'git fsck' which checks the hash of the index file.
>> Here, we disable this check if the trailing hash is all zeroes. We add a
>> warning to the config option that this may cause undesirable behavior
>> with older Git versions.
>>
>> As a quick comparison, I tested 'git update-index --force-write' with
>> and without index.computHash=false on a copy of the Linux kernel
>> repository.
>>
>> Benchmark 1: with hash
>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>
>> Benchmark 2: without hash
>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>
>> Summary
>>   'without hash' ran
>>     1.78 ± 0.76 times faster than 'with hash'
>>
>> These performance benefits are substantial enough to allow users the
>> ability to opt-in to this feature, even with the potential confusion
>> with older 'git fsck' versions.
> 
> This is impressive and interesting...but an improvement unrelated to
> this series other than the fact that it builds on some of it.  Perhaps
> pull this patch out?

While patch 1 is required for the packed-refs work, this one is an easy
way to take advantage of it. I'll submit these two patches soon on their
own as the rest of the RFC is discussed.

> Also, would it make sense to integrate index.computeHash with feature.manyFiles?

It would make sense to include in feature.manyFiles and Scalar's recommended
config. I expect that it would be good to have the config available in a Git
release before updating those configs to include it. Perhaps that is too
conservative, though.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Nov. 17, 2022, 4:13 p.m. UTC | #3
On Mon, Nov 07 2022, Derrick Stolee via GitGitGadget wrote:

> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

The 0.76 part of that is probably just fs caches etc. screwing things
up. I tried it on a ramdisk with CFLAGS=-O3:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' -w 1 -r 10
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      13.3 ms ±   0.3 ms    [User: 7.1 ms, System: 6.1 ms]
	  Range (min … max):    12.7 ms …  13.6 ms    10 runs
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      34.8 ms ±   0.4 ms    [User: 28.9 ms, System: 5.8 ms]
	  Range (min … max):    34.2 ms …  35.1 ms    10 runs
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    2.62 ± 0.07 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'

I also see that if I compile with OPENSSL_SHA1=Y, then:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' 
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      14.0 ms ±   1.3 ms    [User: 7.7 ms, System: 6.2 ms]
	  Range (min … max):    13.1 ms …  21.7 ms    206 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' 
	options.
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      21.0 ms ±   1.0 ms    [User: 15.0 ms, System: 6.0 ms]
	  Range (min … max):    20.1 ms …  28.4 ms    138 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' 
	options.
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    1.50 ± 0.15 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'

Which, FWIW is something worth considering. I.e. when we introduced
sha1dc we did so with the "big hammer" of the existing hashing API,
which is all or nothing, and we pick the hash when we compile git.

But that left a lot of things slower for no good reason, e.g. when we do
this hashing of the trailers. So if we could just compile with two
implementations, and give users the choice of "use the faster hash when
you're not communicating with other git repos" we could make things
faster in some cases, without the potential format interop issues.

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +index.computeHash::
> +	When enabled, compute the hash of the index file as it is written
> +	and store the hash at the end of the content. This is enabled by
> +	default.
> ++

If we have a boolean option it makes sense to make its name reflect the
opt-in nature. So "index.skipHash". Then just say "If enabled", and skip
the "this is enabled by default, and then later this code:

> +	int compute_hash;
> [...]
> +	if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> +	    !compute_hash)
> +		f->skip_hash = 1;

Can just become:

	git_config_get_maybe_bool("index.skipHash", &f->skip_hash);

I.e. git_config_get_maybe_bool() leaves the passed-in dest value alone
if it doesn't have it in the config, and you only use this
"compute_hash" as an inverted version of "skip_hash".

> +If you disable `index.computHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	git_hash_ctx c;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	int hdr_version;
> +	int all_zeroes = 1;
> +	unsigned char *start, *end;
>  
>  	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>  		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	if (!verify_index_checksum)
>  		return 0;
>  
> +	end = (unsigned char *)hdr + size;
> +	start = end - the_hash_algo->rawsz;
> +	while (start < end) {
> +		if (*start != 0) {
> +			all_zeroes = 0;
> +			break;
> +		}
> +		start++;
> +	}

Didn't you just re-invent oidread()? :)

Just to narrate my way through this. Before we called verify_hdr() we
did:

        hdr = (const struct cache_header *)mmap;
        if (verify_hdr(hdr, mmap_size) < 0)

So, we mmap()'d the index on disk, and whe "hdr" is the struct version
of this data, we then cast that back to an "unsigned char *" here,
because we're interested in just the raw bytes.

Then we "jump to the end" here, and start iterating over the rawsz at
the end, because we're just reading if we have a null_oid().

Then, right after that verify_hdr() call, the veriy next thing we'll do is:

	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);

So, maybe I'm missing some subtlety still, and some of this is existing
baggage in the pre-image (we used to have the sha1 in the struct, a
*long* time ago).

But isn't this equivalent?:
	
	diff --git a/read-cache.c b/read-cache.c
	index f24d96de4d3..39b5b8419f5 100644
	--- a/read-cache.c
	+++ b/read-cache.c
	@@ -1812,13 +1812,14 @@ int verify_index_checksum;
	 /* Allow fsck to force verification of the cache entry order. */
	 int verify_ce_order;
	 
	-static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	+static int verify_hdr(const char *const mmap, const size_t size,
	+		      const struct cache_header **hdrp, struct object_id *oid)
	 {
	+	const struct cache_header *hdr = (const struct cache_header *)mmap;
	 	git_hash_ctx c;
	 	unsigned char hash[GIT_MAX_RAWSZ];
	 	int hdr_version;
	-	int all_zeroes = 1;
	-	unsigned char *start, *end;
	+	const unsigned char *end = (unsigned char *)mmap + size;
	 
	 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
	 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
	@@ -1826,20 +1827,12 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
	 		return error(_("bad index version %d"), hdr_version);
	 
	+	*hdrp = hdr;
	+	oidread(oid, end - the_hash_algo->rawsz);
	+
	 	if (!verify_index_checksum)
	 		return 0;
	-
	-	end = (unsigned char *)hdr + size;
	-	start = end - the_hash_algo->rawsz;
	-	while (start < end) {
	-		if (*start != 0) {
	-			all_zeroes = 0;
	-			break;
	-		}
	-		start++;
	-	}
	-
	-	if (all_zeroes)
	+	if (is_null_oid(oid))
	 		return 0;
	 
	 	the_hash_algo->init_fn(&c);
	@@ -2358,11 +2351,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
	 			mmap_os_err());
	 	close(fd);
	 
	-	hdr = (const struct cache_header *)mmap;
	-	if (verify_hdr(hdr, mmap_size) < 0)
	+	if (verify_hdr(mmap, mmap_size, &hdr, &istate->oid) < 0)
	 		goto unmap;
	-
	-	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);
	 	istate->version = ntohl(hdr->hdr_version);
	 	istate->cache_nr = ntohl(hdr->hdr_entries);
	 	istate->cache_alloc = alloc_nr(istate->cache_nr);

I.e. we just make the verify function be in charge of populating our
"oid", which we can do that early, as we'd error out later in the
function if it doesn't match.

We could avoid the "hdrp" there, but if we're doing the cast it's
probably good for readability to just do it once.

> +test_expect_success 'index.computeHash config option' '
> +	(
> +		rm -f .git/index &&
> +		git -c index.computeHash=false add a &&
> +		git fsck
> +	)
> +'

You can skip the subshell here, but for a non-RFC let's leave the test
in a nice state for the next test someone adds, so maybe:

	test_when_finished "rm -rf repo" &&
	git clone . repo &&
	[...]

Lastly, on this again:

> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

Isn't an unstated major caveat here that it's not "an older verison",
but if you on *your version* set the config to "true" your index doesn't
have a hash, so it's persisted until you wipe the index?
diff mbox series

Patch

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..709ba72f622 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,11 @@  index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.computeHash::
+	When enabled, compute the hash of the index file as it is written
+	and store the hash at the end of the content. This is enabled by
+	default.
++
+If you disable `index.computHash`, then older Git clients may report that
+your index is corrupt during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 32024029274..f24d96de4d3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	int all_zeroes = 1;
+	unsigned char *start, *end;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,23 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	while (start < end) {
+		if (*start != 0) {
+			all_zeroes = 0;
+			break;
+		}
+		start++;
+	}
+
+	if (all_zeroes)
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2917,9 +2932,14 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	int compute_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
+	    !compute_hash)
+		f->skip_hash = 1;
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..24ab90ca047 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -103,4 +103,12 @@  test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
+test_expect_success 'index.computeHash config option' '
+	(
+		rm -f .git/index &&
+		git -c index.computeHash=false add a &&
+		git fsck
+	)
+'
+
 test_done