diff mbox series

[2/4] read-cache: add index.skipHash config option

Message ID 5fb4b5a36ac806f3ee07a614bcb93df2c430507c.1670433958.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Optionally skip hashing index on write

Commit Message

Derrick Stolee Dec. 7, 2022, 5:25 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14).

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.
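
As a rough illustration of the check described above, the sketch below tests whether the last hash-sized run of bytes in a buffer is all zeroes. It is not the code from this patch: `SKETCH_RAWSZ` and `sketch_trailer_is_null()` are hypothetical names, and the 20-byte size assumes SHA-1 rather than consulting `the_hash_algo->rawsz`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the_hash_algo->rawsz; 20 assumes SHA-1. */
#define SKETCH_RAWSZ 20

/*
 * Return 1 if the last SKETCH_RAWSZ bytes of the buffer are all zero,
 * which is how a writer that skipped hashing marks the trailer.
 * Return 0 otherwise, including when the buffer is too short.
 */
static int sketch_trailer_is_null(const unsigned char *buf, size_t size)
{
	static const unsigned char zeros[SKETCH_RAWSZ]; /* zero-initialized */
	const unsigned char *start;

	if (size < SKETCH_RAWSZ)
		return 0; /* no room for a trailer at all */
	start = buf + size - SKETCH_RAWSZ;
	return memcmp(start, zeros, SKETCH_RAWSZ) == 0;
}
```

A reader that sees a null trailer would skip recomputing the hash and accept the file; any other trailer value still goes through the normal comparison.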

As a quick comparison, I tested 'git update-index --force-write' with
and without index.computeHash=false on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'
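
The summary line can be checked by hand from the two means and standard deviations reported above. This assumes the summary is the ratio of the means with first-order propagation of the relative uncertainties, which does reproduce the printed figures:

```latex
\[
\frac{46.3}{26.0} \approx 1.78,
\qquad
1.78 \sqrt{\left(\frac{13.8}{46.3}\right)^{2} + \left(\frac{7.9}{26.0}\right)^{2}}
\approx 1.78 \times 0.426 \approx 0.76
\]
```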

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt |  8 ++++++++
 read-cache.c                   | 14 +++++++++++++-
 t/t1600-index.sh               |  8 ++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

Comments

Eric Sunshine Dec. 7, 2022, 6:59 p.m. UTC | #1
On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> The previous change allowed skipping the hashing portion of the
> hashwrite API, using it instead as a buffered write API. Disabling the
> hashwrite can be particularly helpful when the write operation is in a
> critical path.
>
> One such critical path is the writing of the index. This operation is so
> critical that the sparse index was created specifically to reduce the
> size of the index to make these writes (and reads) faster.
>
> Following a similar approach to one used in the microsoft/git fork [1],
> add a new config option (index.skipHash) that allows disabling this
> hashing during the index write. The cost is that we can no longer
> validate the contents for corruption-at-rest using the trailing hash.
> [...]
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
> diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
> @@ -30,3 +30,11 @@ index.version::
> +index.skipHash::
> +       When enabled, do not compute the trailing hash for the index file.
> +       Instead, write a trailing set of bytes with value zero, indicating
> +       that the computation was skipped.
> ++
> +If you enable `index.skipHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.

This documentation is rather minimal. Given this description, are
readers going to understand the purpose of the option, when they
should use it, what the impact will be, when and why they should avoid
it, etc.?

> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
> @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' '
> +test_expect_success 'index.skipHash config option' '
> +       (
> +               rm -f .git/index &&
> +               git -c index.skipHash=true add a &&
> +               git fsck
> +       )
> +'

What is the purpose of the subshell here?

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 10:25 p.m. UTC | #2
On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> diff --git a/read-cache.c b/read-cache.c
> index 46f5e497b14..fb4d6fb6387 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	git_hash_ctx c;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	int hdr_version;
> +	unsigned char *start, *end;
> +	struct object_id oid;
>  
>  	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>  		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	if (!verify_index_checksum)
>  		return 0;
>  
> +	end = (unsigned char *)hdr + size;
> +	start = end - the_hash_algo->rawsz;
> +	oidread(&oid, start);
> +	if (oideq(&oid, null_oid()))
> +		return 0;

It's good to see this use the existing hashing support, as I suggested
in the RFC comments. Glad it helped.

>  	int ieot_entries = 1;
>  	struct index_entry_offset_table *ieot = NULL;
>  	int nr, nr_threads;
> +	int skip_hash;

You don't need this variable.
>  
>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>  
> +	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
> +		f->skip_hash = skip_hash;

Because this can just be:

	git_config_get_maybe_bool("index.skiphash", &f->skip_hash);

I.e. the config API guarantees that it won't touch the variable if the
key doesn't exist, so no need for the intermediate variable.
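
The pattern being relied on can be sketched with a toy lookup. This is not Git's config API: `sketch_config_entry` and `sketch_config_get_bool()` are hypothetical, and the return convention (0 when found, 1 when missing) is an assumption made for the illustration. The point is only that the destination is written on a hit and left alone on a miss, so the caller's default survives.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair standing in for a parsed config. */
struct sketch_config_entry {
	const char *key;
	int value;
};

/*
 * Look up a boolean key. On a hit, store the value in *dest and
 * return 0. On a miss, return 1 and leave *dest untouched.
 */
static int sketch_config_get_bool(const struct sketch_config_entry *entries,
				  size_t nr, const char *key, int *dest)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!strcmp(entries[i].key, key)) {
			*dest = entries[i].value;
			return 0;
		}
	}
	return 1;
}
```

With a lookup shaped like this, a caller can pass the address of the final destination directly and ignore the return value when the preset default is acceptable.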

In a later commit you convert this to that very API use when moving this
to the repo-settings.

Can we maybe skip straight to that step?

> +test_expect_success 'index.skipHash config option' '
> +	(
> +		rm -f .git/index &&
> +		git -c index.skipHash=true add a &&
> +		git fsck
> +	)

Why the subshell?

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 11:06 p.m. UTC | #3
On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> While older Git versions will not recognize the null hash as a special
> case, the file format itself is still being met in terms of its
> structure. Using this null hash will still allow Git operations to
> function across older versions.

That's good news, but...

> The one exception is 'git fsck' which checks the hash of the index file.
> This used to be a check on every index read, but was split out to just
> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
> 2017-04-14).

...uh, what?

Is there an implied claim here that versions before v2.13.0 don't count
as "older versions"?

I.e. doesn't v2.12.0 hard fail the verification for all index writing?
It's only after v2.13.0 that we do it only for the fsck.

That seems like a rather significant caveat that we should be noting
prominently in the docs added in 4/4.

> As a quick comparison, I tested 'git update-index --force-write' with
> and without index.computeHash=false on a copy of the Linux kernel
> repository.

It took me a bit to see why I was failing to reproduce this, before
finding that it's because you mention index.computeHash here, but it's
index.skipHash now.
>
> Benchmark 1: with hash
>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>
> Benchmark 2: without hash
>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>
> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'

I suggested in
https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
earlier to benchmark this against not-sha1collisiondetection.

I don't think I get HW-accelerated SHA-1 on that box with OPENSSL (how
do I check...?). The results on my main development box are:
	
	$ hyperfine -L g sha1dc,openssl -L v false,true -n '{g} with index.skipHash={v}' './git.{g} -c index.skipHash={v} -C /dev/shm/linux-mem update-index --force-write' -w 5 -r 10
	Benchmark 1: sha1dc with index.skipHash=false
	  Time (mean ± σ):      37.0 ms ±   2.3 ms    [User: 30.8 ms, System: 6.0 ms]
	  Range (min … max):    35.1 ms …  41.4 ms    10 runs
	 
	Benchmark 2: openssl with index.skipHash=false
	  Time (mean ± σ):      21.5 ms ±   0.4 ms    [User: 15.0 ms, System: 6.3 ms]
	  Range (min … max):    20.7 ms …  22.0 ms    10 runs
	 
	Benchmark 3: sha1dc with index.skipHash=true
	  Time (mean ± σ):      13.5 ms ±   0.4 ms    [User: 7.9 ms, System: 5.4 ms]
	  Range (min … max):    13.0 ms …  14.2 ms    10 runs
	 
	Benchmark 4: openssl with index.skipHash=true
	  Time (mean ± σ):      14.2 ms ±   0.3 ms    [User: 9.5 ms, System: 4.6 ms]
	  Range (min … max):    13.6 ms …  14.6 ms    10 runs
	 
	Summary
	  'sha1dc with index.skipHash=true' ran
	    1.05 ± 0.04 times faster than 'openssl with index.skipHash=true'
	    1.60 ± 0.05 times faster than 'openssl with index.skipHash=false'
	    2.75 ± 0.19 times faster than 'sha1dc with index.skipHash=false'

So, curiously it's proportionally much slower for me with the hash
checking, and skipping it with openssl is on the order of the results
you see.

Junio C Hamano Dec. 8, 2022, 12:05 a.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Is there an implied claim here that versions before v2.13.0 don't count
> as "older versions"?
>
> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
> It's only after v2.13.0 that we do it only for the fsck.
>
> That seems like a rather significant caveat that we should be noting
> prominently in the docs added in 4/4.

True enough.

It seems that we only did security releases from 2.30.x track and
upwards for the past year or so, and anything older may not matter
anymore.  Documenting it should be sufficient.

I actually was wondering what impact it would have if we made this
change unconditionally.

Thanks.

Derrick Stolee Dec. 12, 2022, 1:59 p.m. UTC | #5
On 12/7/2022 1:59 PM, Eric Sunshine wrote:
> On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +If you enable `index.skipHash`, then older Git clients may report that
>> +your index is corrupt during `git fsck`.
> 
> This documentation is rather minimal. Given this description, are
> readers going to understand the purpose of the option, when they
> should use it, what the impact will be, when and why they should avoid
> it, etc.?

I will expand this with explicit version numbers for older Git versions.

>> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
>> @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' '
>> +test_expect_success 'index.skipHash config option' '
>> +       (
>> +               rm -f .git/index &&
>> +               git -c index.skipHash=true add a &&
>> +               git fsck
>> +       )
>> +'
> 
> What is the purpose of the subshell here?

I was matching the style of the nearby tests, but they are all
modifying GIT_INDEX_VERSION, which isn't necessary here.

Thanks,
-Stolee

Derrick Stolee Dec. 12, 2022, 2:05 p.m. UTC | #6
On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> While older Git versions will not recognize the null hash as a special
>> case, the file format itself is still being met in terms of its
>> structure. Using this null hash will still allow Git operations to
>> function across older versions.
> 
> That's good news, but...
> 
>> The one exception is 'git fsck' which checks the hash of the index file.
>> This used to be a check on every index read, but was split out to just
>> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
>> 2017-04-14).
> 
> ...uh, what?
> 
> Is there an implied claim here that versions before v2.13.0 don't count
> as "older versions"?
> 
> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
> It's only after v2.13.0 that we do it only for the fsck.
> 
> That seems like a rather significant caveat that we should be noting
> prominently in the docs added in 4/4.

I can add those details.
 
>> As a quick comparison, I tested 'git update-index --force-write' with
>> and without index.computeHash=false on a copy of the Linux kernel
>> repository.
> 
> It took me a bit to see why I was failing to reproduce this, before
> finding that it's because you mention index.computeHash here, but it's
> index.skipHash now.
>>
>> Benchmark 1: with hash
>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>
>> Benchmark 2: without hash
>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>
>> Summary
>>   'without hash' ran
>>     1.78 ± 0.76 times faster than 'with hash'
> 
> I suggested in
> https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
> earlier to benchmark this against not-sha1collisiondetection.

Generally, I'm avoiding that benchmark because sha1dc is here to stay.

If users want to go through the trouble of compiling to use the non-dc
version, then I would expect the difference to be less noticeable, but
still significant. However, I would strongly avoid considering compiling
both into the client by default, letting certain paths use sha1dc and
others using non-dc. Certain secure environments currently only use Git
under exceptions that allow SHA1 for "non-cryptographic" reasons, but
also with the understanding that sha1dc is used as a safety measure.
Adding the non-dc version back in would put that understanding at risk.

Thanks,
-Stolee

Ævar Arnfjörð Bjarmason Dec. 12, 2022, 6:01 p.m. UTC | #7
On Mon, Dec 12 2022, Derrick Stolee wrote:

> On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <derrickstolee@github.com>
>>> [...]
>>> While older Git versions will not recognize the null hash as a special
>>> case, the file format itself is still being met in terms of its
>>> structure. Using this null hash will still allow Git operations to
>>> function across older versions.
>> 
>> That's good news, but...
>> 
>>> The one exception is 'git fsck' which checks the hash of the index file.
>>> This used to be a check on every index read, but was split out to just
>>> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
>>> 2017-04-14).
>> 
>> ...uh, what?
>> 
>> Is there an implied claim here that versions before v2.13.0 don't count
>> as "older versions"?
>> 
>> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
>> It's only after v2.13.0 that we do it only for the fsck.
>> 
>> That seems like a rather significant caveat that we should be noting
>> prominently in the docs added in 4/4.
>
> I can add those details.
>  
>>> As a quick comparison, I tested 'git update-index --force-write' with
>>> and without index.computeHash=false on a copy of the Linux kernel
>>> repository.
>> 
>> It took me a bit to see why I was failing to reproduce this, before
>> finding that it's because you mention index.computeHash here, but it's
>> index.skipHash now.
>>>
>>> Benchmark 1: with hash
>>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>>
>>> Benchmark 2: without hash
>>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>>
>>> Summary
>>>   'without hash' ran
>>>     1.78 ± 0.76 times faster than 'with hash'
>> 
>> I suggested in
>> https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
>> earlier to benchmark this against not-sha1collisiondetection.
>
> Generally, I'm avoiding that benchmark because sha1dc is here to stay.
>
> If users want to go through the trouble of compiling to use the non-dc
> version, then I would expect the difference to be less noticeable, but
> still significant. However, I would strongly avoid considering compiling
> both into the client by default, letting certain paths use sha1dc and
> others using non-dc. Certain secure environments currently only use Git
> under exceptions that allow SHA1 for "non-cryptographic" reasons, but
> also with the understanding that sha1dc is used as a safety measure.
> Adding the non-dc version back in would put that understanding at risk.

Doesn't using a checksum for our own index count as a "non-cryptographic
reason"? I.e. we control the .git/index file, and the context is that
we're checking if bytes we wrote to disk are corrupt since we last saw
them.

Even if hypothetically an attacker could craft files to go into the
index (knowing our envelope) in such a way as to craft a collision
between that index file and some other index file I don't see how that
would give the attacker anything. We'd still have a valid index, and
we'd probably be replacing that crafted index with a new one anyway.

I understand that some organizations have SHA-1 on some naughty list,
and using it again in non-SHA1DC contexts might trigger some audit.

So it wouldn't be something for everyone, and it's orthogonal to the
benefits of a new ref format or index format.

But if we're considering new formats, I think it's worth considering a
non-format change which doesn't get us all of the way to no
checksumming, but more than halfway there.

Maybe we'll still want the "don't do any checksumming", but maybe some
would find that enough (particularly if SHA-1 HW acceleration is
available).

Eric Sunshine Dec. 12, 2022, 6:55 p.m. UTC | #8
On Mon, Dec 12, 2022 at 8:59 AM Derrick Stolee <derrickstolee@github.com> wrote:
> On 12/7/2022 1:59 PM, Eric Sunshine wrote:
> > On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> +index.skipHash::
> >> +       When enabled, do not compute the trailing hash for the index file.
> >> +       Instead, write a trailing set of bytes with value zero, indicating
> >> +       that the computation was skipped.
> >> ++
> >> +If you enable `index.skipHash`, then older Git clients may report that
> >> +your index is corrupt during `git fsck`.
> >
> > This documentation is rather minimal. Given this description, are
> > readers going to understand the purpose of the option, when they
> > should use it, what the impact will be, when and why they should avoid
> > it, etc.?
>
> I will expand this with explicit version numbers for older Git versions.

Okay, but that doesn't address the larger questions I asked. The
documentation, as written, gives no explanation of the purpose of this
option. Since you conceived of the option and implemented it, you
implicitly understand its use-case and repercussions which might arise
from using it, but is the typical reader going to understand all that?
Namely, is the reader going to understand:

* why this option exists
* what problem it is trying to solve
* when to use it
* when not to use it
* what the repercussions are of not computing a hash for the index
* etc.

Are the answers to those questions documented somewhere? If so, then
the documentation for this option should link to that discussion (and
vice-versa). If not, then those questions should be answered by this
documentation.

Patch

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..3ea0962631d 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,11 @@  index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	Instead, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then older Git clients may report that
+your index is corrupt during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..fb4d6fb6387 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2915,9 +2923,13 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	int skip_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
+		f->skip_hash = skip_hash;
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..df07c587e0e 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,14 @@  test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	(
+		rm -f .git/index &&
+		git -c index.skipHash=true add a &&
+		git fsck
+	)
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&