diff mbox series

[1/4] hashfile: allow skipping the hash function

Message ID 40ee8dbaef06f8f4265d12436455279499d7ac01.1670433958.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optionally skip hashing index on write | expand

Commit Message

Derrick Stolee Dec. 7, 2022, 5:25 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 10:13 p.m. UTC | #1
On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> However, hashing the file contents during write comes at a performance
> penalty. It's slower to hash the bytes on their way to the disk than
> without that step. This problem is made worse by the replacement of
> hardware-accelerated SHA1 computations with the software-based sha1dc
> computation.

More on that lack of HW accel later...

> This write cost is significant

Don't you mean hashing cost, or do we also do additional writes if we do
the hashing?

> , and the checksum capability is likely
> not worth that cost for such a short-lived file. The index is rewritten
> frequently and the only time the checksum is checked is during 'git
> fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
> computation.

I didn't know that, and had assumed that we at least checked it on the
full read (and I found this bit of the commit message after writing the
last paragraphs here at the end, so maybe skipping this is fine...).

> [...]
> @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
>  	int fd;
>  
>  	hashflush(f);
> -	the_hash_algo->final_fn(f->buffer, &f->ctx);
> +
> +	if (f->skip_hash)
> +		memset(f->buffer, 0, the_hash_algo->rawsz);

Here you're hardcoding a new version of null_oid(), but we can use it
instead. Perhaps:
	
	diff --git a/csum-file.c b/csum-file.c
	index 3243473c3d7..b54c4f66cbb 100644
	--- a/csum-file.c
	+++ b/csum-file.c
	@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
	 		      enum fsync_component component, unsigned int flags)
	 {
	 	int fd;
	+	const struct object_id *const noid = null_oid();
	 
	 	hashflush(f);
	 
	 	if (f->skip_hash)
	-		memset(f->buffer, 0, the_hash_algo->rawsz);
	+		memcpy(f->buffer, noid, sizeof(*noid));
	 	else
	 		the_hash_algo->final_fn(f->buffer, &f->ctx);
	 
> @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
>  	f->tp = tp;
>  	f->name = name;
>  	f->do_crc = 0;
> +	f->skip_hash = 0;
>  	the_hash_algo->init_fn(&f->ctx);
>  
>  	f->buffer_len = buffer_len;

I think I pointed out in the RFC that we'd be much faster with
non-sha1collisiondetection, and that maybe this would get us partway to
the performance you desired (or maybe we'd think that was a more
acceptable trade-off, as it didn't make the format
backwards-incompatible).

But just from seeing "do_crc" here in the context, did you benchmark it
against that? How does it perform?

There's no place to put that in the index, but we *could* encode it in
the hash, just with a lot of leading zeros.

Maybe that would give us some/most of the performance benefits, with the
benefit of a checksum?

Or maybe not, but I think it's worth exploring & supporting a different
& faster SHA-1 implementation before making (even opt-in) backwards
incompatible format changes for performance reasons, and if even that's
too slow maybe crc32 would be sufficient (but not compatible), but still
safer?
Jeff King Dec. 8, 2022, 7:32 a.m. UTC | #2
On Wed, Dec 07, 2022 at 11:13:15PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +	if (f->skip_hash)
> > +		memset(f->buffer, 0, the_hash_algo->rawsz);
> 
> Here you're hardcoding a new version of null_oid(), but we can use it
> instead. Perhaps:
> 	
> 	diff --git a/csum-file.c b/csum-file.c
> 	index 3243473c3d7..b54c4f66cbb 100644
> 	--- a/csum-file.c
> 	+++ b/csum-file.c
> 	@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
> 	 		      enum fsync_component component, unsigned int flags)
> 	 {
> 	 	int fd;
> 	+	const struct object_id *const noid = null_oid();
> 	 
> 	 	hashflush(f);
> 	 
> 	 	if (f->skip_hash)
> 	-		memset(f->buffer, 0, the_hash_algo->rawsz);
> 	+		memcpy(f->buffer, noid, sizeof(*noid));
> 	 	else
> 	 		the_hash_algo->final_fn(f->buffer, &f->ctx);

I don't think that's quite right; the object_id struct has other stuff
in it.  You'd probably want:

  hashcpy(f->buffer, null_oid());

But I think we already have a helper to just do it directly:

  hashclr(f->buffer);

-Peff
diff mbox series

Patch

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..3243473c3d7 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@  void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@  int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		memset(f->buffer, 0, the_hash_algo->rawsz);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@  void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@  static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@  struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
 };
 
 /* Checkpoint */