Message ID | 40ee8dbaef06f8f4265d12436455279499d7ac01.1670433958.git.gitgitgadget@gmail.com (mailing list archive)
State      | Superseded
Series     | Optionally skip hashing index on write
On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> However, hashing the file contents during write comes at a performance
> penalty. It's slower to hash the bytes on their way to the disk than
> without that step. This problem is made worse by the replacement of
> hardware-accelerated SHA1 computations with the software-based sha1dc
> computation.

More on that lack of HW accel later...

> This write cost is significant

Don't you mean hashing cost, or do we also do additional writes if we
do the hashing?

> , and the checksum capability is likely
> not worth that cost for such a short-lived file. The index is rewritten
> frequently and the only time the checksum is checked is during 'git
> fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
> computation.

I didn't know that, and had assumed that we at least checked it on the
full read (and I found this bit of the commit message after writing the
last paragraphs here at the end, so maybe skipping this is fine...).

> [...]
> @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
>  	int fd;
>  
>  	hashflush(f);
> -	the_hash_algo->final_fn(f->buffer, &f->ctx);
> +
> +	if (f->skip_hash)
> +		memset(f->buffer, 0, the_hash_algo->rawsz);

Here you're hardcoding a new version of null_oid(), but we can use it
instead. Perhaps:

diff --git a/csum-file.c b/csum-file.c
index 3243473c3d7..b54c4f66cbb 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 			enum fsync_component component, unsigned int flags)
 {
 	int fd;
+	const struct object_id *const noid = null_oid();
 
 	hashflush(f);
 
 	if (f->skip_hash)
-		memset(f->buffer, 0, the_hash_algo->rawsz);
+		memcpy(f->buffer, noid, sizeof(*noid));
 	else
 		the_hash_algo->final_fn(f->buffer, &f->ctx);

> @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
>  	f->tp = tp;
>  	f->name = name;
>  	f->do_crc = 0;
> +	f->skip_hash = 0;
>  	the_hash_algo->init_fn(&f->ctx);
>  
>  	f->buffer_len = buffer_len;

I think I pointed out in the RFC that we'd be much faster with
non-sha1collisiondetection, and that maybe this would get us partway to
the performance you desired (or maybe we'd think that was a more
acceptable trade-off, as it didn't make the format
backwards-incompatible).

But just from seeing "do_crc" here in the context, did you benchmark it
against that? How does it perform? There's no place to put that in the
index, but we *could* encode it in the hash, just with a lot of leading
zeros. Maybe that would give us some/most of the performance benefits,
with the benefit of a checksum?

Or maybe not, but I think it's worth exploring & supporting a different
& faster SHA-1 implementation before making (even opt-in) backwards
incompatible format changes for performance reasons, and if even that's
too slow maybe crc32 would be sufficient (but not compatible), but
still safer?
On Wed, Dec 07, 2022 at 11:13:15PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +	if (f->skip_hash)
> > +		memset(f->buffer, 0, the_hash_algo->rawsz);
>
> Here you're hardcoding a new version of null_oid(), but we can use it
> instead. Perhaps:
>
> diff --git a/csum-file.c b/csum-file.c
> index 3243473c3d7..b54c4f66cbb 100644
> --- a/csum-file.c
> +++ b/csum-file.c
> @@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
>  			enum fsync_component component, unsigned int flags)
>  {
>  	int fd;
> +	const struct object_id *const noid = null_oid();
>  
>  	hashflush(f);
>  
>  	if (f->skip_hash)
> -		memset(f->buffer, 0, the_hash_algo->rawsz);
> +		memcpy(f->buffer, noid, sizeof(*noid));
>  	else
>  		the_hash_algo->final_fn(f->buffer, &f->ctx);

I don't think that's quite right; the object_id struct has other stuff
in it. You'd probably want:

  hashcpy(f->buffer, null_oid());

But I think we already have a helper to just do it directly:

  hashclr(f->buffer);

-Peff
diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..3243473c3d7 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		memset(f->buffer, 0, the_hash_algo->rawsz);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
};
 
 /* Checkpoint */