Message ID | pull.914.git.1616608219602.gitgitgadget@gmail.com
---|---
State | New, archived |
Series | csum-file: flush less often
On 3/24/2021 1:50 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Let me walk this back a bit.

> Next, I inspected the buffering code itself, and found an important
> difference. Specifically, every call to hashwrite() was causing a flush
> of the filestream, even if it was a very small write. With many callers
> using helpers like hashwrite_be32() to write integers in network-byte
> order, this was leading to many more file flushes than necessary.

This is incorrect. I misinterpreted the logic inside the loop, and I
later confirmed using trace2 that the number of flushes is the same
between versions.
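(For reference, the helper in question is tiny; it is roughly this, in
csum-file.h:)

static inline void hashwrite_be32(struct hashfile *f, uint32_t data)
{
	/* convert to network byte order, then route the four
	 * bytes through the generic buffered hashwrite() */
	data = htonl(data);
	hashwrite(f, &data, sizeof(data));
}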
So, what happened with my performance tests?

> As for performance, I focused on known commands that spend a significant
> amount of time writing through the hashfile API, especially if using
> small buffers as in hashwrite_be32(). 'git multi-pack-index write' was
> an excellent example (deleting the multi-pack-index file between runs)
> and demonstrated this performance change in the Linux kernel repo:
...
> Summary
>   'new' ran
>     1.03 ± 0.07 times faster than 'old'
>
> Similarly, the same command on the Git repository gave these numbers:
...
> Summary
>   'new' ran
>     1.05 ± 0.04 times faster than 'old'
>
> Finally, to demonstrate that performance holds when frequently using
> large buffers, the numbers below are for 'git pack-objects' packing all
> objects in the Git repository between v2.30.0 and v2.31.1:
...
> Summary
>   'new' ran
>     1.03 ± 0.06 times faster than 'old'
>
> With these consistent improvements of 3-5%, ...

These numbers seem consistent across repos and test commands, and they
are roughly the inverse of the slowdown I was seeing in the index
refactor. Through confirmation bias, that convinced me I had done
something clever.

I was using hyperfine to run these numbers, with the hope that it
provides a consistent scenario worthy of testing. I used this command,
roughly (in a script):

hyperfine \
	-n "old" "$1 && $OLD_GIT $2 <input" \
	-n "new" "$1 && $NEW_GIT $2 <input" \
	--warmup=3 \
	--min-runs=20

where I would pass some preparatory step as "$1" and the Git commands to
run as "$2", and have an input file (necessary for the pack-objects
command).

The first thing I did when confronted with the flush problem was to swap
the order of the "old" and "new" lines, and that made the performance
difference go away, hinting that the number of warmups needed to
increase. After changing to "--warmup=20" and "--min-runs=50", the
difference in timing went away entirely. I did the same with my draft
changes to the index write code, and that made the 1-2% performance drop
go away, too.

So, this whole adventure was based on a faulty performance test.

But... is there something we could still do here?

My confusion about flushing is mostly due to my error, but upon
reflection the loop is doing a lot of different things, even though most
of the time we know which behavior we need at the start, in the middle,
and at the end:

 1. Fill the existing buffer with the beginning of 'buf'. If the
    hashfile's buffer is full, then flush.

 2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
    possible.

 3. Copy the remaining bytes out of 'buf' into the hashfile's buffer.

Here is a rewrite that more explicitly follows this flow:

void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
{
	const int full_buffer = sizeof(f->buffer);
	unsigned left = full_buffer - f->offset;
	unsigned nr = count > left ? left : count;

	/*
	 * Initially fill the buffer in a batch until it
	 * is full, then flush.
	 */
	if (f->do_crc)
		f->crc32 = crc32(f->crc32, buf, nr);

	memcpy(f->buffer + f->offset, buf, nr);
	f->offset += nr;
	count -= nr;
	buf = (char *) buf + nr;

	if (left == nr)
		hashflush(f);

	/*
	 * After filling the hashfile's buffer and flushing, take
	 * batches of full_buffer bytes directly from the input
	 * buffer.
	 */
	while (count >= full_buffer) {
		if (f->do_crc)
			f->crc32 = crc32(f->crc32, buf, full_buffer);

		the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
		flush(f, buf, full_buffer);

		count -= full_buffer;
		buf = (char *) buf + full_buffer;
	}

	/*
	 * Capture any remaining bytes at the end of the input buffer
	 * into the hashfile's buffer. We do not need to flush because
	 * count is strictly less than full_buffer here.
	 */
	if (count) {
		if (f->do_crc)
			f->crc32 = crc32(f->crc32, buf, count);

		memcpy(f->buffer + f->offset, buf, count);
		f->offset = count;
	}

	if (f->base)
		hashwrite(f->base, buf, count);
}

With this implementation (and the more robust performance test), the
performance for pack-objects and index-pack remains constant, but there
is a slight improvement for 'git multi-pack-index write', which is
mostly translating data from the pack-indexes into a multi-pack-index:

Using the Git repository:

Benchmark #1: old
  Time (mean ± σ):     270.4 ms ±   6.9 ms    [User: 184.6 ms, System: 38.6 ms]
  Range (min … max):   258.6 ms … 283.2 ms    50 runs

Benchmark #2: new
  Time (mean ± σ):     265.3 ms ±   6.0 ms    [User: 180.9 ms, System: 37.8 ms]
  Range (min … max):   257.4 ms … 282.0 ms    50 runs

Summary
  'new' ran
    1.02 ± 0.03 times faster than 'old'

Using the Linux kernel repository:

Benchmark #1: old
  Time (mean ± σ):      2.321 s ±  0.011 s    [User: 1.538 s, System: 0.335 s]
  Range (min … max):    2.301 s …  2.353 s    50 runs

Benchmark #2: new
  Time (mean ± σ):      2.290 s ±  0.011 s    [User: 1.513 s, System: 0.329 s]
  Range (min … max):    2.273 s …  2.318 s    50 runs

Summary
  'new' ran
    1.01 ± 0.01 times faster than 'old'

Again, variance might be at play here, but after running this test
multiple times, I was never able to see less than the 1% improvement
reported here.

So, I'm of two minds here:

 1. This is embarrassing. I wasted everyone's time for nothing. I can
    retract this patch.

 2. This is embarrassing. I overstated the problem here. But we might be
    able to eke out a tiny performance boost here.

I'm open to either. I think we should default to dropping this patch
unless someone thinks the rewrite above is a better organization of the
logic. (I can then send a v2 including that version and an updated
commit message.)

Thanks,
-Stolee

P.S. Special thanks to Peff who pointed out my error in private.
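P.P.S. For reference, the rewrite above leans on the existing
hashflush() to hash and write out whatever is sitting in the hashfile's
buffer; it is roughly this, in csum-file.c:

void hashflush(struct hashfile *f)
{
	unsigned offset = f->offset;

	if (offset) {
		/* hash and write out the buffered bytes, then mark
		 * the buffer as empty */
		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
		flush(f, f->buffer, offset);
		f->offset = 0;
	}
}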
Derrick Stolee <stolee@gmail.com> writes:

> But... is there something we could still do here?
>
> My confusion about flushing is mostly due to my error, but upon
> reflection the loop is doing a lot of different things, even though
> most of the time we know which behavior we need at the start, in the
> middle, and at the end:
>
>  1. Fill the existing buffer with the beginning of 'buf'. If the
>     hashfile's buffer is full, then flush.

"But do not do this if f->buffer is empty and we are writing out more
than sizeof(f->buffer)." is missing, isn't it?

>  2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
>     possible.
>
>  3. Copy the remaining bytes out of 'buf' into the hashfile's buffer.

It is debatable if the existing loop, which came mostly from Nico's
a8032d12 (sha1write: don't copy full sized buffers, 2008-09-02), is too
clever; I personally find it concise and readable enough, but my
reading is tainted.

If you use a couple of helpers to reduce the repeated "crc and hash"
pattern in your variant, it may become easier to follow than the
original, but I dunno.

> Here is a rewrite that more explicitly follows this flow:
>
> void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
> {
> 	const int full_buffer = sizeof(f->buffer);
> 	unsigned left = full_buffer - f->offset;
> 	unsigned nr = count > left ? left : count;
>
> 	/*
> 	 * Initially fill the buffer in a batch until it
> 	 * is full, then flush.
> 	 */
> 	if (f->do_crc)
> 		f->crc32 = crc32(f->crc32, buf, nr);
>
> 	memcpy(f->buffer + f->offset, buf, nr);

Here, if the f->buffer was empty, we end up memcpy()ing a full
bufferful unconditionally. Nico's original cleverly takes advantage of
the fact that 'nr' would be the full buffer size only when the
f->buffer was empty upon entry to the function and we have more bytes
than the size of the buffer to copy out directly from 'buf'.

> 	f->offset += nr;
> 	count -= nr;
> 	buf = (char *) buf + nr;
>
> 	if (left == nr)
> 		hashflush(f);
>
> 	/*
> 	 * After filling the hashfile's buffer and flushing, take
> 	 * batches of full_buffer bytes directly from the input
> 	 * buffer.
> 	 */
> 	while (count >= full_buffer) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, full_buffer);
>
> 		the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
> 		flush(f, buf, full_buffer);
>
> 		count -= full_buffer;
> 		buf = (char *) buf + full_buffer;
> 	}
>
> 	/*
> 	 * Capture any remaining bytes at the end of the input buffer
> 	 * into the hashfile's buffer. We do not need to flush because
> 	 * count is strictly less than full_buffer here.
> 	 */
> 	if (count) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, count);
>
> 		memcpy(f->buffer + f->offset, buf, count);
> 		f->offset = count;
> 	}
>
> 	if (f->base)
> 		hashwrite(f->base, buf, count);
> }
> ...
> So, I'm of two minds here:
>
>  1. This is embarrassing. I wasted everyone's time for nothing. I can
>     retract this patch.
>
>  2. This is embarrassing. I overstated the problem here. But we might
>     be able to eke out a tiny performance boost here.
>
> I'm open to either. I think we should default to dropping this patch
> unless someone thinks the rewrite above is a better organization of
> the logic. (I can then send a v2 including that version and an updated
> commit message.)

 3. The current code around "if (nr == sizeof(f->buffer))" might be a
    bit too clever for readers who try to understand what is going on,
    and the whole "while" loop may deserve a comment based on what you
    wrote before your replacement implementation.
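For illustration, the helpers I have in mind might look something like
this (untested sketch; the names are invented):

static void hashfile_crc(struct hashfile *f, const void *buf,
			 unsigned int nr)
{
	/* fold the repeated "update crc32 only if enabled" guard */
	if (f->do_crc)
		f->crc32 = crc32(f->crc32, buf, nr);
}

static void hash_and_flush(struct hashfile *f, const void *buf,
			   unsigned int nr)
{
	/* hash a full block and write it out, bypassing f->buffer */
	the_hash_algo->update_fn(&f->ctx, buf, nr);
	flush(f, buf, nr);
}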
Junio C Hamano <gitster@pobox.com> writes:

>> So, I'm of two minds here:
>>
>>  1. This is embarrassing. I wasted everyone's time for nothing. I can
>>     retract this patch.
>>
>>  2. This is embarrassing. I overstated the problem here. But we might
>>     be able to eke out a tiny performance boost here.
>>
>> I'm open to either. I think we should default to dropping this patch
>> unless someone thinks the rewrite above is a better organization of
>> the logic. (I can then send a v2 including that version and an updated
>> commit message.)
>
>  3. The current code around "if (nr == sizeof(f->buffer))" might be a
>     bit too clever for readers who try to understand what is going on,
>     and the whole "while" loop may deserve a comment based on what you
>     wrote before your replacement implementation.

Having said all that, comparing the original and the version updated
with your "flush less often" patch, I find the latter quite easier to
read, so as long as the update does not give us a 1% slowdown, it may
be worth adopting for the readability improvement alone.

Of course, if we were to go that route, the sales pitch in the log
message needs to be updated.

Thanks.
On Thu, Mar 25, 2021 at 11:52:29AM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> >> So, I'm of two minds here:
> >>
> >>  1. This is embarrassing. I wasted everyone's time for nothing. I
> >>     can retract this patch.
> >>
> >>  2. This is embarrassing. I overstated the problem here. But we
> >>     might be able to eke out a tiny performance boost here.
> >>
> >> I'm open to either. I think we should default to dropping this patch
> >> unless someone thinks the rewrite above is a better organization of
> >> the logic. (I can then send a v2 including that version and an
> >> updated commit message.)
> >
> >  3. The current code around "if (nr == sizeof(f->buffer))" might be a
> >     bit too clever for readers who try to understand what is going
> >     on, and the whole "while" loop may deserve a comment based on
> >     what you wrote before your replacement implementation.

Yes, my first thought on reading Stolee's post-image was: wait, how do
we know when data needs to be flushed from the buffer? But that is not
new in his patch. It is confusing before and after. :)

> Having said all that, comparing the original and the version updated
> with your "flush less often" patch, I find the latter quite easier to
> read, so as long as the update does not give us a 1% slowdown, it may
> be worth adopting for the readability improvement alone.
>
> Of course, if we were to go that route, the sales pitch in the log
> message needs to be updated.

Yeah, I am OK with either version, as long as it is justified correctly
in the commit message.

IMHO the big difference is that the original is using local data/offset
variables in order to provide a layer of indirection when we get to the
hash+flush code. And Stolee's patch is calling the same code in the two
places instead. It's quite possible that gives the compiler slightly
more opportunity to micro-optimize (which doesn't matter if you are
feeding big blocks, but may if you are feeding 4 bytes at a time as in
the midx code; though in that case it is entirely possible that the
caller allocating a single array, writing it, and then feeding it to
hashwrite() would be faster still, though a little more cumbersome).

-Peff
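P.S. To sketch that last idea (untested; values[] and its size are made
up for illustration), the caller would batch its own bytes:

	uint32_t values[64];	/* fields already computed by the caller */
	unsigned char block[sizeof(values)];
	size_t i;

	/* write each value in network byte order into one block... */
	for (i = 0; i < ARRAY_SIZE(values); i++)
		put_be32(block + i * sizeof(uint32_t), values[i]);

	/* ...and pay the hashwrite() call overhead only once */
	hashwrite(f, block, sizeof(block));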
diff --git a/csum-file.c b/csum-file.c
index 0f35fa5ee47c..39644af590a5 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -89,32 +89,26 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
 void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 {
 	while (count) {
-		unsigned offset = f->offset;
-		unsigned left = sizeof(f->buffer) - offset;
+		unsigned left = sizeof(f->buffer) - f->offset;
 		unsigned nr = count > left ? left : count;
-		const void *data;
 
 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);
 
 		if (nr == sizeof(f->buffer)) {
 			/* process full buffer directly without copy */
-			data = buf;
+			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			flush(f, buf, nr);
 		} else {
-			memcpy(f->buffer + offset, buf, nr);
-			data = f->buffer;
+			memcpy(f->buffer + f->offset, buf, nr);
+			f->offset += nr;
+			left -= nr;
+			if (!left)
+				hashflush(f);
 		}
 
 		count -= nr;
-		offset += nr;
 		buf = (char *) buf + nr;
-		left -= nr;
-		if (!left) {
-			the_hash_algo->update_fn(&f->ctx, data, offset);
-			flush(f, data, offset);
-			offset = 0;
-		}
-		f->offset = offset;
 	}
 }
 