Message ID | ab64354851e2aa61e901e37814b2ae33d8f855d1.1605123653.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | pack-bitmap: bitmap generation improvements | expand |
On Wed, Nov 11, 2020 at 02:43:51PM -0500, Taylor Blau wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The bitmap_writer_build() method calls bitmap_builder_init() to > construct a list of commits reachable from the selected commits along > with a "reverse graph". This reverse graph has edges pointing from a > commit to other commits that can reach that commit. After computing a > reachability bitmap for a commit, the values in that bitmap are then > copied to the reachability bitmaps across the edges in the reverse > graph. > > We can now relax the role of the reverse graph to greatly reduce the > number of intermediate reachability bitmaps we compute during this > reverse walk. The end result is that we walk objects the same number of > times as before when constructing the reachability bitmaps, but we also > spend much less time copying bits between bitmaps and have much lower > memory pressure in the process. > > The core idea is to select a set of "important" commits based on > interactions among the sets of commits reachable from each selected commit. This patch breaks the test 'truncated bitmap fails gracefully (ewah)' when run with GIT_TEST_DEFAULT_HASH=sha256: expecting success of 5310.66 'truncated bitmap fails gracefully (ewah)': test_config pack.writebitmaphashcache false && git repack -ad && git rev-list --use-bitmap-index --count --all >expect && bitmap=$(ls .git/objects/pack/*.bitmap) && test_when_finished "rm -f $bitmap" && test_copy_bytes 256 <$bitmap >$bitmap.tmp && mv -f $bitmap.tmp $bitmap && git rev-list --use-bitmap-index --count --all >actual 2>stderr && test_cmp expect actual && test_i18ngrep corrupt.ewah.bitmap stderr + test_config pack.writebitmaphashcache false + git repack -ad + git rev-list --use-bitmap-index --count --all + ls .git/objects/pack/pack-23fe19963d67a1d4797a39622c15144bbf35ab76a2c0638ba9288cc688c24c16.bitmap + bitmap=.git/objects/pack/pack-23fe19963d67a1d4797a39622c15144bbf35ab76a2c0638ba9288cc688c24c16.bitmap + test_when_finished rm -f .git/objects/pack/pack-23fe19963d67a1d4797a39622c15144bbf35ab76a2c0638ba9288cc688c24c16.bitmap + test_copy_bytes 256 + mv -f .git/objects/pack/pack-23fe19963d67a1d4797a39622c15144bbf35ab76a2c0638ba9288cc688c24c16.bitmap.tmp .git/objects/pack/pack-23fe19963d67a1d4797a39622c15144bbf35ab76a2c0638ba9288cc688c24c16.bitmap + git rev-list --use-bitmap-index --count --all + test_cmp expect actual --- expect 2020-11-13 22:20:39.246355100 +0000 +++ actual 2020-11-13 22:20:39.254355294 +0000 @@ -1 +1 @@ -239 +236 error: last command exited with $?=1
On Fri, Nov 13, 2020 at 11:23:28PM +0100, SZEDER Gábor wrote: > This patch breaks the test 'truncated bitmap fails gracefully (ewah)' > when run with GIT_TEST_DEFAULT_HASH=sha256: Thanks for reporting. It's mostly unluckiness that is unrelated to this commit. We're corrupting the bitmap by truncating it: > test_copy_bytes 256 <$bitmap >$bitmap.tmp && > mv -f $bitmap.tmp $bitmap && and then expecting to notice the problem. But it really depends on which bitmaps we try to look at, and exactly where the truncation is. And this commit just happens to rearrange the exact bytes we write to the bitmap file. If I do this: diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 68badd63cb..a83e7a93fb 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -436,7 +436,7 @@ test_expect_success 'truncated bitmap fails gracefully (ewah)' ' git rev-list --use-bitmap-index --count --all >expect && bitmap=$(ls .git/objects/pack/*.bitmap) && test_when_finished "rm -f $bitmap" && - test_copy_bytes 256 <$bitmap >$bitmap.tmp && + test_copy_bytes 270 <$bitmap >$bitmap.tmp && mv -f $bitmap.tmp $bitmap && git rev-list --use-bitmap-index --count --all >actual 2>stderr && test_cmp expect actual && then it passes with both sha1 and sha256. But what's slightly disturbing is this output: > --- expect 2020-11-13 22:20:39.246355100 +0000 > +++ actual 2020-11-13 22:20:39.254355294 +0000 > @@ -1 +1 @@ > -239 > +236 > error: last command exited with $?=1 We're actually producing the wrong answer here, which implies that ewah_read_mmap() is not being careful enough. Or possibly we are feeding it extra bytes (e.g., letting it run over into the name-hash cache or into the trailer checksum). I think we'll have to dig further into this, probably running the sha256 case in a debugger to see what offsets we actually end up reading. -Peff
On Fri, Nov 13, 2020 at 06:03:24PM -0500, Jeff King wrote: > But what's slightly disturbing is this output: > > > --- expect 2020-11-13 22:20:39.246355100 +0000 > > +++ actual 2020-11-13 22:20:39.254355294 +0000 > > @@ -1 +1 @@ > > -239 > > +236 > > error: last command exited with $?=1 > > We're actually producing the wrong answer here, which implies that > ewah_read_mmap() is not being careful enough. Or possibly we are feeding > it extra bytes (e.g., letting it run over into the name-hash cache or > into the trailer checksum). > > I think we'll have to dig further into this, probably running the sha256 > case in a debugger to see what offsets we actually end up reading. Yep, the problem is in the caller, which is not careful about size checks before reading the header before the actual ewah. The first hunk here fixes it (the second is just another possible corruption I noticed, but not triggered by the test): diff --git a/pack-bitmap.c b/pack-bitmap.c index dc811ebae8..785009b04e 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -229,11 +229,16 @@ static int load_bitmap_entries_v1(struct bitmap_index *index) uint32_t commit_idx_pos; struct object_id oid; + if (index->map_size - index->map_pos < 6) + return error("corrupt ewah bitmap: truncated header for entry %d", i); + commit_idx_pos = read_be32(index->map, &index->map_pos); xor_offset = read_u8(index->map, &index->map_pos); flags = read_u8(index->map, &index->map_pos); - nth_packed_object_id(&oid, index->pack, commit_idx_pos); + if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0) + return error("corrupt ewah bitmap: commit index %u out of range", + (unsigned)commit_idx_pos); bitmap = read_bitmap_1(index); if (!bitmap) We should definitely do something like this, but there are some possible further improvements: - I think that map_size includes the trailing hash, and almost certainly any post-index extensions. We could probably compute the correct boundary of the bitmaps themselves in the caller and make sure we don't read past it. I'm not sure if it's worth the effort, though. In a truncation situation, basically all bets are off (is the trailer still there and the bitmap entries malformed, or is the trailer truncated?). The best we can do is try to read what's there as if it's correct data (and protect ourselves when it's obviously bogus). - we could avoid the magic "6" if read_be32() and read_u8(), which are custom helpers for this function, checked sizes before advancing the pointers. - I'm hesitant to add more tests in this area. As you can see from the commit which "broke" the test, truncating at byte N is going to be sensitive to small variations in the bitmap generation. So unless we're actually parsing the bitmaps and doing targeted corruptions, the tests will be somewhat brittle. -Peff
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 369c76a87c..7b4fc0f304 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -180,8 +180,10 @@ static void compute_xor_offsets(void) struct bb_commit { struct commit_list *reverse_edges; + struct bitmap *commit_mask; struct bitmap *bitmap; - unsigned selected:1; + unsigned selected:1, + maximal:1; unsigned idx; /* within selected array */ }; @@ -198,7 +200,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb, { struct rev_info revs; struct commit *commit; - unsigned int i; + unsigned int i, num_maximal; memset(bb, 0, sizeof(*bb)); init_bb_data(&bb->data); @@ -210,27 +212,85 @@ static void bitmap_builder_init(struct bitmap_builder *bb, for (i = 0; i < writer->selected_nr; i++) { struct commit *c = writer->selected[i].commit; struct bb_commit *ent = bb_data_at(&bb->data, c); + ent->selected = 1; + ent->maximal = 1; ent->idx = i; + + ent->commit_mask = bitmap_new(); + bitmap_set(ent->commit_mask, i); + add_pending_object(&revs, &c->object, ""); } + num_maximal = writer->selected_nr; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); while ((commit = get_revision(&revs))) { struct commit_list *p; + struct bb_commit *c_ent; parse_commit_or_die(commit); - ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); - bb->commits[bb->commits_nr++] = commit; + c_ent = bb_data_at(&bb->data, commit); + + if (c_ent->maximal) { + if (!c_ent->selected) { + bitmap_set(c_ent->commit_mask, num_maximal); + num_maximal++; + } + + ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); + bb->commits[bb->commits_nr++] = commit; + } for (p = commit->parents; p; p = p->next) { - struct bb_commit *ent = bb_data_at(&bb->data, p->item); - commit_list_insert(commit, &ent->reverse_edges); + struct bb_commit *p_ent = bb_data_at(&bb->data, p->item); + int c_not_p, p_not_c; + + if (!p_ent->commit_mask) { + p_ent->commit_mask = bitmap_new(); + c_not_p = 1; + p_not_c = 0; + } else { + c_not_p = bitmap_diff_nonzero(c_ent->commit_mask, p_ent->commit_mask); + p_not_c = bitmap_diff_nonzero(p_ent->commit_mask, c_ent->commit_mask); + } + + if (!c_not_p) + continue; + + bitmap_or(p_ent->commit_mask, c_ent->commit_mask); + + if (p_not_c) + p_ent->maximal = 1; + else { + p_ent->maximal = 0; + free_commit_list(p_ent->reverse_edges); + p_ent->reverse_edges = NULL; + } + + if (c_ent->maximal) { + commit_list_insert(commit, &p_ent->reverse_edges); + } else { + struct commit_list *cc = c_ent->reverse_edges; + + for (; cc; cc = cc->next) { + if (!commit_list_contains(cc->item, p_ent->reverse_edges)) + commit_list_insert(cc->item, &p_ent->reverse_edges); + } + } } + + bitmap_free(c_ent->commit_mask); + c_ent->commit_mask = NULL; } + + trace2_data_intmax("pack-bitmap-write", the_repository, + "num_selected_commits", writer->selected_nr); + trace2_data_intmax("pack-bitmap-write", the_repository, + "num_maximal_commits", num_maximal); } static void bitmap_builder_clear(struct bitmap_builder *bb) diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 6bf68fee85..33ef9a098d 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -20,11 +20,87 @@ has_any () { grep -Ff "$1" "$2" } +# To ensure the logic for "maximal commits" is exercised, make +# the repository a bit more complicated. +# +# other master +# * * +# (99 commits) (99 commits) +# * * +# |\ /| +# | * octo-other octo-master * | +# |/|\_________ ____________/|\| +# | \ \/ __________/ | +# | | ________/\ / | +# * |/ * merge-right * +# | _|__________/ \____________ | +# |/ | \| +# (l1) * * merge-left * (r1) +# | / \________________________ | +# |/ \| +# (l2) * * (r2) +# \____________...____________ | +# \| +# * (base) +# +# The important part for the maximal commit algorithm is how +# the bitmasks are extended. Assuming starting bit positions +# for master (bit 0) and other (bit 1), and some flexibility +# in the order that merge bases are visited, the bitmasks at +# the end should be: +# +# master: 1 (maximal, selected) +# other: 01 (maximal, selected) +# octo-master: 1 +# octo-other: 01 +# merge-right: 111 (maximal) +# (l1): 111 +# (r1): 111 +# merge-left: 1101 (maximal) +# (l2): 11111 (maximal) +# (r2): 111101 (maximal) +# (base): 1111111 (maximal) + test_expect_success 'setup repo with moderate-sized history' ' - test_commit_bulk --id=file 100 && + test_commit_bulk --id=file 10 && git checkout -b other HEAD~5 && test_commit_bulk --id=side 10 && + + # add complicated history setup, including merges and + # ambiguous merge-bases + + git checkout -b merge-left other~2 && + git merge master~2 -m "merge-left" && + + git checkout -b merge-right master~1 && + git merge other~1 -m "merge-right" && + + git checkout -b octo-master master && + git merge merge-left merge-right -m "octopus-master" && + + git checkout -b octo-other other && + git merge merge-left merge-right -m "octopus-other" && + + git checkout other && + git merge octo-other -m "pull octopus" && + git checkout master && + git merge octo-master -m "pull octopus" && + + # Remove these branches so they are not selected + # as bitmap tips + git branch -D merge-left && + git branch -D merge-right && + git branch -D octo-other && + git branch -D octo-master && + + # add padding to make these merges less interesting + # and avoid having them selected for bitmaps + test_commit_bulk --id=file 100 && + git checkout other && + test_commit_bulk --id=side 100 && + git checkout master && + bitmaptip=$(git rev-parse master) && blob=$(echo tagged-blob | git hash-object -w --stdin) && git tag tagged-blob $blob && @@ -32,9 +108,12 @@ test_expect_success 'setup repo with moderate-sized history' ' ' test_expect_success 'full repack creates bitmaps' ' - git repack -ad && + GIT_TRACE2_EVENT_NESTING=4 GIT_TRACE2_EVENT="$(pwd)/trace" \ + git repack -ad && ls .git/objects/pack/ | grep bitmap >output && - test_line_count = 1 output + test_line_count = 1 output && + grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace && + grep "\"key\":\"num_maximal_commits\",\"value\":\"111\"" trace ' test_expect_success 'rev-list --test-bitmap verifies bitmaps' '