Message ID | 5262daa3300114fbaccdbc7393882c5435f95f4f.1605649533.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | pack-bitmap: bitmap generation improvements | expand |
I think this is the "big patch" mentioned in IRC [1]? I'll review just the commit message of this one first, then go back and review from patch 10 ("pack-bitmap-write: reimplement bitmap writing") up to this one inclusive. [1] https://colabti.org/irclogger/irclogger_log/git-devel?date=2020-11-23 > From: Derrick Stolee <dstolee@microsoft.com> > > The bitmap_writer_build() method calls bitmap_builder_init() to > construct a list of commits reachable from the selected commits along > with a "reverse graph". This reverse graph has edges pointing from a > commit to other commits that can reach that commit. After computing a > reachability bitmap for a commit, the values in that bitmap are then > copied to the reachability bitmaps across the edges in the reverse > graph. > > We can now relax the role of the reverse graph to greatly reduce the > number of intermediate reachability bitmaps we compute during this > reverse walk. The end result is that we walk objects the same number of > times as before when constructing the reachability bitmaps, but we also > spend much less time copying bits between bitmaps and have much lower > memory pressure in the process. OK - as I have seen in the previous patches, and as said here, the edges of the graph were previously parent to immediate child, but I believe that patch 12 ("pack-bitmap-write: fill bitmap with commit history") makes it so that the edges don't need to be direct parent-child relationships. That patch does not, however, decide what the vertices and edges should thus be, so that ability could not be used. This patch seems to do so, and thus makes use of that ability. > The core idea is to select a set of "important" commits based on > interactions among the sets of commits reachable from each selected commit. Makes sense. > The first technical concept is to create a new 'commit_mask' member in the > bb_commit struct. Note that the selected commits are provided in an > ordered array. The first thing to do is to mark the ith bit in the > commit_mask for the ith selected commit. OK - so this commit_mask is like the bitmaps in Git. There is an array, initially populated with the list of selected commits (which are selected using another algorithm, which is a separate concern from this patch set), and each bit in commit_mask corresponds to the corresponding entry in that array. From this, I assume that the commit_mask values in the selected bb_commit structs will start with a nonzero value (written in binary, having 1 bit set), and the other commit_mask values will start with zero. > As we walk the commit-graph, we > copy the bits in a commit's commit_mask to its parents. At the end of > the walk, the ith bit in the commit_mask for a commit C stores a boolean > representing "The ith selected commit can reach C." The walk is done in topological order - visiting children before parents. Copying makes sense - if a commit can reach me, it can reach my parents as well. > As we walk, we will discover non-selected commits that are important. We > will get into this later, but those important commits must also receive > bit positions, growing the width of the bitmasks as we walk. At the true > end of the walk, the ith bit means "the ith _important_ commit can reach > C." OK - so the initial array, initially populated with the list of selected commits, can be grown and will include other important commits as well. This is similar to the bitmap revwalk algorithm - the bitmaps in that algorithm can be grown to include other objects as well. > MAXIMAL COMMITS > --------------- > > We use a new 'maximal' bit in the bb_commit struct to represent whether > a commit is important or not. The term "maximal" comes from the > partially-ordered set of commits in the commit-graph where C >= P if P > is a parent of C, and then extending the relationship transitively. I had to look up what "maximal" means in a partial order. :-P An element of a partially ordered set is "maximal" if there is no other element that is "greater" than it. Here, all descendants are "greater" than their ancestors. > Instead of taking the maximal commits across the entire commit-graph, I was wondering about this :-) > we > instead focus on selecting each commit that is maximal among commits > with the same bits on in their commit_mask. Ah, OK. Two commits will have the same commit_mask if the exact same set of important commits can reach them. > This definition is > important, so let's consider an example. > > Suppose we have three selected commits A, B, and C. These are assigned > bitmasks 100, 010, and 001 to start. Each of these can be marked as > maximal immediately because they each will be the uniquely maximal > commit that contains their own bit. That is correct. To further elaborate on this explanation, let's say we have a selected commit C (and since it is selected, the commit_mask in each commit will have a bit corresponding to whether C can reach it). Each other commit is either an ancestor, a descendant, or unrelated. - C cannot reach descendants. - C cannot reach unrelated commits. - C can reach all ancestors, but in the partial order, C compares "greater" to them anyway. So every other commit cannot affect C's maximal status. > Keep in mind that that these commits > may have different bitmasks after the walk; for example, if B can reach > C but A cannot, then the final bitmask for C is 011. Even in these > cases, C would still be a maximal commit among all commits with the > third bit on in their masks. Yes. > Now define sets X, Y, and Z to be the sets of commits reachable from A, > B, and C, respectively. The intersections of these sets correspond to > different bitmasks: > > * 100: X - (Y union Z) > * 010: Y - (X union Z) > * 001: Z - (X union Y) > * 110: (X intersect Y) - Z > * 101: (X intersect Z) - Y > * 011: (Y intersect Z) - X > * 111: X intersect Y intersect Z > > This can be visualized with the following Hasse diagram: > > 100 010 001 > | \ / \ / | > | \/ \/ | > | /\ /\ | > | / \ / \ | > 110 101 011 > \___ | ___/ > \ | / > 111 > > Some of these bitmasks may not be represented, depending on the topology > of the commit-graph. In fact, we are counting on it, since the number of > possible bitmasks is exponential in the number of selected commits, but > is also limited by the total number of commits. In practice, very few > bitmasks are possible because most commits converge on a common "trunk" > in the commit history. This section wasn't very useful to me - but I would appreciate it if others chimed in to say it was useful to them. > With this three-bit example, we wish to find commits that are maximal > for each bitmask. How can we identify this as we are walking? OK - now we come to the algorithm. I presume the algorithm doesn't only find commits that are maximal for each bitmask, but also updates the list of important commits (and thus increasing the size of the bitmask)? Reading below, I see that the answer to my question is yes. Ah...it wasn't clear to me that the purpose of finding the maximal commits is also to add to the list of important commits, but perhaps it will be obvious to other reasons. I'll work through the algorithm using the butterfly example below, reproduced here: > I J > |\ /| > | \/ | > | /\ | > |/ \| > M N > \ / > |/ > Q I was going to suggest that we suppose that there are no selected commits, but it looks like the algorithm would optimize itself out (meaning that it won't make any commit maximal - which makes sense, I guess). The example below had I and J as selected commits (which I know because the commit_mask values for I and J are "0b10" and "0b01" respectively), so let's go with that. > As we walk, we visit a commit C. Since we are walking the commits in > topo-order, we know that C is visited after all of its children are > visited. Thus, when we get C from the revision walk we inspect the > 'maximal' property of its bb_data and use that to determine if C is truly > important. Its commit_mask is also nearly final. OK - when a commit is visited, we would already know its "maximal" status because when we had visited its parents, we already modified "maximal" (because we update a commit's children when we visit it - details about this are to follow, presumably). > If C is not one of the > originally-selected commits, then assign a bit position to C (by > incrementing num_maximal) and set that bit on in commit_mask. See > "MULTIPLE MAXIMAL COMMITS" below for more detail on this. I presume we only assign a bit position to C if it is "maximal"? > Now that the commit C is known to be maximal or not, consider each > parent P of C. Compute two new values: > > * c_not_p : true if and only if the commit_mask for C contains a bit > that is not contained in the commit_mask for P. > > * p_not_c : true if and only if the commit_mask for P contains a bit > that is not contained in the commit_mask for P. OK, let's try this with I. I'll use the same <commit letter>:<commit_mask in little-endian order> notation as the one the commit author uses below to indicate the commit_mask of a commit. We have I:10 with 2 parents M:00 and N:00, so for both parents, c_not_p is true and p_not_c is false. > If c_not_p is false, then P already has all of the bits that C would > provide to its commit_mask. In this case, move on to other parents as C > has nothing to contribute to P's state that was not already provided by > other children of P. To emphasize, we "move on" regardless of what p_not_c is. In our example, this is not true in I's case, so let's read on. After the analysis below, I see why we can "move on". > We continue with the case that c_not_p is true. This means there are > bits in C's commit_mask to copy to P's commit_mask, so use bitmap_or() > to add those bits. OK. So we have I:10 (unchanged), M:10, N:10. Which as I said above, makes sense, since if a commit can reach I, it can reach M and N. > If p_not_c is also true, then set the maximal bit for P to one. This means > that if no other commit has P as a parent, then P is definitely maximal. > This is because no child had the same bitmask. It is important to think > about the maximal bit for P at this point as a temporary state: "P is > maximal based on current information." > > In contrast, if p_not_c is false, then set the maximal bit for P to > zero. At first it was confusing that (1) the last maximal bit is used when there is no guarantee that the children are iterated in any particular order, and (2) the maximal bit is never updated when c_not_p is false. So I was thinking of a counterexample, but couldn't think of one. So this algorithm looks correct so far. For (2), I think of it this way: C1 C2 C3 \ |/ P Let's say that C1 has a commit_mask, and C2 and C3 have the exact same commit_mask. P's maximal bit will be the exact same in the following situation: C1 C2 \ | P So skipping over C3 is correct. And C3 must be skipped because the algorithm as written is not idempotent (during the C2 iteration, commit_mask of P is updated). For (1), I think of it as follows. The only one that counts is the very last calculation (which you can see from the algorithm - the maximal bit is constantly being overridden). During the very last iteration, does P have any information that Cx does not? If yes, P has a commit_mask that is unique w.r.t. all its children (since it combines unique information from its other children that Cx does not, plus some other unique information that Cx has and its other children do not) and is therefore maximal. If not, all the information P got is also known by Cx, so it is definitely not maximal (Cx shares the commit_mask with P, and is "greater" than P). > Further, clear all reverse_edges for P since any edges that were > previously assigned to P are no longer important. P will gain all > reverse edges based on C. (I read ahead to see what the reverse edges are.) Indeed, C has all the information. > The final thing we need to do is to update the reverse edges for P. > These reverse edges respresent "which closest maximal commits > contributed bits to my commit_mask?" Since C contributed bits to P's > commit_mask in this case, C must add to the reverse edges of P. > > If C is maximal, then C is a 'closest' maximal commit that contributed > bits to P. Add C to P's reverse_edges list. > > Otherwise, C has a list of maximal commits that contributed bits to its > bitmask (and this list is exactly one element). Add all of these items > to P's reverse_edges list. Be careful to ignore duplicates here. OK - the other end of reverse edges are always to maximal commits. Propagation in this way (if C is not maximal) makes sense. > After inspecting all parents P for a commit C, we can clear the > commit_mask for C. This reduces the memory load to be limited to the > "width" of the commit graph. Optimization - OK. I might as well finish working through the example. Let's start from the beginning. I J I:10()M J:01()M |\ /| | \/ | | /\ | |/ \| M N \ / |/ Q (Brackets are the destinations of reverse edges. M is maximal, ~M is not maximal.) Iteration starts with I. I is maximal, but it is also one of the selected commits, so we need not do anything else. Look at its parents, starting with M. c_of_p is true, so copy the bits over. p_of_c is false, so the maximal bit of M is zero, clear all its reverse edges (no-op in this case, since it has none), and update its reverse edges: C is maximal so it will just be C. The procedure for N is exactly the same. So we have: I J I:10()M J:01()M |\ /| | \/ | | /\ | |/ \| M N M:10(I)~M N:10(I)~M \ / |/ Q Now onto J. J is maximal, but it is also one of the selected commits, so we need not do anything else. Look at its parent M. c_of_p is true, so copy the bits over. p_of_c is true this time. So set the maximal bit to true. (Indeed, M has information that is independent of J - the stuff that it got from I.) We need not clear any reverse edges, but must still update them: J is maximal so we add J. The procedure for N is exactly the same. So we have: I J I:10()M J:01()M |\ /| | \/ | | /\ | |/ \| M N M:11(I,J)M N:11(I,J)M \ / |/ Q Let's go to M. M is maximal, and it is not one of the selected commits, so widen the commit_mask and set the corresponding bit on M. Look at its only parent Q. c_of_p is true, so copy the bits over. p_of_c is false, so the maximal bit of Q is zero, and update its reverse edges as usual. I J I:10()M J:01()M |\ /| | \/ | | /\ | |/ \| M N M:111(I,J)M N:11(I,J)M \ / |/ Q Q:111(M)~M Now, N. N is maximal, and it is not one of the selected commits, so widen the commit_mask and set the corresponding bit on N. Look at its only parent Q. c_of_p is true; copy bits; but this time p_of_c is true, so set the maximal bit to true. Don't clear reverse edges; N is maximal so add it. I J I:10()M J:01()M |\ /| | \/ | | /\ | |/ \| M N M:111(I,J)M N:1101(I,J)M \ / |/ Q Q:1111(M,N)M Checking the answer below, it looks like the commit author and I agree. > Consider our ABC/XYZ example from earlier and let's inspect the state of > the commits for an interesting bitmask, say 011. Suppose that D is the > only maximal commit with this bitmask (in the first three bits). All > other commits with bitmask 011 have D as the only entry in their > reverse_edges list. D's reverse_edges list contains B and C. Yes, this makes sense. Let me write about "D's reverse_edges list contains B and C" first: The fact that D has a bitmask of 011 shows no important commits can reach it other than B or C. Any intermediate commits between B and D would not be maximal (because B is "greater" than such an intermediate commit, and we already established that no other important commit can reach D, and therefore no other important commit can reach any of the commits on the path between B and D). Same analysis applies for C instead of B. So the propagated reverse edges would just be B and C. Now "all other commits with bitmask 011 have D as the only entry in their reverse_edges list". Since D is the only maximal commit with 011, any other commit that has 011 (1) must have got it from D, and (2) cannot be reached by any other important commit. A similar analysis as in the previous paragraph shows why the reverse_edges list for all these commits would only contain D. > COMPUTING REACHABILITY BITMAPS > ------------------------------ > > Now that we have our definition, let's zoom out and consider what > happens with our new reverse graph when computing reachability bitmaps. > We walk the reverse graph in reverse-topo-order, so we visit commits > with largest commit_masks first. After we compute the reachability > bitmap for a commit C, we push the bits in that bitmap to each commit D > in the reverse edge list for C. That makes sense - the reachability bitmap for D is the reachability bitmap for C + the reachability bitmap of (D-C). Here we have the first operand. > Then, when we finally visit D we already > have the bits for everything reachable from maximal commits that D can > reach and we only need to walk the objects in the set-difference. Walking the objects in the set-difference gives us the second operand. Makes sense. > In our ABC/XYZ example, when we finally walk for the commit A we only > need to walk commits with bitmask equal to A's bitmask. If that bitmask > is 100, then we are only walking commits in X - (Y union Z) because the > bitmap already contains the bits for objects reachable from (X intersect > Y) union (X intersect Z) (i.e. the bits from the reachability bitmaps > for the maximal commits with bitmasks 110 and 101). This is probably correct, but I've lost track of what X, Y, and Z are so I'll just skip this paragraph. > The behavior is intended to walk each commit (and the trees that commit > introduces) at most once while allocating and copying fewer reachability > bitmaps. Yes, I can see how this algorithm causes this behavior. > There is one caveat: what happens when there are multiple > maximal commits with the same bitmask, with respect to the initial set > of selected commits? I'm going to be lazy here and ask, is this possible? As described below in "MULTIPLE MAXIMAL COMMITS", if a non-selected commit turns out to be maximal, it will have its very own bit, and thus become the "progenitor" of all commits with that bit set. (This is not a true progenitor, because this bit propagates from the children to the parents and not the other way round - unlike a gene.) > MULTIPLE MAXIMAL COMMITS > ------------------------ I think I discussed everything here earlier, so [skip]. > PERFORMANCE MEASUREMENTS > ------------------------ Numbers look good. [skip] As discussed above, I'll not review the code this round. [skip code] Phew...this took longer than expected. I'll see if I can review the rest of the patches tomorrow.
I've reviewed the commit message already [1]; now let's look at the code. [1] https://lore.kernel.org/git/20201124060738.762751-1-jonathantanmy@google.com/ > struct bb_commit { > struct commit_list *reverse_edges; > + struct bitmap *commit_mask; > struct bitmap *bitmap; > - unsigned selected:1; > + unsigned selected:1, > + maximal:1; The code itself probably should contain comments about the algorithm, but I'm not sure of the best way to do it. (E.g. I would say that "maximal" should be "When iteration in bitmap_builder_init() reaches this bb_commit, this is true iff none of its descendants has or will ever have the exact same commit_mask" - but then when do we explain why the commit_mask matters?) > unsigned idx; /* within selected array */ > }; > > @@ -198,7 +200,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb, > { > struct rev_info revs; > struct commit *commit; > - unsigned int i; > + unsigned int i, num_maximal; > > memset(bb, 0, sizeof(*bb)); > init_bb_data(&bb->data); > @@ -210,27 +212,85 @@ static void bitmap_builder_init(struct bitmap_builder *bb, > for (i = 0; i < writer->selected_nr; i++) { > struct commit *c = writer->selected[i].commit; > struct bb_commit *ent = bb_data_at(&bb->data, c); > + > ent->selected = 1; > + ent->maximal = 1; > ent->idx = i; > + > + ent->commit_mask = bitmap_new(); > + bitmap_set(ent->commit_mask, i); > + > add_pending_object(&revs, &c->object, ""); > } > + num_maximal = writer->selected_nr; > > if (prepare_revision_walk(&revs)) > die("revision walk setup failed"); > > while ((commit = get_revision(&revs))) { > struct commit_list *p; > + struct bb_commit *c_ent; > > parse_commit_or_die(commit); > > - ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); > - bb->commits[bb->commits_nr++] = commit; > + c_ent = bb_data_at(&bb->data, commit); > + > + if (c_ent->maximal) { > + if (!c_ent->selected) { > + bitmap_set(c_ent->commit_mask, num_maximal); > + num_maximal++; > + } > + > + ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); > + bb->commits[bb->commits_nr++] = commit; So the order of bit assignments in the commit_mask and the order of commits in bb->commits are not the same. In the commit_mask, bits are first assigned for selected commits and then the rest for commits we discover to be maximal. But in bb->commits, the order follows the topologically-sorted iteration. This is fine, but might be worth a comment (to add to the already big comment burden...) > + } > > for (p = commit->parents; p; p = p->next) { > - struct bb_commit *ent = bb_data_at(&bb->data, p->item); > - commit_list_insert(commit, &ent->reverse_edges); > + struct bb_commit *p_ent = bb_data_at(&bb->data, p->item); > + int c_not_p, p_not_c; > + > + if (!p_ent->commit_mask) { > + p_ent->commit_mask = bitmap_new(); > + c_not_p = 1; > + p_not_c = 0; > + } else { > + c_not_p = bitmap_diff_nonzero(c_ent->commit_mask, p_ent->commit_mask); > + p_not_c = bitmap_diff_nonzero(p_ent->commit_mask, c_ent->commit_mask); > + } > + > + if (!c_not_p) > + continue; > + > + bitmap_or(p_ent->commit_mask, c_ent->commit_mask); > + > + if (p_not_c) > + p_ent->maximal = 1; > + else { > + p_ent->maximal = 0; > + free_commit_list(p_ent->reverse_edges); > + p_ent->reverse_edges = NULL; > + } > + > + if (c_ent->maximal) { > + commit_list_insert(commit, &p_ent->reverse_edges); > + } else { > + struct commit_list *cc = c_ent->reverse_edges; > + > + for (; cc; cc = cc->next) { > + if (!commit_list_contains(cc->item, p_ent->reverse_edges)) > + commit_list_insert(cc->item, &p_ent->reverse_edges); > + } > + } > } > + > + bitmap_free(c_ent->commit_mask); > + c_ent->commit_mask = NULL; > } > + > + trace2_data_intmax("pack-bitmap-write", the_repository, > + "num_selected_commits", writer->selected_nr); > + trace2_data_intmax("pack-bitmap-write", the_repository, > + "num_maximal_commits", num_maximal); > } The rest looks like a faithful implementation of the algorithm. Now let's look at the tests. > +# To ensure the logic for "maximal commits" is exercised, make > +# the repository a bit more complicated. > +# > +# other master > +# * * > +# (99 commits) (99 commits) > +# * * > +# |\ /| > +# | * octo-other octo-master * | > +# |/|\_________ ____________/|\| > +# | \ \/ __________/ | > +# | | ________/\ / | > +# * |/ * merge-right * > +# | _|__________/ \____________ | > +# |/ | \| > +# (l1) * * merge-left * (r1) > +# | / \________________________ | > +# |/ \| > +# (l2) * * (r2) > +# \____________...____________ | What does the ... represent? If a certain number of commits, it would be clearer to write that there. > +# \| > +# * (base) OK - some of the crosses are unclear, but from the bitmask given below, I know where the lines should go. > +# > +# The important part for the maximal commit algorithm is how > +# the bitmasks are extended. Assuming starting bit positions > +# for master (bit 0) and other (bit 1), and some flexibility > +# in the order that merge bases are visited, the bitmasks at > +# the end should be: > +# > +# master: 1 (maximal, selected) > +# other: 01 (maximal, selected) > +# octo-master: 1 > +# octo-other: 01 > +# merge-right: 111 (maximal) > +# (l1): 111 > +# (r1): 111 > +# merge-left: 1101 (maximal) > +# (l2): 11111 (maximal) > +# (r2): 111101 (maximal) > +# (base): 1111111 (maximal) This makes sense. (l1) and (r1) are not maximal because everything that can reach merge-right can also reach them. [snip] > test_expect_success 'full repack creates bitmaps' ' > - git repack -ad && > + GIT_TRACE2_EVENT_NESTING=4 GIT_TRACE2_EVENT="$(pwd)/trace" \ > + git repack -ad && > ls .git/objects/pack/ | grep bitmap >output && > - test_line_count = 1 output > + test_line_count = 1 output && > + grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace && > + grep "\"key\":\"num_maximal_commits\",\"value\":\"111\"" trace From the diagram and bit masks, I see that the important number for "maximal" is 7. Could this test be run twice - one without the crosses and one with, and we can verify that the difference between the maximal commits is 7? As it is, this 111 number is hard to verify.
On 11/24/2020 8:46 PM, Jonathan Tan wrote: > I've reviewed the commit message already [1]; now let's look at the code. > > [1] https://lore.kernel.org/git/20201124060738.762751-1-jonathantanmy@google.com/ > >> struct bb_commit { >> struct commit_list *reverse_edges; >> + struct bitmap *commit_mask; >> struct bitmap *bitmap; >> - unsigned selected:1; >> + unsigned selected:1, >> + maximal:1; > > The code itself probably should contain comments about the algorithm, > but I'm not sure of the best way to do it. (E.g. I would say that > "maximal" should be "When iteration in bitmap_builder_init() reaches > this bb_commit, this is true iff none of its descendants has or will > ever have the exact same commit_mask" - but then when do we explain why > the commit_mask matters?) Comments are tricky, as they are likely to go stale. In fact, this algorithm changes dramatically later in this very series! How much can we expect a reader to inspect the commit history to discover the lengthy commit message? The message explains the algorithm and its many subtleties versus reading comments that may be too specific to this initial version. At this point, "maximal" is a property that doesn't mean much without inspecting the places where we set or check that bit. >> + if (c_ent->maximal) { >> + if (!c_ent->selected) { >> + bitmap_set(c_ent->commit_mask, num_maximal); >> + num_maximal++; >> + } >> + >> + ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); >> + bb->commits[bb->commits_nr++] = commit; > > So the order of bit assignments in the commit_mask and the order of > commits in bb->commits are not the same. In the commit_mask, bits are > first assigned for selected commits and then the rest for commits we > discover to be maximal. But in bb->commits, the order follows the > topologically-sorted iteration. This is fine, but might be worth a > comment (to add to the already big comment burden...) This one I waffled on a bit. There _is_ a difference between the bitmask order and this list order. This isn't a problem as long as no one attempts to use the bitmasks to navigate into this list. Further, it is entirely possible that the tests to never demonstrate that difference, so pointing out may help a future developer from making that mistake. Could be good to add this comment over the bb->commits definition: /* * 'commits' stores the list of maximal commits, in visited * order. This can be different than the bitmask order for * the selected commits. */ >> +# To ensure the logic for "maximal commits" is exercised, make >> +# the repository a bit more complicated. >> +# >> +# other master >> +# * * >> +# (99 commits) (99 commits) >> +# * * >> +# |\ /| >> +# | * octo-other octo-master * | >> +# |/|\_________ ____________/|\| >> +# | \ \/ __________/ | >> +# | | ________/\ / | >> +# * |/ * merge-right * >> +# | _|__________/ \____________ | >> +# |/ | \| >> +# (l1) * * merge-left * (r1) >> +# | / \________________________ | >> +# |/ \| >> +# (l2) * * (r2) >> +# \____________...____________ | > > What does the ... represent? If a certain number of commits, it would be > clearer to write that there. The ... are unnecessary and should be ___ instead. Thanks. >> +# \| >> +# * (base) > > OK - some of the crosses are unclear, but from the bitmask given below, > I know where the lines should go. > >> +# >> +# The important part for the maximal commit algorithm is how >> +# the bitmasks are extended. Assuming starting bit positions >> +# for master (bit 0) and other (bit 1), and some flexibility >> +# in the order that merge bases are visited, the bitmasks at >> +# the end should be: >> +# >> +# master: 1 (maximal, selected) >> +# other: 01 (maximal, selected) >> +# octo-master: 1 >> +# octo-other: 01 >> +# merge-right: 111 (maximal) >> +# (l1): 111 >> +# (r1): 111 >> +# merge-left: 1101 (maximal) >> +# (l2): 11111 (maximal) >> +# (r2): 111101 (maximal) >> +# (base): 1111111 (maximal) > > This makes sense. (l1) and (r1) are not maximal because everything that > can reach merge-right can also reach them. > > [snip] > >> test_expect_success 'full repack creates bitmaps' ' >> - git repack -ad && >> + GIT_TRACE2_EVENT_NESTING=4 GIT_TRACE2_EVENT="$(pwd)/trace" \ >> + git repack -ad && >> ls .git/objects/pack/ | grep bitmap >output && >> - test_line_count = 1 output >> + test_line_count = 1 output && >> + grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace && >> + grep "\"key\":\"num_maximal_commits\",\"value\":\"111\"" trace > > From the diagram and bit masks, I see that the important number for > "maximal" is 7. Could this test be run twice - one without the crosses > and one with, and we can verify that the difference between the maximal > commits is 7? As it is, this 111 number is hard to verify. This number _is_ hard to verify. It is fragile to many behaviors inside the code of Git, including the selection algorithm and some hard-coded limits (there's a reason we insert 100 commits on top of each side). Further, this number changes as the algorithm is modified. Perhaps the best way to recognize this number is that it changes by adding 5 to the previous number (these are the 5 "newly maximal" commits, since two are already selected as tips). This number changes again later, and the difference is justified by the number of maximal commits dropping by 4. Thanks, -Stolee
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 369c76a87c..7b4fc0f304 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -180,8 +180,10 @@ static void compute_xor_offsets(void) struct bb_commit { struct commit_list *reverse_edges; + struct bitmap *commit_mask; struct bitmap *bitmap; - unsigned selected:1; + unsigned selected:1, + maximal:1; unsigned idx; /* within selected array */ }; @@ -198,7 +200,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb, { struct rev_info revs; struct commit *commit; - unsigned int i; + unsigned int i, num_maximal; memset(bb, 0, sizeof(*bb)); init_bb_data(&bb->data); @@ -210,27 +212,85 @@ static void bitmap_builder_init(struct bitmap_builder *bb, for (i = 0; i < writer->selected_nr; i++) { struct commit *c = writer->selected[i].commit; struct bb_commit *ent = bb_data_at(&bb->data, c); + ent->selected = 1; + ent->maximal = 1; ent->idx = i; + + ent->commit_mask = bitmap_new(); + bitmap_set(ent->commit_mask, i); + add_pending_object(&revs, &c->object, ""); } + num_maximal = writer->selected_nr; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); while ((commit = get_revision(&revs))) { struct commit_list *p; + struct bb_commit *c_ent; parse_commit_or_die(commit); - ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); - bb->commits[bb->commits_nr++] = commit; + c_ent = bb_data_at(&bb->data, commit); + + if (c_ent->maximal) { + if (!c_ent->selected) { + bitmap_set(c_ent->commit_mask, num_maximal); + num_maximal++; + } + + ALLOC_GROW(bb->commits, bb->commits_nr + 1, bb->commits_alloc); + bb->commits[bb->commits_nr++] = commit; + } for (p = commit->parents; p; p = p->next) { - struct bb_commit *ent = bb_data_at(&bb->data, p->item); - commit_list_insert(commit, &ent->reverse_edges); + struct bb_commit *p_ent = bb_data_at(&bb->data, p->item); + int c_not_p, p_not_c; + + if (!p_ent->commit_mask) { + p_ent->commit_mask = bitmap_new(); + c_not_p = 1; + p_not_c = 0; + } else { + c_not_p = bitmap_diff_nonzero(c_ent->commit_mask, p_ent->commit_mask); + p_not_c = bitmap_diff_nonzero(p_ent->commit_mask, c_ent->commit_mask); + } + + if (!c_not_p) + continue; + + bitmap_or(p_ent->commit_mask, c_ent->commit_mask); + + if (p_not_c) + p_ent->maximal = 1; + else { + p_ent->maximal = 0; + free_commit_list(p_ent->reverse_edges); + p_ent->reverse_edges = NULL; + } + + if (c_ent->maximal) { + commit_list_insert(commit, &p_ent->reverse_edges); + } else { + struct commit_list *cc = c_ent->reverse_edges; + + for (; cc; cc = cc->next) { + if (!commit_list_contains(cc->item, p_ent->reverse_edges)) + commit_list_insert(cc->item, &p_ent->reverse_edges); + } + } } + + bitmap_free(c_ent->commit_mask); + c_ent->commit_mask = NULL; } + + trace2_data_intmax("pack-bitmap-write", the_repository, + "num_selected_commits", writer->selected_nr); + trace2_data_intmax("pack-bitmap-write", the_repository, + "num_maximal_commits", num_maximal); } static void bitmap_builder_clear(struct bitmap_builder *bb) diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 6bf68fee85..1691710ec1 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -20,11 +20,87 @@ has_any () { grep -Ff "$1" "$2" } +# To ensure the logic for "maximal commits" is exercised, make +# the repository a bit more complicated. +# +# other master +# * * +# (99 commits) (99 commits) +# * * +# |\ /| +# | * octo-other octo-master * | +# |/|\_________ ____________/|\| +# | \ \/ __________/ | +# | | ________/\ / | +# * |/ * merge-right * +# | _|__________/ \____________ | +# |/ | \| +# (l1) * * merge-left * (r1) +# | / \________________________ | +# |/ \| +# (l2) * * (r2) +# \____________...____________ | +# \| +# * (base) +# +# The important part for the maximal commit algorithm is how +# the bitmasks are extended. Assuming starting bit positions +# for master (bit 0) and other (bit 1), and some flexibility +# in the order that merge bases are visited, the bitmasks at +# the end should be: +# +# master: 1 (maximal, selected) +# other: 01 (maximal, selected) +# octo-master: 1 +# octo-other: 01 +# merge-right: 111 (maximal) +# (l1): 111 +# (r1): 111 +# merge-left: 1101 (maximal) +# (l2): 11111 (maximal) +# (r2): 111101 (maximal) +# (base): 1111111 (maximal) + test_expect_success 'setup repo with moderate-sized history' ' - test_commit_bulk --id=file 100 && + test_commit_bulk --id=file 10 && git checkout -b other HEAD~5 && test_commit_bulk --id=side 10 && + + # add complicated history setup, including merges and + # ambiguous merge-bases + + git checkout -b merge-left other~2 && + git merge master~2 -m "merge-left" && + + git checkout -b merge-right master~1 && + git merge other~1 -m "merge-right" && + + git checkout -b octo-master master && + git merge merge-left merge-right -m "octopus-master" && + + git checkout -b octo-other other && + git merge merge-left merge-right -m "octopus-other" && + + git checkout other && + git merge octo-other -m "pull octopus" && + git checkout master && + git merge octo-master -m "pull octopus" && + + # Remove these branches so they are not selected + # as bitmap tips + git branch -D merge-left && + git branch -D merge-right && + git branch -D octo-other && + git branch -D octo-master && + + # add padding to make these merges less interesting + # and avoid having them selected for bitmaps + test_commit_bulk --id=file 100 && + git checkout other && + test_commit_bulk --id=side 100 && + git checkout master && + bitmaptip=$(git rev-parse master) && blob=$(echo tagged-blob | git hash-object -w --stdin) && git tag tagged-blob $blob && @@ -32,9 +108,12 @@ test_expect_success 'setup repo with moderate-sized history' ' ' test_expect_success 'full repack creates bitmaps' ' - git repack -ad && + GIT_TRACE2_EVENT_NESTING=4 GIT_TRACE2_EVENT="$(pwd)/trace" \ + git repack -ad && ls .git/objects/pack/ | grep bitmap >output && - test_line_count = 1 output + test_line_count = 1 output && + grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace && + grep "\"key\":\"num_maximal_commits\",\"value\":\"111\"" trace ' test_expect_success 'rev-list --test-bitmap verifies bitmaps' ' @@ -356,7 +435,7 @@ test_expect_success 'truncated bitmap fails gracefully (ewah)' ' git rev-list --use-bitmap-index --count --all >expect && bitmap=$(ls .git/objects/pack/*.bitmap) && test_when_finished "rm -f $bitmap" && - test_copy_bytes 256 <$bitmap >$bitmap.tmp && + test_copy_bytes 270 <$bitmap >$bitmap.tmp && mv -f $bitmap.tmp $bitmap && git rev-list --use-bitmap-index --count --all >actual 2>stderr && test_cmp expect actual &&