Message ID | 20190909190130.146613-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] cache-tree: do not lazy-fetch merge tree |
Jonathan Tan <jonathantanmy@google.com> writes:

> When cherry-picking (for example), new trees may be constructed. During
> this process, Git constructs the new tree in a struct strbuf, computes
> the OID of the new tree, and checks if the new OID already exists on
> disk. However, in a partial clone, the disk check causes a lazy fetch to
> occur, which is both unnecessary (because we have the tree in the struct
> strbuf) and likely to fail (because the remote probably doesn't have
> this tree).

I somehow smell that the above misses the point of the check in the
first place, though.

The reason why we are computing the tree object's name and seeing if
we have it locally on disk is to decide if we want to record it in
the cache tree, *without* writing the tree out to our object store,
no? It is not just unnecessary but also against the goal of the
codepath to lazily download it, even if the tree is available
remotely. And it is irrelevant that there are cases the remote does
not have it---we have no need to mention that we must be prepared to
see the lazy fetch to fail. Even when they do have one, we do not
want to fetch it and write to our object store.

Isn't that what is going on? I thought I dug up the original that
introduced the has_object_file() call to this codepath to make sure
we understand why we make the check (and I expected the person who
is proposing this change to do the same and record the finding in
the proposed log message).

I am running out of time today, and will revisit later this week
(I'll be down for at least two days starting tomorrow, by the way).

Thanks.
Junio C Hamano <gitster@pobox.com> writes:

> Isn't that what is going on? I thought I dug up the original that
> introduced the has_object_file() call to this codepath to make sure
> we understand why we make the check (and I expected the person who
> is proposing this change to do the same and record the finding in
> the proposed log message).
>
> I am running out of time today, and will revisit later this week
> (I'll be down for at least two days starting tomorrow, by the way).

Here is what I came up with.

The cache-tree datastructure is used to speed up the comparison
between the HEAD and the index, and when the index is updated by
a cherry-pick (for example), a tree object that would represent
the paths in the index in a directory is constructed in-core, to
see if such a tree object exists already in the object store.

When the lazy-fetch mechanism was introduced, we converted this
"does the tree exist?" check into an "if it does not, and if we
lazily cloned, see if the remote has it" call by mistake. Since
the whole point of this check is to repair the cache-tree by
recording an already existing tree object opportunistically, we
shouldn't even try to fetch one from the remote.

Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
check for existence in the local object store without triggering the
lazy fetch mechanism.
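A minimal sketch of the behaviour described in the message above, written as a self-contained toy model and not as Git's actual code. The names `toy_store`, `toy_lookup_local`, `toy_lazy_fetch`, `toy_has_object`, and `TOY_SKIP_FETCH` are hypothetical; only `OBJECT_INFO_SKIP_FETCH_OBJECT` and `has_object_file_with_flags()` appear in the patch. The model assumes that a lookup first consults the local store and, in a partial clone, falls back to a lazy fetch unless the caller opts out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, standing in for OBJECT_INFO_SKIP_FETCH_OBJECT. */
#define TOY_SKIP_FETCH (1u << 0)

/* Hypothetical toy object store: a list of locally present object names. */
struct toy_store {
	const char **local;     /* names of objects present locally */
	size_t nr_local;
	bool is_partial_clone;  /* true if missing objects may be promised */
	int fetch_attempts;     /* counts simulated lazy fetches */
};

/* Return true if the object is present in the local store. */
static bool toy_lookup_local(const struct toy_store *s, const char *oid)
{
	for (size_t i = 0; i < s->nr_local; i++)
		if (!strcmp(s->local[i], oid))
			return true;
	return false;
}

/*
 * Simulated lazy fetch. Assumption of this model: the remote never has
 * the freshly constructed tree, so the fetch always fails.
 */
static bool toy_lazy_fetch(struct toy_store *s, const char *oid)
{
	(void)oid;
	s->fetch_attempts++;
	return false;
}

/*
 * Existence check. Without TOY_SKIP_FETCH, a local miss in a partial
 * clone triggers a lazy fetch; with it, only the local store is consulted.
 */
static bool toy_has_object(struct toy_store *s, const char *oid, unsigned flags)
{
	if (toy_lookup_local(s, oid))
		return true;
	if (s->is_partial_clone && !(flags & TOY_SKIP_FETCH))
		return toy_lazy_fetch(s, oid);
	return false;
}
```

In this model, a caller that only wants to record an already existing tree opportunistically would pass `TOY_SKIP_FETCH`, so a miss stays a cheap local miss and no network round trip is attempted.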
On Mon, Sep 09, 2019 at 02:05:53PM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> > Isn't that what is going on? I thought I dug up the original that
> > introduced the has_object_file() call to this codepath to make sure
> > we understand why we make the check (and I expected the person who
> > is proposing this change to do the same and record the finding in
> > the proposed log message).
> >
> > I am running out of time today, and will revisit later this week
> > (I'll be down for at least two days starting tomorrow, by the way).
>
> Here is what I came up with.
>
> The cache-tree datastructure is used to speed up the comparison
> between the HEAD and the index, and when the index is updated by
> a cherry-pick (for example), a tree object that would represent
> the paths in the index in a directory is constructed in-core, to
> see if such a tree object exists already in the object store.
>
> When the lazy-fetch mechanism was introduced, we converted this
> "does the tree exist?" check into an "if it does not, and if we
> lazily cloned, see if the remote has it" call by mistake. Since
> the whole point of this check is to repair the cache-tree by
> recording an already existing tree object opportunistically, we
> shouldn't even try to fetch one from the remote.
>
> Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
> check for existence in the local object store without triggering the
> lazy fetch mechanism.

As a third-party observer, that explanation makes sense to me.

I wondered also if this means we should be using OBJECT_INFO_QUICK.
I.e., do we expect to see a "miss" here often, forcing us to re-scan the
packed directory? Reading dd0c34c46b (cache-tree: protect against "git
prune"., 2006-04-24), I think the answer is "no".

-Peff
Jeff King <peff@peff.net> writes:

> I wondered also if this means we should be using OBJECT_INFO_QUICK.
> I.e., do we expect to see a "miss" here often, forcing us to re-scan the
> packed directory?

As a performance optimization hack, it is OK if we did not notice
that the tree object, which corresponds to what is currently
prepared for a directory in the index, does exist in the object
store. It is not worth rescanning the packs to "protect" against
races, I think, in the "repair" codepath.

When the user actually wants to write the index out as a tree, we
would write it out as a loose object (or omit doing so if we know
there are already copies), but because it is not a crime to create a
duplicate loose object when we already have a packed copy, I do not
think we need to rescan in that context, either. But I do not think
the codepath Jonathan's patch touches is about that operation.
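The trade-off discussed in the two messages above can be illustrated with a small self-contained toy model. This is a sketch under stated assumptions, not Git's implementation: `toy_packs`, `toy_find`, `toy_has_packed`, and `TOY_QUICK` are hypothetical names, with `TOY_QUICK` standing in for `OBJECT_INFO_QUICK`. The model assumes that a lookup consults an already scanned view of the packs and that, on a miss, the store is rescanned and the lookup retried unless the caller asked for a quick check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, standing in for OBJECT_INFO_QUICK. */
#define TOY_QUICK (1u << 0)

/* Hypothetical toy model of the packed objects. */
struct toy_packs {
	const char **on_disk;  /* objects that really exist on disk */
	size_t nr_on_disk;
	size_t nr_cached;      /* prefix of on_disk already scanned */
	int rescans;           /* counts how often we rescanned */
};

/* Look only at the part of the store that has already been scanned. */
static bool toy_find(const struct toy_packs *p, const char *oid)
{
	for (size_t i = 0; i < p->nr_cached; i++)
		if (!strcmp(p->on_disk[i], oid))
			return true;
	return false;
}

/*
 * Existence check. On a miss, a quick lookup gives up immediately;
 * a thorough lookup refreshes its view of the store and retries.
 */
static bool toy_has_packed(struct toy_packs *p, const char *oid, unsigned flags)
{
	if (toy_find(p, oid))
		return true;
	if (flags & TOY_QUICK)
		return false;
	p->nr_cached = p->nr_on_disk;  /* simulated rescan */
	p->rescans++;
	return toy_find(p, oid);
}
```

A quick lookup can report a false "missing" for an object that appeared after the last scan, which is the kind of miss the reply above considers acceptable for an opportunistic optimization.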
On Mon, Sep 09, 2019 at 12:01:30PM -0700, Jonathan Tan wrote:

> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 6415063980..3e434b6a81 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -492,6 +492,20 @@ test_expect_success 'gc stops traversal when a missing but promised object is re
>  	! grep "$TREE_HASH" out
>  '
>
> +test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
> +	rm -rf repo &&
> +	test_create_repo repo &&
> +	test_commit -C repo base &&
> +	test_commit -C repo side1 &&
> +	git -C repo checkout base &&
> +	test_commit -C repo side2 &&
> +
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.partialclone "arbitrary string" &&
> +
> +	git -C repo cherry-pick side1
> +'
> +

Sidenote, just curious: did you originally intend to add this test
before the test script sources 'lib-httpd.sh', or you were about to
append it at the end as usual, but then noticed the warning comment
telling you not to do so?

>  . "$TEST_DIRECTORY"/lib-httpd.sh
>  start_httpd
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Isn't that what is going on? I thought I dug up the original that
> > introduced the has_object_file() call to this codepath to make sure
> > we understand why we make the check (and I expected the person who
> > is proposing this change to do the same and record the finding in
> > the proposed log message).
> >
> > I am running out of time today, and will revisit later this week
> > (I'll be down for at least two days starting tomorrow, by the way).
>
> Here is what I came up with.
>
> The cache-tree datastructure is used to speed up the comparison
> between the HEAD and the index, and when the index is updated by
> a cherry-pick (for example), a tree object that would represent
> the paths in the index in a directory is constructed in-core, to
> see if such a tree object exists already in the object store.
>
> When the lazy-fetch mechanism was introduced, we converted this
> "does the tree exist?" check into an "if it does not, and if we
> lazily cloned, see if the remote has it" call by mistake. Since
> the whole point of this check is to repair the cache-tree by
> recording an already existing tree object opportunistically, we
> shouldn't even try to fetch one from the remote.
>
> Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
> check for existence in the local object store without triggering the
> lazy fetch mechanism.

This commit message looks good to me. Thanks for writing the commit
message - I thought that the justification in the commit message I
wrote would be sufficient, but it makes sense to look into why the
check was done.
> Sidenote, just curious: did you originally intend to add this test
> before the test script sources 'lib-httpd.sh', or you were about to
> append it at the end as usual, but then noticed the warning comment
> telling you not to do so?

Honestly, I don't remember. I do try to put tests near similar tests,
so I might have seen that we had HTTP tests at the bottom and non-HTTP
tests at the top, but in this case, I don't remember if I had that
thought before putting this test where it is now.
diff --git a/cache-tree.c b/cache-tree.c
index c22161f987..9e596893bc 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -407,7 +407,7 @@ static int update_one(struct cache_tree *it,
 	if (repair) {
 		struct object_id oid;
 		hash_object_file(buffer.buf, buffer.len, tree_type, &oid);
-		if (has_object_file(&oid))
+		if (has_object_file_with_flags(&oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 			oidcpy(&it->oid, &oid);
 		else
 			to_invalidate = 1;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 6415063980..3e434b6a81 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -492,6 +492,20 @@ test_expect_success 'gc stops traversal when a missing but promised object is re
 	! grep "$TREE_HASH" out
 '
 
+test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo base &&
+	test_commit -C repo side1 &&
+	git -C repo checkout base &&
+	test_commit -C repo side2 &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+
+	git -C repo cherry-pick side1
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
When cherry-picking (for example), new trees may be constructed. During
this process, Git constructs the new tree in a struct strbuf, computes
the OID of the new tree, and checks if the new OID already exists on
disk. However, in a partial clone, the disk check causes a lazy fetch to
occur, which is both unnecessary (because we have the tree in the struct
strbuf) and likely to fail (because the remote probably doesn't have
this tree).

Do not lazy fetch in this situation.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
As requested in What's Cooking [1], here's a patch with an updated commit
message. Otherwise, the patch is exactly the same.

[1] https://public-inbox.org/git/xmqqd0gcm2zm.fsf@gitster-ct.c.googlers.com/
---
 cache-tree.c             |  2 +-
 t/t0410-partial-clone.sh | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)