Message ID | pull.1490.v2.git.1678301252360.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 057a59ecf3425f4a6e0f7166c02476244316b35c |
Headers | show |
Series | [v2] object-file: reprepare alternates when necessary | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <derrickstolee@github.com> > > When an object is not found in a repository's object store, we sometimes > call reprepare_packed_git() to see if the object was temporarily moved > into a new pack-file (and its old pack-file or loose object was > deleted). This process does a scan of each pack directory within each > odb, but does not reevaluate if the odb list needs updating. Very well explained, except I do not know what is meant by "temporarily", which to me implies what was moved is moved back later. Without "temporarily", it reads perfectly well. While I do not think of a good word to use to tell the readers that the moving is done by somebody else while we are trying to find the object, if there were one, that would replace "temporarily" very well. But other than that tiny nit that does not have to be addressed at all, everything in this patch looks good. Thanks.
On Wed, Mar 08, 2023 at 11:35:06AM -0800, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Derrick Stolee <derrickstolee@github.com> > > > > When an object is not found in a repository's object store, we sometimes > > call reprepare_packed_git() to see if the object was temporarily moved > > into a new pack-file (and its old pack-file or loose object was > > deleted). This process does a scan of each pack directory within each > > odb, but does not reevaluate if the odb list needs updating. > > Very well explained, except I do not know what is meant by > "temporarily", which to me implies what was moved is moved back > later. Without "temporarily", it reads perfectly well. While I do > not think of a good word to use to tell the readers that the moving > is done by somebody else while we are trying to find the object, if > there were one, that would replace "temporarily" very well. > > But other than that tiny nit that does not have to be addressed at > all, everything in this patch looks good. s/temporarily/concurrently? I think that concurrently conveys that another process was doing the moving. But I agree that it doesn't matter here to aid in the overall understanding of what's happening in this patch. So I'd be fine with this patch as-is, with s/temporarily/concurrently, or with s/temporarily//. Thanks, Taylor
On Wed, Mar 08, 2023 at 06:47:32PM +0000, Derrick Stolee via GitGitGadget wrote: > Extend reprepare_packed_git() to also reprepate the alternate odb list s/reprepate/reprepare/ Unless you were inventing a new word. :) > This change is specifically for concurrent changes to the repository, so > it is difficult to create a test that guarantees this behavior is > correct. I manually verified by introducing a reprepare_packed_git() call > into get_revision() and stepped into that call in a debugger with a > parent 'git log' process. Multiple runs of prepare_alt_odb() kept > the_repository->objects->odb as a single-item chain until I added a > .git/objects/info/alternates file in a different process. The next run > added the new odb to the chain and subsequent runs did not add to the > chain. One thing that test wouldn't cover is loading alternates from $GIT_ALTERNATE_OBJECT_DIRECTORIES. Once upon a time, I think we were pretty inconsistent in finding duplicates. But these days it looks like it's all done centrally in link_alt_odb_entry(), via alt_odb_usable(). So it should be good. > object-file: reprepare alternates when necessary > > This subtlety was notice by Michael Haggerty due to how alternates are > used server-side at $DAYJOB. Moving pack-files from a repository to the > alternate occasionally causes failures because processes that start > before the alternate exists don't know how to find that alternate at > run-time. Yeah, I don't think there's any real reason not to do this. It simply doesn't come up in the same way that packfile-repreparing does, because there it is routine for a concurrent gc to remove an existing packfile and add the objects elsewhere. Whereas modifying alternates was never seen as something that would happen often or automatically. I guess in GitHub's case it is from converting a repository from stand-alone into a fork, migrating its objects to a shared one, and adding an alternate. The only downside might be performance. For sane cases, I think scanning the new alternates is OK. I know Eric (cc'd) has some crazy 100k-alternate setup (from 407532f82d, etc), but I'd expect a reprepare there is already expensive (we already have to re-scan every one of those directories for packfiles, and throw out any loose object caches). The patch itself looks good to me (and I agree with the sentiment to just inline the lines rather than adding a function). -Peff
Jeff King <peff@peff.net> wrote: > The only downside might be performance. For sane cases, I think scanning > the new alternates is OK. I know Eric (cc'd) has some crazy > 100k-alternate setup (from 407532f82d, etc), but I'd expect a reprepare > there is already expensive (we already have to re-scan every one of > those directories for packfiles, and throw out any loose object caches). I'm not sure if that 100k alternate thing is happening, yet... (initial specs called for ~30k, but I figured it might grow) If it does, I'm thinking about enhancing --batch-command, to support `add-alternate' to dynamically add alternates while running cat-file. Right now, my biggest use case is only 250 alternates or so.
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Update in v2 > ============ > > * Instead of creating a new public reprepare_alt_odb() method, inline > its implementation inside reprepare_packed_git(). [snip] > diff --git a/packfile.c b/packfile.c > index 79e21ab18e7..06419c8e8ca 100644 > --- a/packfile.c > +++ b/packfile.c > @@ -1008,6 +1008,16 @@ void reprepare_packed_git(struct repository *r) > struct object_directory *odb; > > obj_read_lock(); > + > + /* > + * Reprepare alt odbs, in case the alternates file was modified > + * during the course of this process. This only _adds_ odbs to > + * the linked list, so existing odbs will continue to exist for > + * the lifetime of the process. > + */ > + r->objects->loaded_alternates = 0; > + prepare_alt_odb(r); I understand that the change to inline what was originally reprepare_alt_odb() was in response to a reviewer comment, but I would prefer the original version, since this assumes that prepare_alt_odb() only adds to what's there instead of first clearing the odb linked list and odb_by_path hashmap. But I guess clearing a linked list and hashmap can be a bit cumbersome in C, so maybe it would be reasonable to assume that this behavior would not change. In any case, maybe a comment should be added to prepare_alt_odb() saying that if an update of the alternates is needed, one can do so by clearing loaded_alternates and re- invoking prepare_alt_odb() (at least so that a developer changing prepare_alt_odb() can see the comment and understand what behaviors this function needs to preserve). Everything else (the new functionality, the commit message etc.) looks good, of course.
Jonathan Tan <jonathantanmy@google.com> writes: > But I guess clearing a linked list and hashmap can be a bit cumbersome > in C, so maybe it would be reasonable to assume that this behavior > would not change. I think the original reason why we did not clear was because we never knew (and we do not know) what is still in use. Can anybody explain why it would be a safe thing to do to rebuild the alt-odb list from scratch? Surely we can have a big central lock to do the "list manipulation" part safely, but is it safe to access the objects we obtained from one of the odb's that no longer is listed on the alt-odb list, for example? The code that this patch touches clears the loose object cache after updating the odb list, so loose object cache of any odb that goes away will not be cleared here, which means that the code that reconstruts alt-odb list would need to clear the loose object cache automatically? What should we do to packfiles that are mmaped from these alt-odbs that goes away? Etc. etc. > In any case, maybe a comment should be added to prepare_alt_odb() > saying that if an update of the alternates is needed, one can do > so by clearing loaded_alternates and re- invoking > prepare_alt_odb() (at least so that a developer changing > prepare_alt_odb() can see the comment and understand what > behaviors this function needs to preserve). Sounds good. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > Jonathan Tan <jonathantanmy@google.com> writes: > > > But I guess clearing a linked list and hashmap can be a bit cumbersome > > in C, so maybe it would be reasonable to assume that this behavior > > would not change. > > I think the original reason why we did not clear was because we > never knew (and we do not know) what is still in use. Ah, I meant that prepare_alt_odb() assumes that the data structures are as they were originally initialized. So I can think of at least 2 ways that we could implement prepare_alt_odb(): (a) Repeatedly modify r->objects->odb_tail and repeatedly assign to kh_value(r->objects->odb_by_path, pos). (b) Call make_odb(), which returns a pointer that we assign to r->objects->odb, and call make_odb_hashmap(), which returns a pointer that we assign to r->objects->odb_by_path. Remember that prepare_alt_odb() is meant to start from scratch, so it is reasonable to do either. In Java, for example, it would be very reasonable to do (b). But this patch assumes that we have implemented it as (a). Indeed we have implemented it as (a) (although we have never documented it as such), and my quoted statement above is meant to say that it might be reasonable to keep assuming that it would be (a).
diff --git a/packfile.c b/packfile.c index 79e21ab18e7..06419c8e8ca 100644 --- a/packfile.c +++ b/packfile.c @@ -1008,6 +1008,16 @@ void reprepare_packed_git(struct repository *r) struct object_directory *odb; obj_read_lock(); + + /* + * Reprepare alt odbs, in case the alternates file was modified + * during the course of this process. This only _adds_ odbs to + * the linked list, so existing odbs will continue to exist for + * the lifetime of the process. + */ + r->objects->loaded_alternates = 0; + prepare_alt_odb(r); + for (odb = r->objects->odb; odb; odb = odb->next) odb_clear_loose_cache(odb);