diff mbox series

[v2] object-file: reprepare alternates when necessary

Message ID pull.1490.v2.git.1678301252360.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 057a59ecf3425f4a6e0f7166c02476244316b35c
Headers show
Series [v2] object-file: reprepare alternates when necessary | expand

Commit Message

Derrick Stolee March 8, 2023, 6:47 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

When an object is not found in a repository's object store, we sometimes
call reprepare_packed_git() to see if the object was temporarily moved
into a new pack-file (and its old pack-file or loose object was
deleted). This process does a scan of each pack directory within each
odb, but does not reevaluate if the odb list needs updating.

Extend reprepare_packed_git() to also reprepate the alternate odb list
by setting loaded_alternates to zero and calling prepare_alt_odb(). This
will add newly-discoverd odbs to the linked list, but will not duplicate
existing ones nor will it remove existing ones that are no longer listed
in the alternates file. Do this under the object read lock to avoid
readers from interacting with a potentially incomplete odb being added
to the odb list.

If the alternates file was edited to _remove_ some alternates during the
course of the Git process, Git will continue to see alternates that were
ever valid for that repository. ODBs are not removed from the list, the
same as the existing behavior before this change. Git already has
protections against an alternate directory disappearing from the
filesystem during the lifetime of a process, and those are still in
effect.

This change is specifically for concurrent changes to the repository, so
it is difficult to create a test that guarantees this behavior is
correct. I manually verified by introducing a reprepare_packed_git() call
into get_revision() and stepped into that call in a debugger with a
parent 'git log' process. Multiple runs of prepare_alt_odb() kept
the_repository->objects->odb as a single-item chain until I added a
.git/objects/info/alternates file in a different process. The next run
added the new odb to the chain and subsequent runs did not add to the
chain.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
    object-file: reprepare alternates when necessary
    
    This subtlety was notice by Michael Haggerty due to how alternates are
    used server-side at $DAYJOB. Moving pack-files from a repository to the
    alternate occasionally causes failures because processes that start
    before the alternate exists don't know how to find that alternate at
    run-time.
    
    
    Update in v2
    ============
    
     * Instead of creating a new public reprepare_alt_odb() method, inline
       its implementation inside reprepare_packed_git().
     * Update commit message to be explicit about behavior with alternates
       being removed from the alternates file or from disk.
    
    Thanks,
    
     * Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1490%2Fderrickstolee%2Fstolee%2Freprepare-alternates-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1490/derrickstolee/stolee/reprepare-alternates-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1490

Range-diff vs v1:

 1:  3f42c9cdef7 ! 1:  9a5e1d9a9da object-file: reprepare alternates when necessary
     @@ Commit message
          deleted). This process does a scan of each pack directory within each
          odb, but does not reevaluate if the odb list needs updating.
      
     -    Create a new reprepare_alt_odb() method that is a similar wrapper around
     -    prepare_alt_odb(). Call it from reprepare_packed_git() under the object
     -    read lock to avoid readers from interacting with a potentially
     -    incomplete odb being added to the odb list.
     -
     -    prepare_alt_odb() already avoids adding duplicate odbs to the list
     -    during its progress, so it is safe to call it again from
     -    reprepare_alt_odb() without worrying about duplicate odbs.
     +    Extend reprepare_packed_git() to also reprepate the alternate odb list
     +    by setting loaded_alternates to zero and calling prepare_alt_odb(). This
     +    will add newly-discoverd odbs to the linked list, but will not duplicate
     +    existing ones nor will it remove existing ones that are no longer listed
     +    in the alternates file. Do this under the object read lock to avoid
     +    readers from interacting with a potentially incomplete odb being added
     +    to the odb list.
     +
     +    If the alternates file was edited to _remove_ some alternates during the
     +    course of the Git process, Git will continue to see alternates that were
     +    ever valid for that repository. ODBs are not removed from the list, the
     +    same as the existing behavior before this change. Git already has
     +    protections against an alternate directory disappearing from the
     +    filesystem during the lifetime of a process, and those are still in
     +    effect.
      
          This change is specifically for concurrent changes to the repository, so
          it is difficult to create a test that guarantees this behavior is
          correct. I manually verified by introducing a reprepare_packed_git() call
          into get_revision() and stepped into that call in a debugger with a
     -    parent 'git log' process. Multiple runs of reprepare_alt_odb() kept
     +    parent 'git log' process. Multiple runs of prepare_alt_odb() kept
          the_repository->objects->odb as a single-item chain until I added a
          .git/objects/info/alternates file in a different process. The next run
          added the new odb to the chain and subsequent runs did not add to the
     @@ Commit message
      
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
     - ## object-file.c ##
     -@@ object-file.c: void prepare_alt_odb(struct repository *r)
     - 	r->objects->loaded_alternates = 1;
     - }
     - 
     -+void reprepare_alt_odb(struct repository *r)
     -+{
     -+	r->objects->loaded_alternates = 0;
     -+	prepare_alt_odb(r);
     -+}
     -+
     - /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
     - static int freshen_file(const char *fn)
     - {
     -
     - ## object-store.h ##
     -@@ object-store.h: KHASH_INIT(odb_path_map, const char * /* key: odb_path */,
     - 	struct object_directory *, 1, fspathhash, fspatheq)
     - 
     - void prepare_alt_odb(struct repository *r);
     -+void reprepare_alt_odb(struct repository *r);
     - char *compute_alternate_path(const char *path, struct strbuf *err);
     - struct object_directory *find_odb(struct repository *r, const char *obj_dir);
     - typedef int alt_odb_fn(struct object_directory *, void *);
     -
       ## packfile.c ##
      @@ packfile.c: void reprepare_packed_git(struct repository *r)
       	struct object_directory *odb;
       
       	obj_read_lock();
     -+	reprepare_alt_odb(r);
     ++
     ++	/*
     ++	 * Reprepare alt odbs, in case the alternates file was modified
     ++	 * during the course of this process. This only _adds_ odbs to
     ++	 * the linked list, so existing odbs will continue to exist for
     ++	 * the lifetime of the process.
     ++	 */
     ++	r->objects->loaded_alternates = 0;
     ++	prepare_alt_odb(r);
     ++
       	for (odb = r->objects->odb; odb; odb = odb->next)
       		odb_clear_loose_cache(odb);
       


 packfile.c | 10 ++++++++++
 1 file changed, 10 insertions(+)


base-commit: d15644fe0226af7ffc874572d968598564a230dd

Comments

Junio C Hamano March 8, 2023, 7:35 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> When an object is not found in a repository's object store, we sometimes
> call reprepare_packed_git() to see if the object was temporarily moved
> into a new pack-file (and its old pack-file or loose object was
> deleted). This process does a scan of each pack directory within each
> odb, but does not reevaluate if the odb list needs updating.

Very well explained, except I do not know what is meant by
"temporarily", which to me implies what was moved is moved back
later.  Without "temporarily", it reads perfectly well.  While I do
not think of a good word to use to tell the readers that the moving
is done by somebody else while we are trying to find the object, if
there were one, that would replace "temporarily" very well.

But other than that tiny nit that does not have to be addressed at
all, everything in this patch looks good.

Thanks.
Taylor Blau March 8, 2023, 8:47 p.m. UTC | #2
On Wed, Mar 08, 2023 at 11:35:06AM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Derrick Stolee <derrickstolee@github.com>
> >
> > When an object is not found in a repository's object store, we sometimes
> > call reprepare_packed_git() to see if the object was temporarily moved
> > into a new pack-file (and its old pack-file or loose object was
> > deleted). This process does a scan of each pack directory within each
> > odb, but does not reevaluate if the odb list needs updating.
>
> Very well explained, except I do not know what is meant by
> "temporarily", which to me implies what was moved is moved back
> later.  Without "temporarily", it reads perfectly well.  While I do
> not think of a good word to use to tell the readers that the moving
> is done by somebody else while we are trying to find the object, if
> there were one, that would replace "temporarily" very well.
>
> But other than that tiny nit that does not have to be addressed at
> all, everything in this patch looks good.

s/temporarily/concurrently?

I think that concurrently conveys that another process was doing the
moving. But I agree that it doesn't matter here to aid in the overall
understanding of what's happening in this patch.

So I'd be fine with this patch as-is, with s/temporarily/concurrently,
or with s/temporarily//.

Thanks,
Taylor
Jeff King March 9, 2023, 7:24 a.m. UTC | #3
On Wed, Mar 08, 2023 at 06:47:32PM +0000, Derrick Stolee via GitGitGadget wrote:

> Extend reprepare_packed_git() to also reprepate the alternate odb list

s/reprepate/reprepare/

Unless you were inventing a new word. :)

> This change is specifically for concurrent changes to the repository, so
> it is difficult to create a test that guarantees this behavior is
> correct. I manually verified by introducing a reprepare_packed_git() call
> into get_revision() and stepped into that call in a debugger with a
> parent 'git log' process. Multiple runs of prepare_alt_odb() kept
> the_repository->objects->odb as a single-item chain until I added a
> .git/objects/info/alternates file in a different process. The next run
> added the new odb to the chain and subsequent runs did not add to the
> chain.

One thing that test wouldn't cover is loading alternates from
$GIT_ALTERNATE_OBJECT_DIRECTORIES. Once upon a time, I think we were
pretty inconsistent in finding duplicates. But these days it looks like
it's all done centrally in link_alt_odb_entry(), via alt_odb_usable().
So it should be good.

>     object-file: reprepare alternates when necessary
>     
>     This subtlety was notice by Michael Haggerty due to how alternates are
>     used server-side at $DAYJOB. Moving pack-files from a repository to the
>     alternate occasionally causes failures because processes that start
>     before the alternate exists don't know how to find that alternate at
>     run-time.

Yeah, I don't think there's any real reason not to do this. It simply
doesn't come up in the same way that packfile-repreparing does, because
there it is routine for a concurrent gc to remove an existing packfile
and add the objects elsewhere.

Whereas modifying alternates was never seen as something that would
happen often or automatically. I guess in GitHub's case it is from
converting a repository from stand-alone into a fork, migrating its
objects to a shared one, and adding an alternate.

The only downside might be performance. For sane cases, I think scanning
the new alternates is OK. I know Eric (cc'd) has some crazy
100k-alternate setup (from 407532f82d, etc), but I'd expect a reprepare
there is already expensive (we already have to re-scan every one of
those directories for packfiles, and throw out any loose object caches).

The patch itself looks good to me (and I agree with the sentiment to
just inline the lines rather than adding a function).

-Peff
Eric Wong March 9, 2023, 9:06 a.m. UTC | #4
Jeff King <peff@peff.net> wrote:
> The only downside might be performance. For sane cases, I think scanning
> the new alternates is OK. I know Eric (cc'd) has some crazy
> 100k-alternate setup (from 407532f82d, etc), but I'd expect a reprepare
> there is already expensive (we already have to re-scan every one of
> those directories for packfiles, and throw out any loose object caches).

I'm not sure if that 100k alternate thing is happening, yet...
(initial specs called for ~30k, but I figured it might grow)

If it does, I'm thinking about enhancing --batch-command, to support
`add-alternate' to dynamically add alternates while running cat-file.

Right now, my biggest use case is only 250 alternates or so.
Jonathan Tan March 10, 2023, 9:29 p.m. UTC | #5
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>     Update in v2
>     ============
>     
>      * Instead of creating a new public reprepare_alt_odb() method, inline
>        its implementation inside reprepare_packed_git().

[snip]

> diff --git a/packfile.c b/packfile.c
> index 79e21ab18e7..06419c8e8ca 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1008,6 +1008,16 @@ void reprepare_packed_git(struct repository *r)
>  	struct object_directory *odb;
>  
>  	obj_read_lock();
> +
> +	/*
> +	 * Reprepare alt odbs, in case the alternates file was modified
> +	 * during the course of this process. This only _adds_ odbs to
> +	 * the linked list, so existing odbs will continue to exist for
> +	 * the lifetime of the process.
> +	 */
> +	r->objects->loaded_alternates = 0;
> +	prepare_alt_odb(r);

I understand that the change to inline what was originally
reprepare_alt_odb() was in response to a reviewer comment, but I would
prefer the original version, since this assumes that prepare_alt_odb()
only adds to what's there instead of first clearing the odb linked list
and odb_by_path hashmap.

But I guess clearing a linked list and hashmap can be a bit cumbersome
in C, so maybe it would be reasonable to assume that this behavior
would not change. In any case, maybe a comment should be added
to prepare_alt_odb() saying that if an update of the alternates
is needed, one can do so by clearing loaded_alternates and re-
invoking prepare_alt_odb() (at least so that a developer changing
prepare_alt_odb() can see the comment and understand what behaviors this
function needs to preserve).

Everything else (the new functionality, the commit message etc.) looks
good, of course.
Junio C Hamano March 11, 2023, 12:01 a.m. UTC | #6
Jonathan Tan <jonathantanmy@google.com> writes:

> But I guess clearing a linked list and hashmap can be a bit cumbersome
> in C, so maybe it would be reasonable to assume that this behavior
> would not change.

I think the original reason why we did not clear was because we
never knew (and we do not know) what is still in use.  Can anybody
explain why it would be a safe thing to do to rebuild the alt-odb
list from scratch?  Surely we can have a big central lock to do the
"list manipulation" part safely, but is it safe to access the
objects we obtained from one of the odb's that no longer is listed
on the alt-odb list, for example?  The code that this patch touches
clears the loose object cache after updating the odb list, so loose
object cache of any odb that goes away will not be cleared here,
which means that the code that reconstruts alt-odb list would need
to clear the loose object cache automatically?  What should we do to
packfiles that are mmaped from these alt-odbs that goes away?  Etc.
etc.

> In any case, maybe a comment should be added to prepare_alt_odb()
> saying that if an update of the alternates is needed, one can do
> so by clearing loaded_alternates and re- invoking
> prepare_alt_odb() (at least so that a developer changing
> prepare_alt_odb() can see the comment and understand what
> behaviors this function needs to preserve).

Sounds good.

Thanks.
Jonathan Tan March 11, 2023, 3:09 a.m. UTC | #7
Junio C Hamano <gitster@pobox.com> writes:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > But I guess clearing a linked list and hashmap can be a bit cumbersome
> > in C, so maybe it would be reasonable to assume that this behavior
> > would not change.
> 
> I think the original reason why we did not clear was because we
> never knew (and we do not know) what is still in use.  

Ah, I meant that prepare_alt_odb() assumes that the data structures
are as they were originally initialized. So I can think of at least
2 ways that we could implement prepare_alt_odb():

 (a) Repeatedly modify r->objects->odb_tail and repeatedly assign to
     kh_value(r->objects->odb_by_path, pos).
 (b) Call make_odb(), which returns a pointer that we assign to
     r->objects->odb, and call make_odb_hashmap(), which returns a
     pointer that we assign to r->objects->odb_by_path.

Remember that prepare_alt_odb() is meant to start from scratch, so it is
reasonable to do either.

In Java, for example, it would be very reasonable to do (b). But
this patch assumes that we have implemented it as (a). Indeed we
have implemented it as (a) (although we have never documented it as
such), and my quoted statement above is meant to say that it might be
reasonable to keep assuming that it would be (a).
diff mbox series

Patch

diff --git a/packfile.c b/packfile.c
index 79e21ab18e7..06419c8e8ca 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1008,6 +1008,16 @@  void reprepare_packed_git(struct repository *r)
 	struct object_directory *odb;
 
 	obj_read_lock();
+
+	/*
+	 * Reprepare alt odbs, in case the alternates file was modified
+	 * during the course of this process. This only _adds_ odbs to
+	 * the linked list, so existing odbs will continue to exist for
+	 * the lifetime of the process.
+	 */
+	r->objects->loaded_alternates = 0;
+	prepare_alt_odb(r);
+
 	for (odb = r->objects->odb; odb; odb = odb->next)
 		odb_clear_loose_cache(odb);