[v3] repack: enable bitmaps by default on bare repos

Message ID	20190314091254.nescpfp3n6mbjpmh@dcvr (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Thu, 14 Mar 2019 09:12:54 +0000 From: Eric Wong <e@80x24.org> To: Jeff King <peff@peff.net> Cc: git@vger.kernel.org, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com> Subject: [PATCH v3] repack: enable bitmaps by default on bare repos Message-ID: <20190314091254.nescpfp3n6mbjpmh@dcvr> References: <20190214043127.GA19019@sigill.intra.peff.net> <20190214043743.GB19183@sigill.intra.peff.net> <20190309024944.zcbwgvn52jsw2a2e@dcvr> <20190310233956.GB3059@sigill.intra.peff.net> <20190312031303.5tutut7zzvxne5dw@dcvr> <20190312104954.GA2023@sigill.intra.peff.net> <20190313015133.n7f7lyujnlwfytre@dcvr> <20190313145417.GA24101@sigill.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20190313145417.GA24101@sigill.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[v3] repack: enable bitmaps by default on bare repos \| expand [v3] repack: enable bitmaps by default on bare repos

Eric Wong March 14, 2019, 9:12 a.m. UTC

Jeff King <peff@peff.net> wrote:
> On Wed, Mar 13, 2019 at 01:51:33AM +0000, Eric Wong wrote:
> 
> > But I did find Ævar's forgotten gitperformance doc and thread
> > where the topic was brought up:
> > 
> >   https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/
> 
> One thing that thread reminded me of: we probably also want to default
> pack.writebitmaphashcache on. Otherwise the time saved during the object
> enumeration can backfire when we spend much more time trying delta
> compression (because we don't know the pathnames of any objects).

Interesting...  I found a big improvement with public-inbox
just using bitmaps; but have never tried the hash cache.

> The reason it defaults to off is for on-disk compatibility with JGit.

Right.  Our documentation seems to indicate JGit just warns (but
doesn't fall over), so maybe that can be considered separately.

I've never used JGit myself; and was satisfied enough with
bitmaps alone that I never tried the hash-cache.

> But I have very little experience running without the hash-cache on. We
> added it very early on because we found performance was not great
> without it (I don't know if people running JGit have run into the same
> problem and if not, why not).

As far as serving clones and fetches, public-inbox-init has
always created bare repos with bitmaps enabled, but without
the hash-cache for compatibility concerns.

That's a lot of fetches and clones over the years.

> > +test_expect_success 'incremental repack does not complain' '
> > +	git -C bare.git repack -q 2>repack.err &&
> > +	! test -s repack.err
> > +'
> 
> This last line could use "test_must_be_empty".

Thanks for the review!

---------8<-----------
Subject: [PATCH] repack: enable bitmaps by default on bare repos

A typical use case for bare repos is for serving clones and
fetches to clients.  Enable bitmaps by default on bare repos to
make it easier for admins to host git repos in a performant way.

Signed-off-by: Eric Wong <e@80x24.org>
Helped-by: Jeff King <peff@peff.net>
---
 Documentation/config/repack.txt |  2 +-
 builtin/repack.c                |  5 ++++-
 t/t7700-repack.sh               | 19 ++++++++++++++++++-
 3 files changed, 23 insertions(+), 3 deletions(-)

Jeff King March 14, 2019, 4:02 p.m. UTC | #1

On Thu, Mar 14, 2019 at 09:12:54AM +0000, Eric Wong wrote:

> > The reason it defaults to off is for on-disk compatibility with JGit.
> 
> Right.  Our documentation seems to indicate JGit just warns (but
> doesn't fall over), so maybe that can be considered separately.

I think it was a hard error in the beginning, but they changed it pretty
soon after we added more flags. So it might be reasonable to just enable
it by default (but it wouldn't hurt to double check the behavior).

I tried running t5310 (which does a back-and-forth with jgit) using this
patch:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a154fc29f6..5264bf055a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -97,7 +97,7 @@ static off_t reuse_packfile_offset;
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
 static int write_bitmap_index;
-static uint16_t write_bitmap_options;
+static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
 
 static int exclude_promisor_objects;
 

and it seemed happy.

> As far as serving clones and fetches, public-inbox-init has
> always created bare repos with bitmaps enabled, but without
> the hash-cache for compatibility concerns.
> 
> That's a lot of fetches and clones over the years.

The symptom you'd see is that "Compressing objects" during a fetch takes
a long time, and/or produces lousy deltas. But it matters less if:

  - you keep things packed pretty promptly, because we don't bother
    looking for new deltas between objects in the same pack. Just trying
    to clone public-inbox.org/git, it does look like it's mostly packed
    (based on the object counts) but the compression phase still takes
    10+ seconds.

  - how much the names actually help. In your case, I'd think not at
    all, because being based on hashes, they're effectively random. So
    the pack-objects heuristics to try to find deltas between files of
    similar filenames will not help you.

Regarding the second thing, I wondered if the overall packing of your
public-inbox git repo might not be good, so I did a "git repack -adf
--window=1000" on a clone.  Hundreds of CPU minutes later, I was only
able to shave off about 80MB. I'm not sure if that means you
occasionally do very aggressive repacks, or if there simply isn't all
that much delta opportunity (after all, you're not storing many versions
of one file, but rather tons of different emails; I would expect to find
deltas between various versions of a patch, though).

Anyway...

> ---------8<-----------
> Subject: [PATCH] repack: enable bitmaps by default on bare repos
> 
> A typical use case for bare repos is for serving clones and
> fetches to clients.  Enable bitmaps by default on bare repos to
> make it easier for admins to host git repos in a performant way.
> 
> Signed-off-by: Eric Wong <e@80x24.org>
> Helped-by: Jeff King <peff@peff.net>

This version looks good to me. If we're going to flip the hash-cache
default, I think that should be a separate patch anyway.

-Peff

Ævar Arnfjörð Bjarmason April 9, 2019, 3:10 p.m. UTC | #2

On Thu, Mar 14 2019, Eric Wong wrote:

> Jeff King <peff@peff.net> wrote:
>> On Wed, Mar 13, 2019 at 01:51:33AM +0000, Eric Wong wrote:
>>
>> > But I did find Ævar's forgotten gitperformance doc and thread
>> > where the topic was brought up:
>> >
>> >   https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/
>>
>> One thing that thread reminded me of: we probably also want to default
>> pack.writebitmaphashcache on. Otherwise the time saved during the object
>> enumeration can backfire when we spend much more time trying delta
>> compression (because we don't know the pathnames of any objects).
>
> Interesting...  I found a big improvement with public-inbox
> just using bitmaps; but have never tried the hash cache.
>
>> The reason it defaults to off is for on-disk compatibility with JGit.
>
> Right.  Our documentation seems to indicate JGit just warns (but
> doesn't fall over), so maybe that can be considered separately.
>
> I've never used JGit myself; and was satisfied enough with
> bitmaps alone that I never tried the hash-cache.
>
>> But I have very little experience running without the hash-cache on. We
>> added it very early on because we found performance was not great
>> without it (I don't know if people running JGit have run into the same
>> problem and if not, why not).
>
> As far as serving clones and fetches, public-inbox-init has
> always created bare repos with bitmaps enabled, but without
> the hash-cache for compatibility concerns.
>
> That's a lot of fetches and clones over the years.
>
>> > +test_expect_success 'incremental repack does not complain' '
>> > +	git -C bare.git repack -q 2>repack.err &&
>> > +	! test -s repack.err
>> > +'
>>
>> This last line could use "test_must_be_empty".
>
> Thanks for the review!
>
> ---------8<-----------
> Subject: [PATCH] repack: enable bitmaps by default on bare repos
>
> A typical use case for bare repos is for serving clones and
> fetches to clients.  Enable bitmaps by default on bare repos to
> make it easier for admins to host git repos in a performant way.
>
> Signed-off-by: Eric Wong <e@80x24.org>
> Helped-by: Jeff King <peff@peff.net>
> ---
>  Documentation/config/repack.txt |  2 +-
>  builtin/repack.c                |  5 ++++-
>  t/t7700-repack.sh               | 19 ++++++++++++++++++-
>  3 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
> index a5c37813fd..9c413e177e 100644
> --- a/Documentation/config/repack.txt
> +++ b/Documentation/config/repack.txt
> @@ -24,4 +24,4 @@ repack.writeBitmaps::
>  	packs created for clones and fetches, at the cost of some disk
>  	space and extra time spent on the initial repack.  This has
>  	no effect if multiple packfiles are created.
> -	Defaults to false.
> +	Defaults to true on bare repos, false otherwise.
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 67f8978043..caca113927 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -14,7 +14,7 @@
>
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> -static int write_bitmaps;
> +static int write_bitmaps = -1;
>  static int use_delta_islands;
>  static char *packdir, *packtmp;
>
> @@ -343,6 +343,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>  		die(_("--keep-unreachable and -A are incompatible"));
>
> +	if (write_bitmaps < 0)
> +		write_bitmaps = (pack_everything & ALL_INTO_ONE) &&
> +				 is_bare_repository();
>  	if (pack_kept_objects < 0)
>  		pack_kept_objects = write_bitmaps;
>
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 6162e2a8e6..86d05160a3 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -221,5 +221,22 @@ test_expect_success 'repack --keep-pack' '
>  	)
>  '
>
> -test_done
> +test_expect_success 'bitmaps are created by default in bare repos' '
> +	git clone --bare .git bare.git &&
> +	git -C bare.git repack -ad &&
> +	bitmap=$(ls bare.git/objects/pack/*.bitmap) &&
> +	test_path_is_file "$bitmap"
> +'
> +
> +test_expect_success 'incremental repack does not complain' '
> +	git -C bare.git repack -q 2>repack.err &&
> +	test_must_be_empty repack.err
> +'
>
> +test_expect_success 'bitmaps can be disabled on bare repos' '
> +	git -c repack.writeBitmaps=false -C bare.git repack -ad &&
> +	bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) &&
> +	test -z "$bitmap"
> +'
> +
> +test_done

I've found a case where turning bitmaps on does horrible things for
bitmap "push" performance.

As it turns out it's not related to this patch per-se, because I had a
*.bitmap for other reasons, but replying to this because we'd presumably
get the same thing in the bare repo case once this merges down.

I can't share the repo, but I had a report where just a "git push" of a
topic branch that was 2/58 ahead/behind took ~2 minutes just in
"Enumerating objects", but ~500ms without bitmaps.

Using a horrible "print to stderr"[1] monkeypatch I'd get without
bitmaps and reported by trace2 / ts:

    Apr 09 16:45:15 16:45:15.365817 git.c:433                         | d1 | main                     | cmd_name     |     |           |           |            | pack-objects (push/pack-objects)
    Apr 09 16:45:15 16:45:15.366220 builtin/pack-objects.c:3493       | d1 | main                     | region_enter | r1  |  0.000928 |           | pack-objects | label:enumerate-objects
    Apr 09 16:45:15 16:45:15.366241 builtin/pack-objects.c:3495       | d1 | main                     | region_enter | r1  |  0.000950 |           | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 16:45:15 16:45:15.366384 builtin/pack-objects.c:3498       | d1 | main                     | region_leave | r1  |  0.001091 |  0.000141 | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 16:45:15 16:45:15.366394 builtin/pack-objects.c:3510       | d1 | main                     | region_enter | r1  |  0.001102 |           | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 16:45:15 get obj list 1
    Apr 09 16:45:15 get obj list 2, did 29391 lines
    Apr 09 16:45:15 get obj list 3
    Apr 09 16:45:15 get obj list 4
    Apr 09 16:45:15 get obj list 5
    Apr 09 16:45:15 get obj list 6
    Apr 09 16:45:15 get obj list 7
    Apr 09 16:45:15 get obj list 8
    Apr 09 16:45:15 get obj list 9
    Apr 09 16:45:15 16:45:15.776559 builtin/pack-objects.c:3514       | d1 | main                     | region_leave | r1  |  0.411263 |  0.410161 | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 16:45:15 16:45:15.776577 builtin/pack-objects.c:3517       | d1 | main                     | region_enter | r1  |  0.411285 |           | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 16:45:15 16:45:15.776584 builtin/pack-objects.c:3520       | d1 | main                     | region_leave | r1  |  0.411292 |  0.000007 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 16:45:15 16:45:15.776605 builtin/pack-objects.c:3530       | d1 | main                     | region_leave | r1  |  0.411313 |  0.410385 | pack-objects | label:enumerate-objects
    Apr 09 16:45:15 16:45:15.776609 builtin/pack-objects.c:3542       | d1 | main                     | region_enter | r1  |  0.411318 |           | pack-objects | label:write-pack-file
    Apr 09 16:45:15 16:45:15.794235 builtin/pack-objects.c:3544       | d1 | main                     | region_leave | r1  |  0.428942 |  0.017624 | pack-objects | label:write-pack-file

But with pack.useBitmaps=true:

    Apr 09 16:39:59 16:39:59.139022 git.c:433                         | d1 | main                     | cmd_name     |     |           |           |            | pack-objects (push/pack-objects)
    Apr 09 16:39:59 16:39:59.139398 builtin/pack-objects.c:3493       | d1 | main                     | region_enter | r1  |  0.000869 |           | pack-objects | label:enumerate-objects
    Apr 09 16:39:59 16:39:59.139419 builtin/pack-objects.c:3495       | d1 | main                     | region_enter | r1  |  0.000892 |           | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 16:39:59 16:39:59.139551 builtin/pack-objects.c:3498       | d1 | main                     | region_leave | r1  |  0.001023 |  0.000131 | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 16:39:59 16:39:59.139559 builtin/pack-objects.c:3510       | d1 | main                     | region_enter | r1  |  0.001032 |           | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 16:39:59 get obj list 1
    Apr 09 16:39:59 get obj list 2, did 29392 lines
    Apr 09 16:39:59 get obj list 3
    Apr 09 16:39:59 prepping walk
    Apr 09 16:39:59 opening packed bitmap...
    Apr 09 16:39:59 opening packed bitmap done
    Apr 09 16:39:59 walking 29392 pending
    Apr 09 16:39:59 done walking 29392 pending
    Apr 09 16:39:59 prepare_bitmap_walk 3
    Apr 09 16:39:59 prepare_bitmap_walk 4
    Apr 09 16:39:59 prepare_bitmap_walk 5
    Apr 09 16:40:00 prepare_bitmap_walk 6
    Apr 09 16:40:00 prepare_bitmap_walk 6.1
    Apr 09 16:41:35 prepare_bitmap_walk 6.2
    Apr 09 16:41:35 prepare_bitmap_walk 7
    Apr 09 16:41:52 prepare_bitmap_walk 8
    Apr 09 16:41:52 walking?
    Apr 09 16:41:52 traversing
    Apr 09 16:41:52 traversing done
    Apr 09 16:41:52 16:41:52.091634 builtin/pack-objects.c:3514       | d1 | main                     | region_leave | r1  | 112.953099 | 112.952067 | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 16:41:52 16:41:52.091655 builtin/pack-objects.c:3517       | d1 | main                     | region_enter | r1  | 112.953128 |           | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 16:41:52 16:41:52.091668 builtin/pack-objects.c:3520       | d1 | main                     | region_leave | r1  | 112.953141 |  0.000013 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 16:41:52 16:41:52.091700 builtin/pack-objects.c:3530       | d1 | main                     | region_leave | r1  | 112.953172 | 112.952303 | pack-objects | label:enumerate-objects
    Apr 09 16:41:52 16:41:52.091706 builtin/pack-objects.c:3542       | d1 | main                     | region_enter | r1  | 112.953179 |           | pack-objects | label:write-pack-file
    Apr 09 16:41:52 16:41:52.111966 builtin/pack-objects.c:3544       | d1 | main                     | region_leave | r1  | 112.973438 |  0.020259 | pack-objects | label:write-pack-file

I.e. almost all the time is in get_object_list_from_bitmap() and around
1m30s in just this in pack-bitmap.c:

    haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);

And then another ~20s in:

    wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);

This is with 10 packs and where only the largest (initial clone pack)
had a *.bitmap, but I can also reproduce with a 'git repack -A -d -b',
i.e. with only one pack with a *.bitmap, although that makes it a bit
better for the first bit, and almost completely cuts down on the time
spent in the second phase:

    Apr 09 17:08:37 17:08:37.261507 builtin/pack-objects.c:3493       | d1 | main                     | region_enter | r1  |  0.000922 |           | pack-objects | label:enumerate-objects
    Apr 09 17:08:37 17:08:37.261527 builtin/pack-objects.c:3495       | d1 | main                     | region_enter | r1  |  0.000943 |           | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 17:08:37 17:08:37.261600 builtin/pack-objects.c:3498       | d1 | main                     | region_leave | r1  |  0.001015 |  0.000072 | pack-objects | ..label:enumerate-objects-prepare-packing-data
    Apr 09 17:08:37 17:08:37.261608 builtin/pack-objects.c:3510       | d1 | main                     | region_enter | r1  |  0.001024 |           | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 17:08:37 get obj list 1
    Apr 09 17:08:37 get obj list 2, did 29380 lines
    Apr 09 17:08:37 get obj list 3
    Apr 09 17:08:37 prepping walk
    Apr 09 17:08:37 opening packed bitmap...
    Apr 09 17:08:37 opening packed bitmap done
    Apr 09 17:08:37 walking 29380 pending
    Apr 09 17:08:37 done walking 29380 pending
    Apr 09 17:08:37 prepare_bitmap_walk 3
    Apr 09 17:08:37 prepare_bitmap_walk 4
    Apr 09 17:08:37 prepare_bitmap_walk 5
    Apr 09 17:08:38 prepare_bitmap_walk 6
    Apr 09 17:08:38 prepare_bitmap_walk 6.1
    Apr 09 17:09:07 prepare_bitmap_walk 6.2
    Apr 09 17:09:07 prepare_bitmap_walk 7
    Apr 09 17:09:09 prepare_bitmap_walk 8
    Apr 09 17:09:09 walking?
    Apr 09 17:09:09 traversing
    Apr 09 17:09:09 traversing done
    Apr 09 17:09:09 17:09:09.229185 builtin/pack-objects.c:3514       | d1 | main                     | region_leave | r1  | 31.968595 | 31.967571 | pack-objects | ..label:enumerate-objects-get-obj-list
    Apr 09 17:09:09 17:09:09.229203 builtin/pack-objects.c:3517       | d1 | main                     | region_enter | r1  | 31.968619 |           | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 17:09:09 17:09:09.229214 builtin/pack-objects.c:3520       | d1 | main                     | region_leave | r1  | 31.968630 |  0.000011 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base
    Apr 09 17:09:09 17:09:09.229242 builtin/pack-objects.c:3530       | d1 | main                     | region_leave | r1  | 31.968658 | 31.967736 | pack-objects | label:enumerate-objects
    Apr 09 17:09:09 17:09:09.229265 builtin/pack-objects.c:3542       | d1 | main                     | region_enter | r1  | 31.968681 |           | pack-objects | label:write-pack-file
    Apr 09 17:09:09 17:09:09.251998 builtin/pack-objects.c:3544       | d1 | main                     | region_leave | r1  | 31.991412 |  0.022731 | pack-objects | label:write-pack-file

I don't have time to dig more into this now, just wanted to send these
initial results...

1.



    diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
    index a154fc29f6..8b2af1740e 100644
    --- a/builtin/pack-objects.c
    +++ b/builtin/pack-objects.c
    @@ -3052,2 +3052,3 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
     {
    +	fprintf(stderr, "prepping walk\n");
     	if (!(bitmap_git = prepare_bitmap_walk(revs)))
    @@ -3055,2 +3056,3 @@ static int get_object_list_from_bitmap(struct rev_info *revs)

    +	fprintf(stderr, "walking?\n");
     	if (pack_options_allow_reuse() &&
    @@ -3066,3 +3068,5 @@ static int get_object_list_from_bitmap(struct rev_info *revs)

    +	fprintf(stderr, "traversing\n");
     	traverse_bitmap_commit_list(bitmap_git, &add_object_entry_from_bitmap);
    +	fprintf(stderr, "traversing done\n");
     	return 0;
    @@ -3091,2 +3095,3 @@ static void get_object_list(int ac, const char **av)
     	int save_warning;
    +	int lns = 0;

    @@ -3102,3 +3107,5 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 1\n");
     	while (fgets(line, sizeof(line), stdin) != NULL) {
    +		lns++;
     		int len = strlen(line);
    @@ -3128,4 +3135,6 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 2, did %d lines\n", lns);
     	warn_on_object_refname_ambiguity = save_warning;

    +	fprintf(stderr, "get obj list 3\n");
     	if (use_bitmap_index && !get_object_list_from_bitmap(&revs))
    @@ -3133,2 +3142,3 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 4\n");
     	if (use_delta_islands)
    @@ -3136,2 +3146,3 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 5\n");
     	if (prepare_revision_walk(&revs))
    @@ -3140,2 +3151,3 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 6\n");
     	if (!fn_show_object)
    @@ -3146,2 +3158,3 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 7\n");
     	if (unpack_unreachable_expiration) {
    @@ -3157,2 +3170,3 @@ static void get_object_list(int ac, const char **av)

    +	fprintf(stderr, "get obj list 8\n");
     	if (keep_unreachable)
    @@ -3163,2 +3177,3 @@ static void get_object_list(int ac, const char **av)
     		loosen_unused_packed_objects(&revs);
    +	fprintf(stderr, "get obj list 9\n");

    @@ -3478,3 +3493,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
     			    the_repository);
    +	trace2_region_enter("pack-objects", "enumerate-objects-prepare-packing-data",
    +			    the_repository);
     	prepare_packing_data(the_repository, &to_pack);
    +	trace2_region_leave("pack-objects", "enumerate-objects-prepare-packing-data",
    +			    the_repository);

    @@ -3482,11 +3501,28 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
     		progress_state = start_progress(_("Enumerating objects"), 0);
    -	if (!use_internal_rev_list)
    +	if (!use_internal_rev_list) {
    +		trace2_region_enter("pack-objects", "enumerate-objects-read-stdin",
    +				    the_repository);
     		read_object_list_from_stdin();
    -	else {
    +		trace2_region_leave("pack-objects", "enumerate-objects-read-stdin",
    +				    the_repository);
    +	} else {
    +		trace2_region_enter("pack-objects", "enumerate-objects-get-obj-list",
    +				    the_repository);
     		get_object_list(rp.argc, rp.argv);
     		argv_array_clear(&rp);
    +		trace2_region_leave("pack-objects", "enumerate-objects-get-obj-list",
    +				    the_repository);
     	}
    +	trace2_region_enter("pack-objects", "enumerate-objects-cleanup-preferred-base",
    +			    the_repository);
     	cleanup_preferred_base();
    -	if (include_tag && nr_result)
    +	trace2_region_leave("pack-objects", "enumerate-objects-cleanup-preferred-base",
    +			    the_repository);
    +	if (include_tag && nr_result) {
    +		trace2_region_enter("pack-objects", "enumerate-objects-add-tags",
    +				    the_repository);
     		for_each_ref(add_ref_tag, NULL);
    +		trace2_region_leave("pack-objects", "enumerate-objects-add-tags",
    +				    the_repository);
    +	}
     	stop_progress(&progress_state);
    diff --git a/pack-bitmap.c b/pack-bitmap.c
    index 4695aaf6b4..0ab71597fd 100644
    --- a/pack-bitmap.c
    +++ b/pack-bitmap.c
    @@ -693,5 +693,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     	 * because we may not need to use it */
    +	fprintf(stderr, "opening packed bitmap...\n");
     	if (open_pack_bitmap(revs->repo, bitmap_git) < 0)
     		goto cleanup;
    +	fprintf(stderr, "opening packed bitmap done\n");

    +	fprintf(stderr, "walking %d pending\n", revs->pending.nr);
     	for (i = 0; i < revs->pending.nr; ++i) {
    @@ -720,2 +723,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     	}
    +	fprintf(stderr, "done walking %d pending\n", revs->pending.nr);

    @@ -726,2 +730,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     	 */
    +	fprintf(stderr, "prepare_bitmap_walk 3\n");
     	if (haves && !in_bitmapped_pack(bitmap_git, haves))
    @@ -729,2 +734,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)

    +	fprintf(stderr, "prepare_bitmap_walk 4\n");
     	/* if we don't want anything, we're done here */
    @@ -738,2 +744,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     	 */
    +	fprintf(stderr, "prepare_bitmap_walk 5\n");
     	if (load_pack_bitmap(bitmap_git) < 0)
    @@ -741,2 +748,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)

    +	fprintf(stderr, "prepare_bitmap_walk 6\n");
     	object_array_clear(&revs->pending);
    @@ -745,3 +753,5 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     		revs->ignore_missing_links = 1;
    + 		fprintf(stderr, "prepare_bitmap_walk 6.1\n");
     		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
    + 		fprintf(stderr, "prepare_bitmap_walk 6.2\n");
     		reset_revision_walk();
    @@ -752,4 +762,6 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
     	}
    +	fprintf(stderr, "prepare_bitmap_walk 7\n");

     	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
    +	fprintf(stderr, "prepare_bitmap_walk 8\n");

    diff --git a/revision.c b/revision.c
    index eb8e51bc63..4592d01ee7 100644
    --- a/revision.c
    +++ b/revision.c
    @@ -63,2 +63,4 @@ static void mark_tree_contents_uninteresting(struct repository *r,

    +	fprintf(stderr, "MTCU\n");
    +
     	if (parse_tree_gently(tree, 1) < 0)
    @@ -167,2 +169,4 @@ static void add_children_by_path(struct repository *r,

    +	fprintf(stderr, "ACBP\n");
    +
     	if (!tree)

Jeff King April 10, 2019, 10:57 p.m. UTC | #3

On Tue, Apr 09, 2019 at 05:10:43PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I've found a case where turning bitmaps on does horrible things for
> bitmap "push" performance.
> [...]
> I can't share the repo, but I had a report where just a "git push" of a
> topic branch that was 2/58 ahead/behind took ~2 minutes just in
> "Enumerating objects", but ~500ms without bitmaps.

That's pretty bad, though I'm not terribly surprised. The worst cases
are ones where we have to traverse a lot to fill in the bitmap. So
either there are a lot of commits newer than the bitmapped pack, or
we've done a bad job of picking old commits to bitmap, requiring us to
walk back to find all of the reachable objects (until we find something
that does have a bitmap).

And "bad" here is somewhat subjective. The other side told us some
"have" lines, and those are what we have to walk back from. _Usually_
those would correspond to ref tips, and the bitmap code tries to put a
bitmap at each ref tip. But if you have tons of refs, it can't always do
so.

> I.e. almost all the time is in get_object_list_from_bitmap() and around
> 1m30s in just this in pack-bitmap.c:
> 
>     haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
> 
> And then another ~20s in:
> 
>     wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);

Yeah, that's where I'd expect the time to go. Inside find_objects()
we'll call traverse_commit_list() to do the actual walk.

The first walk for "haves" is looking mostly at old commits. Stuff that
is mentioned in the pack, but for which we have to walk to find the
bitmap.

The second walk for wants is probably mostly looking at new commits.
Things that don't have a bitmap yet, and for which we have to walk
(though it's interesting that it's so much more expensive than the
non-bitmap walk, which has to fully enumerate those trees itself; that
implies that some of the "haves" are recent, too, with respect to the
last pack).

> This is with 10 packs and where only the largest (initial clone pack)
> had a *.bitmap, but I can also reproduce with a 'git repack -A -d -b',
> i.e. with only one pack with a *.bitmap, although that makes it a bit
> better for the first bit, and almost completely cuts down on the time
> spent in the second phase:

Yeah, that makes sense. By repacking you've taken all those new commits
and included them in on-disk bitmaps. So I'd expect the "wants" to get
much shorter, but the "haves" phase staying long means we could do a
better job of picking commits to have on-disk bitmaps.

So two avenues for exploration I think:

  1. I've long suspected that the bitmap selection code isn't ideal.
     Both in terms of what it picks, but also in its runtime (I think it
     ends up walking the same slices of history multiple times in some
     cases).

  2. The answer we get from a bitmap versus a regular traversal are not
     apples-to-apples equivalent. The regular traversal walks down to
     the UNINTERESTING commits, marks the boundaries trees and blobs as
     UNINTERESTING, and then adds in all the interesting trees and blobs
     minus the UNINTERESTING parts. So it can sometimes give the wrong
     answer, claiming something is interesting when it is not.

     Whereas with bitmaps we fill in the trees and blobs as we walk, and
     you get the true answer. But it means we may open up a lot more
     trees than the equivalent traversal would.

     So one thing I've never really experimented with (and indeed, never
     really thought about until writing this email) is that the bitmaps
     could try to do that looser style of traversal, knowing that we
     might err on the side of calling things interesting in a few cases.
     But hopefully spending a lot less time opening trees.

     I'm not even 100% sure what that would look like in code, but just
     thinking about it from a high level, I don't there's a particular
     mathematical reason it couldn't work.

-Peff

Junio C Hamano April 25, 2019, 7:16 a.m. UTC | #4

Jeff King <peff@peff.net> writes:

> On Tue, Apr 09, 2019 at 05:10:43PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I've found a case where turning bitmaps on does horrible things for
>> bitmap "push" performance.
>> [...]
>> I can't share the repo, but I had a report where just a "git push" of a
>> topic branch that was 2/58 ahead/behind took ~2 minutes just in
>> "Enumerating objects", but ~500ms without bitmaps.
>
> That's pretty bad, though I'm not terribly surprised.

I was revisiting the recent "What's cooking" report, and I am not
sure what the current status of the topic is.

I do not get a feel that the current bitmap implementation has been
widely used in repositories that have vastly different access
patterns---it probably has been tried only by those who can afford
the engineering cost to see if the implementation happens to work
well for their workload and some may have chosen to adopt it while
others didn't.  So it may be very well tuned for the former people
but once we merge this topic down, we'll hear from others with quite
different workload, which may lead to us tuning the code to bit
better to their workload while not hurting other existing users,
hopefully.

Or not.

I am somewhat tempted to make things more exciting by merging it to
'next' soonish, but I guess Ævar and you are not quite ready for
that excitement yet, judging from the following (which looks quite
sensible suggestions)?

Thanks.

> Yeah, that makes sense. By repacking you've taken all those new commits
> and included them in on-disk bitmaps. So I'd expect the "wants" to get
> much shorter, but the "haves" phase staying long means we could do a
> better job of picking commits to have on-disk bitmaps.
>
> So two avenues for exploration I think:
>
>   1. I've long suspected that the bitmap selection code isn't ideal.
>      Both in terms of what it picks, but also in its runtime (I think it
>      ends up walking the same slices of history multiple times in some
>      cases).
>
>   2. The answer we get from a bitmap versus a regular traversal are not
>      apples-to-apples equivalent. The regular traversal walks down to
>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>      UNINTERESTING, and then adds in all the interesting trees and blobs
>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>      answer, claiming something is interesting when it is not.
>
>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>      you get the true answer. But it means we may open up a lot more
>      trees than the equivalent traversal would.
>
>      So one thing I've never really experimented with (and indeed, never
>      really thought about until writing this email) is that the bitmaps
>      could try to do that looser style of traversal, knowing that we
>      might err on the side of calling things interesting in a few cases.
>      But hopefully spending a lot less time opening trees.
>
>      I'm not even 100% sure what that would look like in code, but just
>      thinking about it from a high level, I don't there's a particular
>      mathematical reason it couldn't work.
>
> -Peff

Jeff King May 4, 2019, 1:37 a.m. UTC | #5

On Thu, Apr 25, 2019 at 04:16:46PM +0900, Junio C Hamano wrote:

> I was revisiting the recent "What's cooking" report, and I am not
> sure what the current status of the topic is.
> 
> I do not get a feel that the current bitmap implementation has been
> widely used in repositories that have vastly different access
> patterns---it probably has been tried only by those who can afford
> the engineering cost to see if the implementation happens to work
> well for their workload and some may have chosen to adopt it while
> others didn't.  So it may be very well tuned for the former people
> but once we merge this topic down, we'll hear from others with quite
> different workload, which may lead to us tuning the code to bit
> better to their workload while not hurting other existing users,
> hopefully.
> 
> Or not.

Note that Ævar's case was somebody running bitmaps locally and trying to
push, which I think is generally not a good match for bitmaps (even when
they work, they cost more to generate than what you save if you're only
pushing once).

The goal of Eric's patch was that by kicking in for bare repos, we'd
mostly be hitting servers that are serving up fetches. So if by
"workload" you mean that we some people might use bare repos for other
cases, yeah, there's a potential for confusion or regression there.

If you mean that bitmaps might not work for some workloads even when
we're serving a lot of fetches, I won't say that's _not_ true, but my
experience is that they are generally a net win. Both for the smaller
repositories we see on github.com, but also for big, busy ones that our
on-premises customers use.

  Actually, there is one curiosity with Eric's patch that I haven't
  tested. As I've mentioned before, we store "forks" as single
  repositories pointing to a single shared alternates repository. Since
  the bitmap code only handles one .bitmap per invocation, you really
  want just one big one in the shared repo. If "git repack" in the forks
  started generating one, that would be surprising and annoying.

  In practice this is a pretty extreme corner case. And a lot would
  depend on how you're using "repack" in the fork (e.g., a partial
  repack would know that it can't generate bitmaps anyway). I'm pretty
  sure it would not even impact our setup at all, but I can probably
  come up with a devils advocate one where it would.

> I am somewhat tempted to make things more exciting by merging it to
> 'next' soonish, but I guess Ævar and you are not quite ready for
> that excitement yet, judging from the following (which looks quite
> sensible suggestions)?

It's OK with me for this to go to 'next'. Note that the other two
patches from me could actually graduate separately. One is a
straight-out test fix, and the other should always be a win (and does
nothing if you're not already generating bitmaps).

By the way, there were some timing puzzles mentioned in that second
commit. I re-ran them today and everything was what I'd expect. So I
wonder if I just screwed up the timings before. I can re-write that
commit message if it hasn't made it to 'next' yet.

-Peff

Ævar Arnfjörð Bjarmason May 4, 2019, 6:52 a.m. UTC | #6

On Sat, May 04 2019, Jeff King wrote:

> On Thu, Apr 25, 2019 at 04:16:46PM +0900, Junio C Hamano wrote:
>
>> I was revisiting the recent "What's cooking" report, and I am not
>> sure what the current status of the topic is.
>>
>> I do not get a feel that the current bitmap implementation has been
>> widely used in repositories that have vastly different access
>> patterns---it probably has been tried only by those who can afford
>> the engineering cost to see if the implementation happens to work
>> well for their workload and some may have chosen to adopt it while
>> others didn't.  So it may be very well tuned for the former people
>> but once we merge this topic down, we'll hear from others with quite
>> different workload, which may lead to us tuning the code to bit
>> better to their workload while not hurting other existing users,
>> hopefully.
>>
>> Or not.
>
> Note that Ævar's case was somebody running bitmaps locally and trying to
> push, which I think is generally not a good match for bitmaps (even when
> they work, they cost more to generate than what you save if you're only
> pushing once).

Right. It was *not* caused by this "enable bitmaps by default on bare
repos" patch (which I wasn't even running with), but *is* indicative of
a pretty big edge case with enabling bitmaps that *will* happen for some
on such bare repos if we ship the patch.

> The goal of Eric's patch was that by kicking in for bare repos, we'd
> mostly be hitting servers that are serving up fetches. So if by
> "workload" you mean that we some people might use bare repos for other
> cases, yeah, there's a potential for confusion or regression there.

As noted I suspect that's a really rare use-case in practice, and in
reply to Junio's <xmqq1s1qy2ox.fsf@gitster-ct.c.googlers.com> upthread I
think it's fine to just try this. Maybe we'll finally get such use-cases
out of the woodworks by turning it on by default.

As an aside this is the Nth time I notice how crappy that "Enumerating
objects" progress bar is. We do a *lot* of things there, including this
bitmap calculation.

But just splitting it up might result in either no progress (all
individually below 2 seconds), or a lot of noise as you have 20 things
that each take 2 seconds. I wonder if someone's looked at supporting:

    Enumerating Objects (X%) => Calculating bitmaps (Y%)

Where X% is the total progres, and %Y is the sub-progress. I eyeballed
doing this once by "chaining" the progress structs, but there's probably
a less crappy way...


> If you mean that bitmaps might not work for some workloads even when
> we're serving a lot of fetches, I won't say that's _not_ true, but my
> experience is that they are generally a net win. Both for the smaller
> repositories we see on github.com, but also for big, busy ones that our
> on-premises customers use.
>
>   Actually, there is one curiosity with Eric's patch that I haven't
>   tested. As I've mentioned before, we store "forks" as single
>   repositories pointing to a single shared alternates repository. Since
>   the bitmap code only handles one .bitmap per invocation, you really
>   want just one big one in the shared repo. If "git repack" in the forks
>   started generating one, that would be surprising and annoying.
>
>   In practice this is a pretty extreme corner case. And a lot would
>   depend on how you're using "repack" in the fork (e.g., a partial
>   repack would know that it can't generate bitmaps anyway). I'm pretty
>   sure it would not even impact our setup at all, but I can probably
>   come up with a devils advocate one where it would.
>
>> I am somewhat tempted to make things more exciting by merging it to
>> 'next' soonish, but I guess Ævar and you are not quite ready for
>> that excitement yet, judging from the following (which looks quite
>> sensible suggestions)?
>
> It's OK with me for this to go to 'next'. Note that the other two
> patches from me could actually graduate separately. One is a
> straight-out test fix, and the other should always be a win (and does
> nothing if you're not already generating bitmaps).
>
> By the way, there were some timing puzzles mentioned in that second
> commit. I re-ran them today and everything was what I'd expect. So I
> wonder if I just screwed up the timings before. I can re-write that
> commit message if it hasn't made it to 'next' yet.
>
> -Peff

SZEDER Gábor May 4, 2019, 1:23 p.m. UTC | #7

On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote:
> As an aside this is the Nth time I notice how crappy that "Enumerating
> objects" progress bar is.

And don't forget that the commit-graph progress bar still prints
nonsense :)

  https://public-inbox.org/git/87ef6ydds8.fsf@evledraar.gmail.com/

Jeff King May 7, 2019, 7:45 a.m. UTC | #8

On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Note that Ævar's case was somebody running bitmaps locally and trying to
> > push, which I think is generally not a good match for bitmaps (even when
> > they work, they cost more to generate than what you save if you're only
> > pushing once).
> 
> Right. It was *not* caused by this "enable bitmaps by default on bare
> repos" patch (which I wasn't even running with), but *is* indicative of
> a pretty big edge case with enabling bitmaps that *will* happen for some
> on such bare repos if we ship the patch.

Yeah. To clarify my comments a bit: I do think it would be possible to
hit a weird case like this while serving fetches (i.e., as far as I know
there is nothing in what you saw that is inherent to pushes). But I do
think for serving fetches, bitmaps are overall a big net win (based on
my experiences).

So I think it may come down to a tradeoff: enabling this by default
would probably be a net win across the population, but that's little
comfort to the unlucky somebody who may see it as a regression. I'm not
sure which is more important to maintain.

> As an aside this is the Nth time I notice how crappy that "Enumerating
> objects" progress bar is. We do a *lot* of things there, including this
> bitmap calculation.
> 
> But just splitting it up might result in either no progress (all
> individually below 2 seconds), or a lot of noise as you have 20 things
> that each take 2 seconds. I wonder if someone's looked at supporting:
> 
>     Enumerating Objects (X%) => Calculating bitmaps (Y%)
> 
> Where X% is the total progres, and %Y is the sub-progress. I eyeballed
> doing this once by "chaining" the progress structs, but there's probably
> a less crappy way...

I don't think it needs to be split; I think we just need to update the
object count while we're traversing the bitmaps. The problem is that the
progress object is known to pack-objects.c. Without bitmaps, as the
revision machinery walks the graph, our callbacks bump the progress
meter every time we see an object.

With bitmaps, all that walking happens behind the scenes, and the bitmap
code delivers us the final answer. So we pause for a long time, and then
suddenly it shoots forward.

I think we'd want a way to tell the bitmap code to update our progress
meter as it traverses (both single objects, but also taking into account
when it finds a bitmap and then suddenly bumps the value by a large
amount).

-Peff

Ævar Arnfjörð Bjarmason May 7, 2019, 8:12 a.m. UTC | #9

On Tue, May 07 2019, Jeff King wrote:

> On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Note that Ævar's case was somebody running bitmaps locally and trying to
>> > push, which I think is generally not a good match for bitmaps (even when
>> > they work, they cost more to generate than what you save if you're only
>> > pushing once).
>>
>> Right. It was *not* caused by this "enable bitmaps by default on bare
>> repos" patch (which I wasn't even running with), but *is* indicative of
>> a pretty big edge case with enabling bitmaps that *will* happen for some
>> on such bare repos if we ship the patch.
>
> Yeah. To clarify my comments a bit: I do think it would be possible to
> hit a weird case like this while serving fetches (i.e., as far as I know
> there is nothing in what you saw that is inherent to pushes). But I do
> think for serving fetches, bitmaps are overall a big net win (based on
> my experiences).
>
> So I think it may come down to a tradeoff: enabling this by default
> would probably be a net win across the population, but that's little
> comfort to the unlucky somebody who may see it as a regression. I'm not
> sure which is more important to maintain.
>
>> As an aside this is the Nth time I notice how crappy that "Enumerating
>> objects" progress bar is. We do a *lot* of things there, including this
>> bitmap calculation.
>>
>> But just splitting it up might result in either no progress (all
>> individually below 2 seconds), or a lot of noise as you have 20 things
>> that each take 2 seconds. I wonder if someone's looked at supporting:
>>
>>     Enumerating Objects (X%) => Calculating bitmaps (Y%)
>>
>> Where X% is the total progres, and %Y is the sub-progress. I eyeballed
>> doing this once by "chaining" the progress structs, but there's probably
>> a less crappy way...
>
> I don't think it needs to be split; I think we just need to update the
> object count while we're traversing the bitmaps. The problem is that the
> progress object is known to pack-objects.c. Without bitmaps, as the
> revision machinery walks the graph, our callbacks bump the progress
> meter every time we see an object.
>
> With bitmaps, all that walking happens behind the scenes, and the bitmap
> code delivers us the final answer. So we pause for a long time, and then
> suddenly it shoots forward.
>
> I think we'd want a way to tell the bitmap code to update our progress
> meter as it traverses (both single objects, but also taking into account
> when it finds a bitmap and then suddenly bumps the value by a large
> amount).

Not splitting it will fix the progress bar stalling, so it fixes the
problem that the user is wondering if the command is entirely hanging.

But I was hoping to give the user an idea of roughly where we're
spending our time, e.g. so you can see how much the pack.useSparse
setting is helping (or not).

So something where we report sub-progress as we go along, and perhaps
print some brief summary at the end if it took long enough, e.g.:

    Enumerating Objects (X^1%) => Marking trees (Y^1%)
    Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%)

And at the end:

    Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other)

I.e. bringing the whole "nested" trace2 regions full circle with the
progress bar where we could elect to trace/show some of that info, and
then you could turn on some trace2 mode/verbose progress to see more.

Jeff King May 8, 2019, 7:11 a.m. UTC | #10

On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I think we'd want a way to tell the bitmap code to update our progress
> > meter as it traverses (both single objects, but also taking into account
> > when it finds a bitmap and then suddenly bumps the value by a large
> > amount).
> 
> Not splitting it will fix the progress bar stalling, so it fixes the
> problem that the user is wondering if the command is entirely hanging.
> 
> But I was hoping to give the user an idea of roughly where we're
> spending our time, e.g. so you can see how much the pack.useSparse
> setting is helping (or not).

Yeah, I think that's a bigger and more complicated problem. I admit that
my main annoyance is just the stall while we fill in the bitmaps (and
it's easy because the bitmap traversal is the same unit of work as a
regular traversal).

> So something where we report sub-progress as we go along, and perhaps
> print some brief summary at the end if it took long enough, e.g.:
> 
>     Enumerating Objects (X^1%) => Marking trees (Y^1%)
>     Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%)
> 
> And at the end:
> 
>     Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other)
> 
> I.e. bringing the whole "nested" trace2 regions full circle with the
> progress bar where we could elect to trace/show some of that info, and
> then you could turn on some trace2 mode/verbose progress to see more.

I do wonder if this really needs to be part of the progress bar. The
goal of the progress bar is to give the user a sense that work is
happening, and (if possible, but not for "enumerating") an idea of when
it might finish. If the trace code can already do detailed timings, then
shouldn't we just be encouraging people to use that?

-Peff

Derrick Stolee May 8, 2019, 2:20 p.m. UTC | #11

On 5/8/2019 3:11 AM, Jeff King wrote:
> On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>>> I think we'd want a way to tell the bitmap code to update our progress
>>> meter as it traverses (both single objects, but also taking into account
>>> when it finds a bitmap and then suddenly bumps the value by a large
>>> amount).
>>
>> Not splitting it will fix the progress bar stalling, so it fixes the
>> problem that the user is wondering if the command is entirely hanging.
>>
>> But I was hoping to give the user an idea of roughly where we're
>> spending our time, e.g. so you can see how much the pack.useSparse
>> setting is helping (or not).
> 
> Yeah, I think that's a bigger and more complicated problem. I admit that
> my main annoyance is just the stall while we fill in the bitmaps (and
> it's easy because the bitmap traversal is the same unit of work as a
> regular traversal).

The pack.useSparse setting also speeds up a section that is not marked
by progress: that portion usually is walking all UNINTERESTING trees and
the"Enumerating Objects" progress is just for walking the INTERESTING objects.
 
>> So something where we report sub-progress as we go along, and perhaps
>> print some brief summary at the end if it took long enough, e.g.:
>>
>>     Enumerating Objects (X^1%) => Marking trees (Y^1%)
>>     Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%)

I like this idea for splitting the "normal" mechanism, too:

     Enumerating Objects (X^1%) => Marking trees (Y^1%)
     Enumerating Objects (X^2%) => Enumerating objects to pack (Y^2%)

>> I.e. bringing the whole "nested" trace2 regions full circle with the
>> progress bar where we could elect to trace/show some of that info, and
>> then you could turn on some trace2 mode/verbose progress to see more.
> 
> I do wonder if this really needs to be part of the progress bar. The
> goal of the progress bar is to give the user a sense that work is
> happening, and (if possible, but not for "enumerating") an idea of when
> it might finish. If the trace code can already do detailed timings, then
> shouldn't we just be encouraging people to use that?

The problem I've seen (without bitmaps) is that running `git push` can
take a while before _any_ progress is listed.

Good news is: `pack.useSparse` fixed our push problem in the Windows OS
repo. The end-to-end time for `git push` sped up by 7.7x with the change,
and this "blank" time is too fast for users to notice.

Updating the progress could help in cases without pack.useSparse.

Thanks,
-Stolee

Ævar Arnfjörð Bjarmason May 8, 2019, 4:13 p.m. UTC | #12

On Wed, May 08 2019, Jeff King wrote:

> On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I think we'd want a way to tell the bitmap code to update our progress
>> > meter as it traverses (both single objects, but also taking into account
>> > when it finds a bitmap and then suddenly bumps the value by a large
>> > amount).
>>
>> Not splitting it will fix the progress bar stalling, so it fixes the
>> problem that the user is wondering if the command is entirely hanging.
>>
>> But I was hoping to give the user an idea of roughly where we're
>> spending our time, e.g. so you can see how much the pack.useSparse
>> setting is helping (or not).
>
> Yeah, I think that's a bigger and more complicated problem. I admit that
> my main annoyance is just the stall while we fill in the bitmaps (and
> it's easy because the bitmap traversal is the same unit of work as a
> regular traversal).
>
>> So something where we report sub-progress as we go along, and perhaps
>> print some brief summary at the end if it took long enough, e.g.:
>>
>>     Enumerating Objects (X^1%) => Marking trees (Y^1%)
>>     Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%)
>>
>> And at the end:
>>
>>     Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other)
>>
>> I.e. bringing the whole "nested" trace2 regions full circle with the
>> progress bar where we could elect to trace/show some of that info, and
>> then you could turn on some trace2 mode/verbose progress to see more.
>
> I do wonder if this really needs to be part of the progress bar. The
> goal of the progress bar is to give the user a sense that work is
> happening, and (if possible, but not for "enumerating") an idea of when
> it might finish. If the trace code can already do detailed timings, then
> shouldn't we just be encouraging people to use that?

To just show work happening we could save ourselves some horizontal
space and the debates over counting v.s. enumerating with:

     diff --git a/progress.c b/progress.c
     index 0318bdd41b..83336ca391 100644
     --- a/progress.c
     +++ b/progress.c
     @@ -226,3 +226,3 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
             struct progress *progress = xmalloc(sizeof(*progress));
     -       progress->title = title;
     +       progress->title = "Reticulating splines";
             progress->total = total;

:)

Obviously that's silly, but the point is we do show some user messaging
with these now, and e.g. the other day here on-list (can't be bothered
to find the msgid) someone was lamenting that the N progressbars we show
on "push" were too verbose.

So by coalescing some of the existing bars that do one logical operation
(push) in N steps we could be less verbose without losing the "stuff's
happening" part of it, and would see if something odd was going on,
e.g. the "I/O write" part being proportionally slower on this box than
the other, or when they upgrade bitmaps suddenly showing up as >95% of
the time.

The bit I find interesting about tying it into trace2 is that once you
do that the trace logs can contain e.g. min/max/avg/median/percentile
time for doing some operation we can break into N steps same/similar
steps, which might be interesting for performance analysis.

Ævar Arnfjörð Bjarmason May 8, 2019, 8:17 p.m. UTC | #13

On Sat, May 04 2019, SZEDER Gábor wrote:

> On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> As an aside this is the Nth time I notice how crappy that "Enumerating
>> objects" progress bar is.
>
> And don't forget that the commit-graph progress bar still prints
> nonsense :)
>
>   https://public-inbox.org/git/87ef6ydds8.fsf@evledraar.gmail.com/

I forgot for a bit, but then figured I'd get out of Derrick's way with
his more significant commit-graph work (which would have inevitably had
merge conflicts with this patch), and look at the time. Here we are
coming up on rc0.

I'm inclined to just wait until Derrick's refactorings land post-2.22.0
unless a) we need it enough before 2.22.0 to cause him trouble b) Junio
agrees with "a" and would take such a patch to fix this part of the
progress output before 2.22.0.

If people (just you would do) tell me "yes I/we want it" and Junio says
"yeah I'll do that" I'll write this patch in the next couple of days,
otherwise I'll do other stuff.

Junio?

Jeff King May 8, 2019, 10:25 p.m. UTC | #14

On Wed, May 08, 2019 at 06:13:58PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I do wonder if this really needs to be part of the progress bar. The
> > goal of the progress bar is to give the user a sense that work is
> > happening, and (if possible, but not for "enumerating") an idea of when
> > it might finish. If the trace code can already do detailed timings, then
> > shouldn't we just be encouraging people to use that?
> [...]
>      -       progress->title = title;
>      +       progress->title = "Reticulating splines";

Heh, OK, that's a fair reductio ad absurdum of my point. I guess what I
was trying to say is that we don't need to go overboard with accuracy as
long as we're giving a vague sense of the work. Precision measurements
can be handled through a different system.

But I don't mind if you can find a way to do these kind of cascaded
progress meters in a way that is pleasing to the eye and not hard to
handle in the callers of the code.

-Peff

Junio C Hamano May 9, 2019, 4:24 a.m. UTC | #15

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Sat, May 04 2019, SZEDER Gábor wrote:
> ...
>> And don't forget that the commit-graph progress bar still prints
>> nonsense :)
>
> I'm inclined to just wait until Derrick's refactorings land post-2.22.0

Fine by me.  Thanks.

Jeff King May 23, 2019, 11:30 a.m. UTC | #16

On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:

>   2. The answer we get from a bitmap versus a regular traversal are not
>      apples-to-apples equivalent. The regular traversal walks down to
>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>      UNINTERESTING, and then adds in all the interesting trees and blobs
>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>      answer, claiming something is interesting when it is not.
> 
>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>      you get the true answer. But it means we may open up a lot more
>      trees than the equivalent traversal would.
> 
>      So one thing I've never really experimented with (and indeed, never
>      really thought about until writing this email) is that the bitmaps
>      could try to do that looser style of traversal, knowing that we
>      might err on the side of calling things interesting in a few cases.
>      But hopefully spending a lot less time opening trees.
> 
>      I'm not even 100% sure what that would look like in code, but just
>      thinking about it from a high level, I don't there's a particular
>      mathematical reason it couldn't work.

I spent a while thinking and experimenting with this tonight. The result
is the patch below. Ævar, do you still have a copy of the repo that
misbehaved? I'd be curious to see how it fares.

Finding the right trees to explore is a little tricky with bitmaps.  In
a normal traversal, we consider the "edges" to be worth exploring.
I.e., the places where an UNINTERESTING commit is the parent of an
interesting one.

But with bitmaps, we don't have that information in the same way,
because we're trying to avoid walking the commits in the first place. So
e.g., in a history like this:

  A--B--C
      \
       D

Let's imagine we're computing "C..D", and that D has a bitmap on disk
but C does not. When we visit D, we'll see that it has a bitmap, fill in
the results in our cumulative "want" bitmap, and then stop walking its
parents (because we know everything they could tell us already).

Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
traversal, we can't stop walking even though there are only
UNINTERESTING commits left, because we have to fill the complete bitmap,
including A and B (and you can imagine there might be thousands of
ancestors of A, too). The reason is that we skipped adding ancestors of
D when we saw its bitmap, so no still_interesting() cutoff will work
there.

But how do we know that it's worth looking into the tree of "B"? If we
assume we visit commits before their ancestors (because of the commit
timestamp ordering), then we'd see that "B" is in the "want" bitmap
already, which makes it worth visiting its tree.

Unfortunately we'd then continue on to "A", and it would _also_ look
interesting, because it's also in the "want" bitmap. We don't have an
easy way of knowing that "A" is basically already covered by "B", and is
therefore not worth pursuing. In this example, we could pass down a bit
from B to A as we traverse. But you can imagine another interesting
commit branched from A, which _would_ make A's tree worth considering.

Fundamentally, as soon as we load a bitmap and stop traversing, we lose
all information about the graph structure.

So the patch below just looks at every tree that might be worth
exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
_worse_ than what the current code does (because it looks at all the
trees). It appears to behave correctly, at least so far as passing the
test suite. Running p5310 and p5311 against git.git and linux.git, it
seems to perform worse. I'm not quite sure why that is (i.e., if I
screwed up something in the implementation, or if the algorithm is
fundamentally worse).

There are a lot of rough edges in the patch; I was just trying to see if
the idea was even viable. So don't bother reviewing it as a real
proposal for inclusion. If you do read it, I'd recommend just looking at
the post-image, since it's essentially a rewrite and the diff is a bit
messy.

-Peff

---
 pack-bitmap.c | 398 ++++++++++++++++++++++++--------------------------
 1 file changed, 193 insertions(+), 205 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6069b2fe55..4a40f62a38 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -12,6 +12,8 @@
 #include "packfile.h"
 #include "repository.h"
 #include "object-store.h"
+#include "blob.h"
+#include "prio-queue.h"
 
 /*
  * An entry on the bitmap index, representing the bitmap for a given
@@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
 	return bitmap_pos + bitmap_git->pack->num_objects;
 }
 
-struct bitmap_show_data {
-	struct bitmap_index *bitmap_git;
-	struct bitmap *base;
-};
-
-static void show_object(struct object *object, const char *name, void *data_)
-{
-	struct bitmap_show_data *data = data_;
-	int bitmap_pos;
-
-	bitmap_pos = bitmap_position(data->bitmap_git, &object->oid);
-
-	if (bitmap_pos < 0)
-		bitmap_pos = ext_index_add_object(data->bitmap_git, object,
-						  name);
-
-	bitmap_set(data->base, bitmap_pos);
-}
-
-static void show_commit(struct commit *commit, void *data)
-{
-}
-
-static int add_to_include_set(struct bitmap_index *bitmap_git,
-			      struct include_data *data,
-			      const struct object_id *oid,
-			      int bitmap_pos)
-{
-	khiter_t hash_pos;
-
-	if (data->seen && bitmap_get(data->seen, bitmap_pos))
-		return 0;
-
-	if (bitmap_get(data->base, bitmap_pos))
-		return 0;
-
-	hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid);
-	if (hash_pos < kh_end(bitmap_git->bitmaps)) {
-		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, hash_pos);
-		bitmap_or_ewah(data->base, lookup_stored_bitmap(st));
-		return 0;
-	}
-
-	bitmap_set(data->base, bitmap_pos);
-	return 1;
-}
-
-static int should_include(struct commit *commit, void *_data)
-{
-	struct include_data *data = _data;
-	int bitmap_pos;
-
-	bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid);
-	if (bitmap_pos < 0)
-		bitmap_pos = ext_index_add_object(data->bitmap_git,
-						  (struct object *)commit,
-						  NULL);
-
-	if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid,
-				bitmap_pos)) {
-		struct commit_list *parent = commit->parents;
-
-		while (parent) {
-			parent->item->object.flags |= SEEN;
-			parent = parent->next;
-		}
-
-		return 0;
-	}
-
-	return 1;
-}
-
-static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
-				   struct rev_info *revs,
-				   struct object_list *roots,
-				   struct bitmap *seen)
+static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
+				struct bitmap **base, struct commit *commit)
 {
-	struct bitmap *base = NULL;
-	int needs_walk = 0;
-
-	struct object_list *not_mapped = NULL;
-
-	/*
-	 * Go through all the roots for the walk. The ones that have bitmaps
-	 * on the bitmap index will be `or`ed together to form an initial
-	 * global reachability analysis.
-	 *
-	 * The ones without bitmaps in the index will be stored in the
-	 * `not_mapped_list` for further processing.
-	 */
-	while (roots) {
-		struct object *object = roots->item;
-		roots = roots->next;
-
-		if (object->type == OBJ_COMMIT) {
-			khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, object->oid);
-
-			if (pos < kh_end(bitmap_git->bitmaps)) {
-				struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
-				struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
-
-				if (base == NULL)
-					base = ewah_to_bitmap(or_with);
-				else
-					bitmap_or_ewah(base, or_with);
-
-				object->flags |= SEEN;
-				continue;
-			}
-		}
-
-		object_list_insert(object, &not_mapped);
-	}
-
-	/*
-	 * Best case scenario: We found bitmaps for all the roots,
-	 * so the resulting `or` bitmap has the full reachability analysis
-	 */
-	if (not_mapped == NULL)
-		return base;
-
-	roots = not_mapped;
-
-	/*
-	 * Let's iterate through all the roots that don't have bitmaps to
-	 * check if we can determine them to be reachable from the existing
-	 * global bitmap.
-	 *
-	 * If we cannot find them in the existing global bitmap, we'll need
-	 * to push them to an actual walk and run it until we can confirm
-	 * they are reachable
-	 */
-	while (roots) {
-		struct object *object = roots->item;
-		int pos;
-
-		roots = roots->next;
-		pos = bitmap_position(bitmap_git, &object->oid);
-
-		if (pos < 0 || base == NULL || !bitmap_get(base, pos)) {
-			object->flags &= ~UNINTERESTING;
-			add_pending_object(revs, object, "");
-			needs_walk = 1;
-		} else {
-			object->flags |= SEEN;
-		}
-	}
-
-	if (needs_walk) {
-		struct include_data incdata;
-		struct bitmap_show_data show_data;
-
-		if (base == NULL)
-			base = bitmap_new();
-
-		incdata.bitmap_git = bitmap_git;
-		incdata.base = base;
-		incdata.seen = seen;
-
-		revs->include_check = should_include;
-		revs->include_check_data = &incdata;
-
-		if (prepare_revision_walk(revs))
-			die("revision walk setup failed");
+	khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid);
+	if (pos < kh_end(bitmap_git->bitmaps)) {
+		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
+		struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
 
-		show_data.bitmap_git = bitmap_git;
-		show_data.base = base;
+		if (*base == NULL)
+			*base = ewah_to_bitmap(or_with);
+		else
+			bitmap_or_ewah(*base, or_with);
 
-		traverse_commit_list(revs, show_commit, show_object,
-				     &show_data);
+		return 1;
 	}
-
-	return base;
+	return 0;
 }
 
 static void show_extended_objects(struct bitmap_index *bitmap_git,
@@ -665,24 +509,122 @@ static void show_objects_for_type(
 	}
 }
 
-static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
-			     struct object_list *roots)
+static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
+				 struct object *obj,
+				 struct bitmap *bitmap,
+				 struct bitmap *seen,
+				 int ignore_missing);
+
+static void add_tag_to_bitmap(struct bitmap_index *bitmap_git,
+			      struct tag *tag,
+			      struct bitmap *bitmap,
+			      struct bitmap *seen,
+			      int ignore_missing)
+{
+	if (parse_tag(tag) < 0 || !tag->tagged) {
+		if (ignore_missing)
+			return;
+		die("unable to parse tag %s", oid_to_hex(&tag->object.oid));
+	}
+	add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen,
+			     ignore_missing);
+}
+
+static void add_tree_to_bitmap(struct bitmap_index *bitmap_git,
+			       struct tree *tree,
+			       struct bitmap *bitmap,
+			       struct bitmap *seen,
+			       int ignore_missing)
 {
-	while (roots) {
-		struct object *object = roots->item;
-		roots = roots->next;
+	struct tree_desc desc;
+	struct name_entry entry;
 
-		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-			return 1;
+	if (parse_tree(tree) < 0) {
+		if (ignore_missing)
+			return;
+		die("unable to parse tree %s", oid_to_hex(&tree->object.oid));
 	}
 
-	return 0;
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		if (S_ISDIR(entry.mode)) {
+			struct tree *t = lookup_tree(the_repository, &entry.oid);
+			if (!t) {
+				die(_("entry '%s' in tree %s has tree mode, "
+				      "but is not a tree"),
+				    entry.path, oid_to_hex(&tree->object.oid));
+			}
+			add_object_to_bitmap(bitmap_git, &t->object,
+					     bitmap, seen, ignore_missing);
+		} else if (!S_ISGITLINK(entry.mode)) {
+			struct blob *b = lookup_blob(the_repository, &entry.oid);
+			if (!b) {
+				die(_("entry '%s' in tree %s has blob mode, "
+				      "but is not a blob"),
+				    entry.path, oid_to_hex(&tree->object.oid));
+			}
+			add_object_to_bitmap(bitmap_git, &b->object,
+					     bitmap, seen, ignore_missing);
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
+static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
+				 struct object *obj,
+				 struct bitmap *bitmap,
+				 struct bitmap *seen,
+				 int ignore_missing)
+{
+	int pos = bitmap_position(bitmap_git, &obj->oid);
+
+	if (pos < 0)
+		pos = ext_index_add_object(bitmap_git, obj, NULL);
+
+	if (bitmap_get(bitmap, pos))
+		return; /* already there */
+
+	if (seen && bitmap_get(seen, pos))
+		return; /* seen in other bitmap; not worth digging further */
+
+	bitmap_set(bitmap, pos);
+
+	if (obj->type == OBJ_BLOB)
+		return;
+	else if (obj->type == OBJ_TAG)
+		add_tag_to_bitmap(bitmap_git, (struct tag *)obj,
+				  bitmap, seen, ignore_missing);
+	else if (obj->type == OBJ_TREE)
+		add_tree_to_bitmap(bitmap_git, (struct tree *)obj,
+				   bitmap, seen, ignore_missing);
+	else
+		BUG("unexpected object type: %d", obj->type);
+}
+
+static void add_objects_to_bitmap(struct bitmap_index *bitmap_git,
+				  struct object_list **list,
+				  struct bitmap *bitmap,
+				  struct bitmap *seen,
+				  int ignore_missing)
+{
+	while (*list) {
+		struct object_list *cur = *list;
+
+		add_object_to_bitmap(bitmap_git, cur->item,
+				     bitmap, seen, ignore_missing);
+
+		*list = cur->next;
+		free(cur);
+	}
 }
 
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 {
 	unsigned int i;
 
+	struct prio_queue commits = { compare_commits_by_commit_date };
+
 	struct object_list *wants = NULL;
 	struct object_list *haves = NULL;
 
@@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 			object = parse_object_or_die(&tag->tagged->oid, NULL);
 		}
 
-		if (object->flags & UNINTERESTING)
-			object_list_insert(object, &haves);
-		else
-			object_list_insert(object, &wants);
+		if (object->type == OBJ_COMMIT)
+			prio_queue_put(&commits, object);
+		else {
+			if (object->flags & UNINTERESTING)
+				object_list_insert(object, &haves);
+			else
+				object_list_insert(object, &wants);
+		}
 	}
 
-	/*
-	 * if we have a HAVES list, but none of those haves is contained
-	 * in the packfile that has a bitmap, we don't have anything to
-	 * optimize here
-	 */
-	if (haves && !in_bitmapped_pack(bitmap_git, haves))
-		goto cleanup;
-
-	/* if we don't want anything, we're done here */
-	if (!wants)
-		goto cleanup;
-
 	/*
 	 * now we're going to use bitmaps, so load the actual bitmap entries
 	 * from disk. this is the point of no return; after this the rev_list
@@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 
 	object_array_clear(&revs->pending);
 
-	if (haves) {
-		revs->ignore_missing_links = 1;
-		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
-		reset_revision_walk();
-		revs->ignore_missing_links = 0;
+	haves_bitmap = bitmap_new();
+	wants_bitmap = bitmap_new();
 
-		if (haves_bitmap == NULL)
-			BUG("failed to perform bitmap walk");
-	}
+	/*
+	 * First traverse the relevant commits as we would for a normal
+	 * traversal.
+	 */
+	while (commits.nr) {
+		struct commit *commit = prio_queue_get(&commits);
+		struct bitmap **dst_bitmap;
+		int pos;
+		struct commit_list *parent;
+
+		if (commit->object.flags & UNINTERESTING)
+			dst_bitmap = &haves_bitmap;
+		else
+			dst_bitmap = &wants_bitmap;
 
-	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
+		pos = bitmap_position(bitmap_git, &commit->object.oid);
+		if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos))
+			continue; /* already covered */
 
-	if (!wants_bitmap)
-		BUG("failed to perform bitmap walk");
+		if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit))
+			continue; /* skip adding parents, they're covered */
+
+
+		/* otherwise mark ourselves and queue dependent objects */
+		if (pos < 0)
+			pos = ext_index_add_object(bitmap_git, &commit->object, NULL);
+		bitmap_set(*dst_bitmap, pos);
+		if (parse_commit(commit)) {
+			if (commit->object.flags & UNINTERESTING)
+				continue;
+			die("unable to parse commit %s",
+			    oid_to_hex(&commit->object.oid));
+		}
+		if (commit->object.flags & UNINTERESTING) {
+			/*
+			 * If an uninteresting commit is in the "wants" bitmap,
+			 * then our tree is worth exploring. This means we may
+			 * miss some trees in the "haves" that are not
+			 * ancestors of "wants" (or that are, but we missed
+			 * because of out-of-order timestamps).
+			 */
+			if (wants_bitmap && bitmap_get(wants_bitmap, pos))
+				object_list_insert(&get_commit_tree(commit)->object,
+						   &haves);
+		} else {
+			/*
+			 * "wants" must always be explored
+			 */
+			object_list_insert(&get_commit_tree(commit)->object,
+					   &wants);
+		}
+
+		for (parent = commit->parents; parent; parent = parent->next) {
+			if (commit->object.flags & UNINTERESTING)
+				parent->item->object.flags |= UNINTERESTING;
+			prio_queue_put(&commits, parent->item);
+		}
+	}
+
+	/*
+	 * At this point we've processed all of the commits, and queued items
+	 * in "haves" and "wants" that need to be marked.
+	 */
+	add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1);
+	add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 0);
 
 	if (haves_bitmap)
 		bitmap_and_not(wants_bitmap, haves_bitmap);

Derrick Stolee May 23, 2019, 12:53 p.m. UTC | #17

On 5/23/2019 7:30 AM, Jeff King wrote:
> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
> 
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
> 
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

This patch caught my attention, and I think I understand some of the issues
at hand. I'm not well-versed specifically in Git's bitmap implementation.
The fundamental ideas are there, but my perspective is biased by my own
independent bitmap implementation for Azure Repos. What worked there may not
work at all here.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.

This is the "commit frontier" which I bring up again below.

> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
> 
>   A--B--C
>       \
>        D
> 
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not.

(As I read your discussion below, I'm confused. For "C..D", C is a have
and D is a want. We should explore all the haves _first_, then walk the
wants, right?)

> When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).

I may be misreading something, but we would construct _a_ bitmap for C
by walking its reachable objects until hitting a bitmap we know about.
Perhaps A or B have a bitmap to start the construction. If we never
construct a bitmap for C, then we don't know what to remove from the "D"
bitmap to show the difference.

If "C" is not even in the bitmap pack, then we don't use bitmaps for
this calculation because of this condition:

        /*
         * if we have a HAVES list, but none of those haves is contained
         * in the packfile that has a bitmap, we don't have anything to
         * optimize here
         */
        if (haves && !in_bitmapped_pack(bitmap_git, haves))
                goto cleanup;

> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
> 
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
> 
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
> 
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
> 
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).

I'm of the opinion that the old method was better. It followed a very clear
three-step process:

 1. Compute the "haves" bitmap.

 2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap.

 3. Subtract the "haves" bitmap from the "wants" (in case we added bits to
    the wants before they were in the haves).

But there is hope in your idea to improve some cases. As noted, we give up
if all of the haves are not in the bitmapped pack. By starting with a
commit walk, we can find the "commit frontier," which is the set of commits
that are explored by paint_down_to_common(). In this case, the set {B, C, D}.

After walking commits and finding a set of UNINTERESTING commits, we could
look for any UNINTERESTING commits that have bitmaps (or simply are in the
bitmapped pack) and use those to see our "haves" bitmap. So, if B has a
bitmap, but C is too new for the bitmap, then we can still create the haves
based on B and then take a bitmap diff "D minus B".

In fact, using the "merge base set" for the diff reflects what the non-bitmap
algorithm does in this case. It ignores C and only explores B.

I learned a lot looking at both versions of this method side-by-side. Thanks!

-Stolee

Ævar Arnfjörð Bjarmason May 23, 2019, 7:26 p.m. UTC | #18

On Thu, May 23 2019, Jeff King wrote:

> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
>
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
>
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

No, sorry. I think we could make an artificial test to emulate it, which
would be something like:

 * ~1 million commits
 * local clone setup to fetch all branches/tags (default 'git clone')
 * local a bit ahead/behind
 * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that
 * push try to push the local change upstream (or to a topic branch)

I tried briefly to emulate this with git.git with:

    (
        rm -rf /tmp/git /tmp/git.old &&
        git init /tmp/git.old &&
        git clone --bare https://github.com/git/git.git /tmp/git &&
        cd /tmp/git &&
        old_commit=$(git rev-parse v2.20.0^{}) &&
        git rev-list v2.12.0..v2.21.0|parallel 'git branch topic-{} {}' &&
        cd /tmp/git.old &&
        echo /tmp/git/objects >.git/objects/info/alternates &&
        git branch master $old_commit &&
        git reset --hard master &&
        git repack -Adb &&
        rm .git/objects/info/alternates &&
        for c in {1..10}
        do
            >$c &&
            git add $c &&
            git commit -m"bump $c"
        done
    )

But didn't get anywhere, probably because here my topics are all stuff I
have already, and I just have 2x packs.

It would be really cool to have some test tool that could produce
various "shape of repo" states like that. I.e. given an end-state of a
public repo emulate various plausible local client states like that
given some assumptions about how often the local client fetches, what
the GC settings are etc.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.
>
> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
>
>   A--B--C
>       \
>        D
>
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not. When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).
>
> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
>
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
>
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
>
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
>
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).
>
> There are a lot of rough edges in the patch; I was just trying to see if
> the idea was even viable. So don't bother reviewing it as a real
> proposal for inclusion. If you do read it, I'd recommend just looking at
> the post-image, since it's essentially a rewrite and the diff is a bit
> messy.
>
> -Peff
>
> ---
>  pack-bitmap.c | 398 ++++++++++++++++++++++++--------------------------
>  1 file changed, 193 insertions(+), 205 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6069b2fe55..4a40f62a38 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -12,6 +12,8 @@
>  #include "packfile.h"
>  #include "repository.h"
>  #include "object-store.h"
> +#include "blob.h"
> +#include "prio-queue.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
>  	return bitmap_pos + bitmap_git->pack->num_objects;
>  }
>
> -struct bitmap_show_data {
> -	struct bitmap_index *bitmap_git;
> -	struct bitmap *base;
> -};
> -
> -static void show_object(struct object *object, const char *name, void *data_)
> -{
> -	struct bitmap_show_data *data = data_;
> -	int bitmap_pos;
> -
> -	bitmap_pos = bitmap_position(data->bitmap_git, &object->oid);
> -
> -	if (bitmap_pos < 0)
> -		bitmap_pos = ext_index_add_object(data->bitmap_git, object,
> -						  name);
> -
> -	bitmap_set(data->base, bitmap_pos);
> -}
> -
> -static void show_commit(struct commit *commit, void *data)
> -{
> -}
> -
> -static int add_to_include_set(struct bitmap_index *bitmap_git,
> -			      struct include_data *data,
> -			      const struct object_id *oid,
> -			      int bitmap_pos)
> -{
> -	khiter_t hash_pos;
> -
> -	if (data->seen && bitmap_get(data->seen, bitmap_pos))
> -		return 0;
> -
> -	if (bitmap_get(data->base, bitmap_pos))
> -		return 0;
> -
> -	hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid);
> -	if (hash_pos < kh_end(bitmap_git->bitmaps)) {
> -		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, hash_pos);
> -		bitmap_or_ewah(data->base, lookup_stored_bitmap(st));
> -		return 0;
> -	}
> -
> -	bitmap_set(data->base, bitmap_pos);
> -	return 1;
> -}
> -
> -static int should_include(struct commit *commit, void *_data)
> -{
> -	struct include_data *data = _data;
> -	int bitmap_pos;
> -
> -	bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid);
> -	if (bitmap_pos < 0)
> -		bitmap_pos = ext_index_add_object(data->bitmap_git,
> -						  (struct object *)commit,
> -						  NULL);
> -
> -	if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid,
> -				bitmap_pos)) {
> -		struct commit_list *parent = commit->parents;
> -
> -		while (parent) {
> -			parent->item->object.flags |= SEEN;
> -			parent = parent->next;
> -		}
> -
> -		return 0;
> -	}
> -
> -	return 1;
> -}
> -
> -static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
> -				   struct rev_info *revs,
> -				   struct object_list *roots,
> -				   struct bitmap *seen)
> +static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
> +				struct bitmap **base, struct commit *commit)
>  {
> -	struct bitmap *base = NULL;
> -	int needs_walk = 0;
> -
> -	struct object_list *not_mapped = NULL;
> -
> -	/*
> -	 * Go through all the roots for the walk. The ones that have bitmaps
> -	 * on the bitmap index will be `or`ed together to form an initial
> -	 * global reachability analysis.
> -	 *
> -	 * The ones without bitmaps in the index will be stored in the
> -	 * `not_mapped_list` for further processing.
> -	 */
> -	while (roots) {
> -		struct object *object = roots->item;
> -		roots = roots->next;
> -
> -		if (object->type == OBJ_COMMIT) {
> -			khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, object->oid);
> -
> -			if (pos < kh_end(bitmap_git->bitmaps)) {
> -				struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
> -				struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
> -
> -				if (base == NULL)
> -					base = ewah_to_bitmap(or_with);
> -				else
> -					bitmap_or_ewah(base, or_with);
> -
> -				object->flags |= SEEN;
> -				continue;
> -			}
> -		}
> -
> -		object_list_insert(object, &not_mapped);
> -	}
> -
> -	/*
> -	 * Best case scenario: We found bitmaps for all the roots,
> -	 * so the resulting `or` bitmap has the full reachability analysis
> -	 */
> -	if (not_mapped == NULL)
> -		return base;
> -
> -	roots = not_mapped;
> -
> -	/*
> -	 * Let's iterate through all the roots that don't have bitmaps to
> -	 * check if we can determine them to be reachable from the existing
> -	 * global bitmap.
> -	 *
> -	 * If we cannot find them in the existing global bitmap, we'll need
> -	 * to push them to an actual walk and run it until we can confirm
> -	 * they are reachable
> -	 */
> -	while (roots) {
> -		struct object *object = roots->item;
> -		int pos;
> -
> -		roots = roots->next;
> -		pos = bitmap_position(bitmap_git, &object->oid);
> -
> -		if (pos < 0 || base == NULL || !bitmap_get(base, pos)) {
> -			object->flags &= ~UNINTERESTING;
> -			add_pending_object(revs, object, "");
> -			needs_walk = 1;
> -		} else {
> -			object->flags |= SEEN;
> -		}
> -	}
> -
> -	if (needs_walk) {
> -		struct include_data incdata;
> -		struct bitmap_show_data show_data;
> -
> -		if (base == NULL)
> -			base = bitmap_new();
> -
> -		incdata.bitmap_git = bitmap_git;
> -		incdata.base = base;
> -		incdata.seen = seen;
> -
> -		revs->include_check = should_include;
> -		revs->include_check_data = &incdata;
> -
> -		if (prepare_revision_walk(revs))
> -			die("revision walk setup failed");
> +	khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid);
> +	if (pos < kh_end(bitmap_git->bitmaps)) {
> +		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
> +		struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
>
> -		show_data.bitmap_git = bitmap_git;
> -		show_data.base = base;
> +		if (*base == NULL)
> +			*base = ewah_to_bitmap(or_with);
> +		else
> +			bitmap_or_ewah(*base, or_with);
>
> -		traverse_commit_list(revs, show_commit, show_object,
> -				     &show_data);
> +		return 1;
>  	}
> -
> -	return base;
> +	return 0;
>  }
>
>  static void show_extended_objects(struct bitmap_index *bitmap_git,
> @@ -665,24 +509,122 @@ static void show_objects_for_type(
>  	}
>  }
>
> -static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
> -			     struct object_list *roots)
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +				 struct object *obj,
> +				 struct bitmap *bitmap,
> +				 struct bitmap *seen,
> +				 int ignore_missing);
> +
> +static void add_tag_to_bitmap(struct bitmap_index *bitmap_git,
> +			      struct tag *tag,
> +			      struct bitmap *bitmap,
> +			      struct bitmap *seen,
> +			      int ignore_missing)
> +{
> +	if (parse_tag(tag) < 0 || !tag->tagged) {
> +		if (ignore_missing)
> +			return;
> +		die("unable to parse tag %s", oid_to_hex(&tag->object.oid));
> +	}
> +	add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen,
> +			     ignore_missing);
> +}
> +
> +static void add_tree_to_bitmap(struct bitmap_index *bitmap_git,
> +			       struct tree *tree,
> +			       struct bitmap *bitmap,
> +			       struct bitmap *seen,
> +			       int ignore_missing)
>  {
> -	while (roots) {
> -		struct object *object = roots->item;
> -		roots = roots->next;
> +	struct tree_desc desc;
> +	struct name_entry entry;
>
> -		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> -			return 1;
> +	if (parse_tree(tree) < 0) {
> +		if (ignore_missing)
> +			return;
> +		die("unable to parse tree %s", oid_to_hex(&tree->object.oid));
>  	}
>
> -	return 0;
> +	init_tree_desc(&desc, tree->buffer, tree->size);
> +	while (tree_entry(&desc, &entry)) {
> +		if (S_ISDIR(entry.mode)) {
> +			struct tree *t = lookup_tree(the_repository, &entry.oid);
> +			if (!t) {
> +				die(_("entry '%s' in tree %s has tree mode, "
> +				      "but is not a tree"),
> +				    entry.path, oid_to_hex(&tree->object.oid));
> +			}
> +			add_object_to_bitmap(bitmap_git, &t->object,
> +					     bitmap, seen, ignore_missing);
> +		} else if (!S_ISGITLINK(entry.mode)) {
> +			struct blob *b = lookup_blob(the_repository, &entry.oid);
> +			if (!b) {
> +				die(_("entry '%s' in tree %s has blob mode, "
> +				      "but is not a blob"),
> +				    entry.path, oid_to_hex(&tree->object.oid));
> +			}
> +			add_object_to_bitmap(bitmap_git, &b->object,
> +					     bitmap, seen, ignore_missing);
> +		}
> +	}
> +
> +	free_tree_buffer(tree);
> +}
> +
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +				 struct object *obj,
> +				 struct bitmap *bitmap,
> +				 struct bitmap *seen,
> +				 int ignore_missing)
> +{
> +	int pos = bitmap_position(bitmap_git, &obj->oid);
> +
> +	if (pos < 0)
> +		pos = ext_index_add_object(bitmap_git, obj, NULL);
> +
> +	if (bitmap_get(bitmap, pos))
> +		return; /* already there */
> +
> +	if (seen && bitmap_get(seen, pos))
> +		return; /* seen in other bitmap; not worth digging further */
> +
> +	bitmap_set(bitmap, pos);
> +
> +	if (obj->type == OBJ_BLOB)
> +		return;
> +	else if (obj->type == OBJ_TAG)
> +		add_tag_to_bitmap(bitmap_git, (struct tag *)obj,
> +				  bitmap, seen, ignore_missing);
> +	else if (obj->type == OBJ_TREE)
> +		add_tree_to_bitmap(bitmap_git, (struct tree *)obj,
> +				   bitmap, seen, ignore_missing);
> +	else
> +		BUG("unexpected object type: %d", obj->type);
> +}
> +
> +static void add_objects_to_bitmap(struct bitmap_index *bitmap_git,
> +				  struct object_list **list,
> +				  struct bitmap *bitmap,
> +				  struct bitmap *seen,
> +				  int ignore_missing)
> +{
> +	while (*list) {
> +		struct object_list *cur = *list;
> +
> +		add_object_to_bitmap(bitmap_git, cur->item,
> +				     bitmap, seen, ignore_missing);
> +
> +		*list = cur->next;
> +		free(cur);
> +	}
>  }
>
>  struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>  {
>  	unsigned int i;
>
> +	struct prio_queue commits = { compare_commits_by_commit_date };
> +
>  	struct object_list *wants = NULL;
>  	struct object_list *haves = NULL;
>
> @@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>  			object = parse_object_or_die(&tag->tagged->oid, NULL);
>  		}
>
> -		if (object->flags & UNINTERESTING)
> -			object_list_insert(object, &haves);
> -		else
> -			object_list_insert(object, &wants);
> +		if (object->type == OBJ_COMMIT)
> +			prio_queue_put(&commits, object);
> +		else {
> +			if (object->flags & UNINTERESTING)
> +				object_list_insert(object, &haves);
> +			else
> +				object_list_insert(object, &wants);
> +		}
>  	}
>
> -	/*
> -	 * if we have a HAVES list, but none of those haves is contained
> -	 * in the packfile that has a bitmap, we don't have anything to
> -	 * optimize here
> -	 */
> -	if (haves && !in_bitmapped_pack(bitmap_git, haves))
> -		goto cleanup;
> -
> -	/* if we don't want anything, we're done here */
> -	if (!wants)
> -		goto cleanup;
> -
>  	/*
>  	 * now we're going to use bitmaps, so load the actual bitmap entries
>  	 * from disk. this is the point of no return; after this the rev_list
> @@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>
>  	object_array_clear(&revs->pending);
>
> -	if (haves) {
> -		revs->ignore_missing_links = 1;
> -		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
> -		reset_revision_walk();
> -		revs->ignore_missing_links = 0;
> +	haves_bitmap = bitmap_new();
> +	wants_bitmap = bitmap_new();
>
> -		if (haves_bitmap == NULL)
> -			BUG("failed to perform bitmap walk");
> -	}
> +	/*
> +	 * First traverse the relevant commits as we would for a normal
> +	 * traversal.
> +	 */
> +	while (commits.nr) {
> +		struct commit *commit = prio_queue_get(&commits);
> +		struct bitmap **dst_bitmap;
> +		int pos;
> +		struct commit_list *parent;
> +
> +		if (commit->object.flags & UNINTERESTING)
> +			dst_bitmap = &haves_bitmap;
> +		else
> +			dst_bitmap = &wants_bitmap;
>
> -	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
> +		pos = bitmap_position(bitmap_git, &commit->object.oid);
> +		if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos))
> +			continue; /* already covered */
>
> -	if (!wants_bitmap)
> -		BUG("failed to perform bitmap walk");
> +		if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit))
> +			continue; /* skip adding parents, they're covered */
> +
> +
> +		/* otherwise mark ourselves and queue dependent objects */
> +		if (pos < 0)
> +			pos = ext_index_add_object(bitmap_git, &commit->object, NULL);
> +		bitmap_set(*dst_bitmap, pos);
> +		if (parse_commit(commit)) {
> +			if (commit->object.flags & UNINTERESTING)
> +				continue;
> +			die("unable to parse commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +		}
> +		if (commit->object.flags & UNINTERESTING) {
> +			/*
> +			 * If an uninteresting commit is in the "wants" bitmap,
> +			 * then our tree is worth exploring. This means we may
> +			 * miss some trees in the "haves" that are not
> +			 * ancestors of "wants" (or that are, but we missed
> +			 * because of out-of-order timestamps).
> +			 */
> +			if (wants_bitmap && bitmap_get(wants_bitmap, pos))
> +				object_list_insert(&get_commit_tree(commit)->object,
> +						   &haves);
> +		} else {
> +			/*
> +			 * "wants" must always be explored
> +			 */
> +			object_list_insert(&get_commit_tree(commit)->object,
> +					   &wants);
> +		}
> +
> +		for (parent = commit->parents; parent; parent = parent->next) {
> +			if (commit->object.flags & UNINTERESTING)
> +				parent->item->object.flags |= UNINTERESTING;
> +			prio_queue_put(&commits, parent->item);
> +		}
> +	}
> +
> +	/*
> +	 * At this point we've processed all of the commits, and queued items
> +	 * in "haves" and "wants" that need to be marked.
> +	 */
> +	add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1);
> +	add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 0);
>
>  	if (haves_bitmap)
>  		bitmap_and_not(wants_bitmap, haves_bitmap);

Jeff King May 24, 2019, 7:24 a.m. UTC | #19

On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote:

> > I spent a while thinking and experimenting with this tonight. The result
> > is the patch below. Ævar, do you still have a copy of the repo that
> > misbehaved? I'd be curious to see how it fares.
> 
> This patch caught my attention, and I think I understand some of the issues
> at hand. I'm not well-versed specifically in Git's bitmap implementation.
> The fundamental ideas are there, but my perspective is biased by my own
> independent bitmap implementation for Azure Repos. What worked there may not
> work at all here.

Thanks for looking at this. There are a lot of short-comings in the
current bitmap implementation, so if there's a better way to do things,
I'm not opposed to moving to a new format. :)

> > Finding the right trees to explore is a little tricky with bitmaps.  In
> > a normal traversal, we consider the "edges" to be worth exploring.
> > I.e., the places where an UNINTERESTING commit is the parent of an
> > interesting one.
> 
> This is the "commit frontier" which I bring up again below.

Right. I actually had trouble coming up with a succinct way of
describing this, and stole the definition from your recent blog post. ;)

> > But with bitmaps, we don't have that information in the same way,
> > because we're trying to avoid walking the commits in the first place. So
> > e.g., in a history like this:
> > 
> >   A--B--C
> >       \
> >        D
> > 
> > Let's imagine we're computing "C..D", and that D has a bitmap on disk
> > but C does not.
> 
> (As I read your discussion below, I'm confused. For "C..D", C is a have
> and D is a want. We should explore all the haves _first_, then walk the
> wants, right?)

I think I may have confused things by starting my description with a
hypothetical combined want/have walk. To take a step back: the problem
we were discussing is that we spend a lot of time reading trees to fill
in the "have" bitmap, even though most of those objects are unlikely to
be in the "want" in the first place (only the frontier trees are really
useful).

The current code does indeed fill the "have" bitmap first (so that while
walking "want", it can avoid walking down paths whose bits we know we're
going to clear eventually anyway). But we can't know the frontier if we
completely fill the "have" bitmap first. We have to walk both sides
together, looking for the frontier.

> > When we visit D, we'll see that it has a bitmap, fill in
> > the results in our cumulative "want" bitmap, and then stop walking its
> > parents (because we know everything they could tell us already).
> 
> I may be misreading something, but we would construct _a_ bitmap for C
> by walking its reachable objects until hitting a bitmap we know about.
> Perhaps A or B have a bitmap to start the construction. If we never
> construct a bitmap for C, then we don't know what to remove from the "D"
> bitmap to show the difference.

Yes, and that's how the current code works. If we walk back to B and it
has a bitmap, we can stop walking there and just OR in its bitmap, and
look at the trees for any intermediate commits (in this case just C) to
fill in the rest.

The problem is that we don't have a bitmap for every commit. So you can
imagine a history like this:

  A_1 -- A_2 -- ... -- A_1000 -- B -- C
                                   \
				    D

where we have a bitmap _only_ for D. Filling in the accurate and true
bitmap for C requires walking a thousand commits (and their trees!),
when the non-bitmap algorithm would find the frontier at B and call that
good enough.

> If "C" is not even in the bitmap pack, then we don't use bitmaps for
> this calculation because of this condition:
> 
>         /*
>          * if we have a HAVES list, but none of those haves is contained
>          * in the packfile that has a bitmap, we don't have anything to
>          * optimize here
>          */
>         if (haves && !in_bitmapped_pack(bitmap_git, haves))
>                 goto cleanup;

Right, but it may be in the pack but without a bitmap. We don't store a
bitmap for every commit. The idea was to save space, but select some key
commits that let us generally find a bitmap with a small amount of
walking. And most of the time it works fine. But in some cases, we seem
to find pathological cases where we do quite a lot of walking.

As I said earlier in the thread, I suspect our commit selection is not
great. It's mostly some heuristics we threw together in 2013, and I
don't think it was tested very thoroughly. So fixing that may be a
simpler approach.

What I was wondering here was whether we could get an easy fix based on
the same frontier ideas that the non-bitmap walk uses.

> I'm of the opinion that the old method was better. It followed a very clear
> three-step process:
> 
>  1. Compute the "haves" bitmap.
> 
>  2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap.
> 
>  3. Subtract the "haves" bitmap from the "wants" (in case we added bits to
>     the wants before they were in the haves).

Right, this is _way_ simpler. But it necessarily means spending effort
to find the complete "haves", because we don't know which parts are
useful.

> But there is hope in your idea to improve some cases. As noted, we give up
> if all of the haves are not in the bitmapped pack. By starting with a
> commit walk, we can find the "commit frontier," which is the set of commits
> that are explored by paint_down_to_common(). In this case, the set {B, C, D}.
> 
> After walking commits and finding a set of UNINTERESTING commits, we could
> look for any UNINTERESTING commits that have bitmaps (or simply are in the
> bitmapped pack) and use those to see our "haves" bitmap. So, if B has a
> bitmap, but C is too new for the bitmap, then we can still create the haves
> based on B and then take a bitmap diff "D minus B".

But doing that commit walk to find the frontier negates part of the
purpose of using the bitmaps, which is avoiding even walking commits.
Going back to a variant of our example:

  A -- B -- C_1 -- .. -- C_1000
        \
	 D_1 -- .. -- D_1000

If we have a bitmap at C_1000 and D_1000, we don't have to walk any
commits at all. But finding the frontier requires walking 2000 commits.

Is there a way to find it with just bitmaps? You can certainly find the
intersection, but you don't have any idea which of the many shared
commits is the merge base. Of course in this example you don't actually
care about the frontier (you have the full answer immediately). But how
do you decide which way to optimize: try to avoid walking commits to
get a quick answer from bitmaps, or prefer to walk some commits to find
the frontier and avoid opening too many trees?

-Peff

Jeff King May 24, 2019, 7:27 a.m. UTC | #20

On Thu, May 23, 2019 at 09:26:34PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I spent a while thinking and experimenting with this tonight. The result
> > is the patch below. Ævar, do you still have a copy of the repo that
> > misbehaved? I'd be curious to see how it fares.
> 
> No, sorry. I think we could make an artificial test to emulate it, which
> would be something like:
> 
>  * ~1 million commits
>  * local clone setup to fetch all branches/tags (default 'git clone')
>  * local a bit ahead/behind
>  * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that
>  * push try to push the local change upstream (or to a topic branch)
> 
> I tried briefly to emulate this with git.git with:
> [...]
> But didn't get anywhere, probably because here my topics are all stuff I
> have already, and I just have 2x packs.

Yeah, I haven't been able to find a reproduction for this problem at
will. The bitmaps are _supposed_ to be sprinkled around through the
commit graph so that we don't have to walk far. But presumably in your
case that was not so.

I'm not sure what tickles the bitmap-writer to fail so hard. Is it
having too many refs? Weird patterns in the graph? Just a ton of
commits?

-Peff

Ævar Arnfjörð Bjarmason May 24, 2019, 7:55 a.m. UTC | #21

On Fri, May 24 2019, Jeff King wrote:

> On Thu, May 23, 2019 at 09:26:34PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I spent a while thinking and experimenting with this tonight. The result
>> > is the patch below. Ævar, do you still have a copy of the repo that
>> > misbehaved? I'd be curious to see how it fares.
>>
>> No, sorry. I think we could make an artificial test to emulate it, which
>> would be something like:
>>
>>  * ~1 million commits
>>  * local clone setup to fetch all branches/tags (default 'git clone')
>>  * local a bit ahead/behind
>>  * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that
>>  * push try to push the local change upstream (or to a topic branch)
>>
>> I tried briefly to emulate this with git.git with:
>> [...]
>> But didn't get anywhere, probably because here my topics are all stuff I
>> have already, and I just have 2x packs.
>
> Yeah, I haven't been able to find a reproduction for this problem at
> will. The bitmaps are _supposed_ to be sprinkled around through the
> commit graph so that we don't have to walk far. But presumably in your
> case that was not so.
>
> I'm not sure what tickles the bitmap-writer to fail so hard. Is it
> having too many refs? Weird patterns in the graph? Just a ton of
> commits?

Ah, why did only this ancient (big) pack have a bitmap?

The bitmap writer had never failed, this was just a repository where
some automation (on a dev/staging box) cloned a repo, and someone had
once run a manual "repack" to make make a pack with a bitmap.

Then as time passed that pack stayed around, and re-looking at this that
could have only happened because I had gc.bigPackThreshold turned on.

I.e. without that we'd have eventually done a full repack, so the bitmap
would have gone away.

So getting the repo into that state was a series of unlikely events.

I think to the extent that this is an issue we can reproduce in the
future the proper fix for it in lieu of some easy fix in the bitmap code
would be to just teach "gc" to unlink old *.bitmap files if we detect
they're too stale.

I.e. we don't need to deal gracefully with some case where the bitmaps
just cover some tiny part of the graph, we can just teach "gc" to either
update them, or (if we're not currently writing them) unlink them.

That seems to me to be a good idea in general, not just with bitmaps but
also the commit graph. If we're doing a GC and our current settings
aren't such that we'd update those files, shouldn't we just unlink them?

Jeff King May 24, 2019, 8:26 a.m. UTC | #22

On Fri, May 24, 2019 at 09:55:05AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I'm not sure what tickles the bitmap-writer to fail so hard. Is it
> > having too many refs? Weird patterns in the graph? Just a ton of
> > commits?
> 
> Ah, why did only this ancient (big) pack have a bitmap?
> 
> The bitmap writer had never failed, this was just a repository where
> some automation (on a dev/staging box) cloned a repo, and someone had
> once run a manual "repack" to make make a pack with a bitmap.

Just to be clear, by "fail" I didn't mean that the writer failed to
produce. I just meant it had poor commit selection for its bitmap
coverage. But...

> Then as time passed that pack stayed around, and re-looking at this that
> could have only happened because I had gc.bigPackThreshold turned on.
> 
> I.e. without that we'd have eventually done a full repack, so the bitmap
> would have gone away.
> 
> So getting the repo into that state was a series of unlikely events.

Ah, now that's interesting. The issue may have just been that we had a
ton of un-bitmapped commits because they weren't in the bitmapped pack
at all. The logic that Stolee pointed out earlier:

          /*
           * if we have a HAVES list, but none of those haves is contained
           * in the packfile that has a bitmap, we don't have anything to
           * optimize here
           */
          if (haves && !in_bitmapped_pack(bitmap_git, haves))
                  goto cleanup;

is pretty feeble. If you have even _one_ UNINTERESTING tip that's in the
pack, we'll try to use bitmaps. And in your case, you almost certainly
had old tags on both the client and the server there were in the old
bitmapped pack, but then a huge swath of history that had happened since
then. And it was that newer part of the graph that we had to walk to
fill in the bitmap.

So all of this makes pretty good sense to me now. Bitmaps work
incrementally as you add new, un-bitmapped history. But the cost gets
higher and higher the longer you go before repacking and generating new
bitmaps. So your case was very similar to a repo that uses bitmaps but
just hadn't packed in a really long time.

> I think to the extent that this is an issue we can reproduce in the
> future the proper fix for it in lieu of some easy fix in the bitmap code
> would be to just teach "gc" to unlink old *.bitmap files if we detect
> they're too stale.

Yeah. This could happen if you simply accumulated history without ever
running "gc". But in general we can probably assume that "gc" will run
periodically (though there is a real blind spot if you push up a very
huge chunk of history in one pack, since gc counts packs, not objects).

I agree that if gc is deciding to leave a big pack, it should probably
ditch the bitmaps for it.

> I.e. we don't need to deal gracefully with some case where the bitmaps
> just cover some tiny part of the graph, we can just teach "gc" to either
> update them, or (if we're not currently writing them) unlink them.

Unfortunately I don't think we can update them, because all of the
reachable objects need to be in the same pack. So any scheme that isn't
doing a periodic all-into-one repack probably shouldn't be using
bitmaps. I wonder if we need to tweak Eric's bitmaps-by-default logic to
better account for this.

> That seems to me to be a good idea in general, not just with bitmaps but
> also the commit graph. If we're doing a GC and our current settings
> aren't such that we'd update those files, shouldn't we just unlink them?

I don't think the commit graph would suffer from this. It's not tied to
a specific pack, so it can be freely updated on any gc. You still have
the problem when gc does not run. It's possible that auto-gc should have
separate thresholds for different activities (e.g., number of packs
should tell us when to repack objects, number of loose refs should tell
us when to pack refs, number of un-annotated commits should tell us when
to update the commit graph). The trouble is that some of those checks
are non-trivial.

In the long run, I think the plan is for the commit graph to allow cheap
incremental updating, which may make it reasonable to just update it
preemptively after every fetch/push.

-Peff

Ævar Arnfjörð Bjarmason May 24, 2019, 9:01 a.m. UTC | #23

On Fri, May 24 2019, Jeff King wrote:

> On Fri, May 24, 2019 at 09:55:05AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I'm not sure what tickles the bitmap-writer to fail so hard. Is it
>> > having too many refs? Weird patterns in the graph? Just a ton of
>> > commits?
>>
>> Ah, why did only this ancient (big) pack have a bitmap?
>>
>> The bitmap writer had never failed, this was just a repository where
>> some automation (on a dev/staging box) cloned a repo, and someone had
>> once run a manual "repack" to make make a pack with a bitmap.
>
> Just to be clear, by "fail" I didn't mean that the writer failed to
> produce. I just meant it had poor commit selection for its bitmap
> coverage. But...
>
>> Then as time passed that pack stayed around, and re-looking at this that
>> could have only happened because I had gc.bigPackThreshold turned on.
>>
>> I.e. without that we'd have eventually done a full repack, so the bitmap
>> would have gone away.
>>
>> So getting the repo into that state was a series of unlikely events.
>
> Ah, now that's interesting. The issue may have just been that we had a
> ton of un-bitmapped commits because they weren't in the bitmapped pack
> at all. The logic that Stolee pointed out earlier:
>
>           /*
>            * if we have a HAVES list, but none of those haves is contained
>            * in the packfile that has a bitmap, we don't have anything to
>            * optimize here
>            */
>           if (haves && !in_bitmapped_pack(bitmap_git, haves))
>                   goto cleanup;
>
> is pretty feeble. If you have even _one_ UNINTERESTING tip that's in the
> pack, we'll try to use bitmaps. And in your case, you almost certainly
> had old tags on both the client and the server there were in the old
> bitmapped pack, but then a huge swath of history that had happened since
> then. And it was that newer part of the graph that we had to walk to
> fill in the bitmap.
>
> So all of this makes pretty good sense to me now. Bitmaps work
> incrementally as you add new, un-bitmapped history. But the cost gets
> higher and higher the longer you go before repacking and generating new
> bitmaps. So your case was very similar to a repo that uses bitmaps but
> just hadn't packed in a really long time.
>
>> I think to the extent that this is an issue we can reproduce in the
>> future the proper fix for it in lieu of some easy fix in the bitmap code
>> would be to just teach "gc" to unlink old *.bitmap files if we detect
>> they're too stale.
>
> Yeah. This could happen if you simply accumulated history without ever
> running "gc". But in general we can probably assume that "gc" will run
> periodically (though there is a real blind spot if you push up a very
> huge chunk of history in one pack, since gc counts packs, not objects).
>
> I agree that if gc is deciding to leave a big pack, it should probably
> ditch the bitmaps for it.
>
>> I.e. we don't need to deal gracefully with some case where the bitmaps
>> just cover some tiny part of the graph, we can just teach "gc" to either
>> update them, or (if we're not currently writing them) unlink them.
>
> Unfortunately I don't think we can update them, because all of the
> reachable objects need to be in the same pack. So any scheme that isn't
> doing a periodic all-into-one repack probably shouldn't be using
> bitmaps. I wonder if we need to tweak Eric's bitmaps-by-default logic to
> better account for this.

I mean either we'd update them via repacking, or have some heuristic to
do away with them.

>> That seems to me to be a good idea in general, not just with bitmaps but
>> also the commit graph. If we're doing a GC and our current settings
>> aren't such that we'd update those files, shouldn't we just unlink them?
>
> I don't think the commit graph would suffer from this. It's not tied to
> a specific pack, so it can be freely updated on any gc. You still have
> the problem when gc does not run. It's possible that auto-gc should have
> separate thresholds for different activities (e.g., number of packs
> should tell us when to repack objects, number of loose refs should tell
> us when to pack refs, number of un-annotated commits should tell us when
> to update the commit graph). The trouble is that some of those checks
> are non-trivial.
>
> In the long run, I think the plan is for the commit graph to allow cheap
> incremental updating, which may make it reasonable to just update it
> preemptively after every fetch/push.

I don't think it's a performance problem to have an old commit-graph
lying around. But if you turn on the commit-graph, run gc a bunch, then
turn it off in config we'll have it lying around forever, even if you do
subsequent gc's.

So I think we should delete such things on the general principle that
the end-state of a gc's shouldn't be the accumulation of the values of
past configuration options if we can help it.

Maybe that screws over other users who did a "commit-graph write"
without setting gc.writeCommitGraph, but I think the only sane thing to
do is to make "gc" fully 'own' such things if its turned on at all. We
don't worry e.g. that we can't repack some recent pack because a user
might have just manually produced it with handcrafted love seconds
earlier.

Some of what you bring up is "let's do incremental GC". I agree, but
think that can be considered separately from what GC should do *when* it
runs, whether that's a "full" or "partial" run.

SZEDER Gábor May 24, 2019, 9:29 a.m. UTC | #24

On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
> I don't think it's a performance problem to have an old commit-graph
> lying around. But if you turn on the commit-graph, run gc a bunch, then
> turn it off in config we'll have it lying around forever, even if you do
> subsequent gc's.
> 
> So I think we should delete such things on the general principle that
> the end-state of a gc's shouldn't be the accumulation of the values of
> past configuration options if we can help it.
> 
> Maybe that screws over other users who did a "commit-graph write"
> without setting gc.writeCommitGraph, but I think the only sane thing to
> do is to make "gc" fully 'own' such things if its turned on at all.

Note that there is 'core.commitGraph' as well; as long as it's
enabled, no commit-graph files should be deleted.

Derrick Stolee May 24, 2019, 10:33 a.m. UTC | #25

On 5/24/2019 3:24 AM, Jeff King wrote:
> On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote:
> 
>>> I spent a while thinking and experimenting with this tonight. The result
>>> is the patch below. Ævar, do you still have a copy of the repo that
>>> misbehaved? I'd be curious to see how it fares.
>>
>> This patch caught my attention, and I think I understand some of the issues
>> at hand. I'm not well-versed specifically in Git's bitmap implementation.
>> The fundamental ideas are there, but my perspective is biased by my own
>> independent bitmap implementation for Azure Repos. What worked there may not
>> work at all here.
> 
> Thanks for looking at this. There are a lot of short-comings in the
> current bitmap implementation, so if there's a better way to do things,
> I'm not opposed to moving to a new format. :)
> 
>>> Finding the right trees to explore is a little tricky with bitmaps.  In
>>> a normal traversal, we consider the "edges" to be worth exploring.
>>> I.e., the places where an UNINTERESTING commit is the parent of an
>>> interesting one.
>>
>> This is the "commit frontier" which I bring up again below.
> 
> Right. I actually had trouble coming up with a succinct way of
> describing this, and stole the definition from your recent blog post. ;)
> 
>>> But with bitmaps, we don't have that information in the same way,
>>> because we're trying to avoid walking the commits in the first place. So
>>> e.g., in a history like this:
>>>
>>>   A--B--C
>>>       \
>>>        D
>>>
>>> Let's imagine we're computing "C..D", and that D has a bitmap on disk
>>> but C does not.
>>
>> (As I read your discussion below, I'm confused. For "C..D", C is a have
>> and D is a want. We should explore all the haves _first_, then walk the
>> wants, right?)
> 
> I think I may have confused things by starting my description with a
> hypothetical combined want/have walk. To take a step back: the problem
> we were discussing is that we spend a lot of time reading trees to fill
> in the "have" bitmap, even though most of those objects are unlikely to
> be in the "want" in the first place (only the frontier trees are really
> useful).

Thank you for resolving my confusion.

[snip]

> As I said earlier in the thread, I suspect our commit selection is not
> great. It's mostly some heuristics we threw together in 2013, and I
> don't think it was tested very thoroughly. So fixing that may be a
> simpler approach.

It's a hard problem! There are no _sure_ answers here, and what works in
some cases will probably not work in other cases.
 
> What I was wondering here was whether we could get an easy fix based on
> the same frontier ideas that the non-bitmap walk uses.

[snip]
 
> But doing that commit walk to find the frontier negates part of the
> purpose of using the bitmaps, which is avoiding even walking commits.
> Going back to a variant of our example:
> 
>   A -- B -- C_1 -- .. -- C_1000
>         \
> 	 D_1 -- .. -- D_1000
> 
> If we have a bitmap at C_1000 and D_1000, we don't have to walk any
> commits at all. But finding the frontier requires walking 2000 commits.

In my opinion, walking commits is easy (easier with the commit-graph)
and walking trees is hard. We probably _should_ do more commit walking if
it helps avoid walking some trees.
 
> Is there a way to find it with just bitmaps? You can certainly find the
> intersection, but you don't have any idea which of the many shared
> commits is the merge base. Of course in this example you don't actually
> care about the frontier (you have the full answer immediately). But how
> do you decide which way to optimize: try to avoid walking commits to
> get a quick answer from bitmaps, or prefer to walk some commits to find
> the frontier and avoid opening too many trees?

With a new perspective on the problem, I think perhaps trying to solve this
problem with bitmaps is incorrect. If you want to use bitmaps for C..D and
you don't have any bitmaps in the range D..C, then maybe we should just use
the old-fashioned method of walking trees? In your examples above, the
new method would walk trees for the commits in {B, C_i, D_j} while the
bitmap method would walk trees for the commits in {B, C_i, A_k}. I would
expect the list of {A_k} commits to be the largest in most cases.

The approach here would be to do the commit frontier walk, and check for
commits with bitmaps along the way. If we don't find an UNINTERESTING
bitmap, then use the non-bitmap way instead. Perhaps worth a shot.

I'll bring up this code snippet again:

        /*
         * if we have a HAVES list, but none of those haves is contained
         * in the packfile that has a bitmap, we don't have anything to
         * optimize here
         */
        if (haves && !in_bitmapped_pack(bitmap_git, haves))
                goto cleanup;

In addition to this, we can fill the "haves" set with the commits
in D..C (with boundary) and then check if any of those commits have a
precomputed bitmap. If not, goto cleanup.

Thanks,
-Stolee

Ævar Arnfjörð Bjarmason May 24, 2019, 11:17 a.m. UTC | #26

On Fri, May 24 2019, SZEDER Gábor wrote:

> On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> I don't think it's a performance problem to have an old commit-graph
>> lying around. But if you turn on the commit-graph, run gc a bunch, then
>> turn it off in config we'll have it lying around forever, even if you do
>> subsequent gc's.
>>
>> So I think we should delete such things on the general principle that
>> the end-state of a gc's shouldn't be the accumulation of the values of
>> past configuration options if we can help it.
>>
>> Maybe that screws over other users who did a "commit-graph write"
>> without setting gc.writeCommitGraph, but I think the only sane thing to
>> do is to make "gc" fully 'own' such things if its turned on at all.
>
> Note that there is 'core.commitGraph' as well; as long as it's
> enabled, no commit-graph files should be deleted.

Why? If we won't update it or write it if it's not there, why keep it
around?

It means the commit-graph code and anything else (like bitmaps) needs to
deal with stale data for the common and default gc --auto case.

You also can't have e.g. a global core.commitGraph=true config along
with a per-repo gc.writeCommitGraph=true config do what you expect.

Now just because you wanted to write it for some you'll end up keeping
it around forever because you'd also want to optimistically always use
it if it's there.

Note that I'm talking about the *default* gc semantics, they don't have
to cover all advanced use-cases, just be good enough for most, and it's
also important that they're as simple as possible, and don't result in
stuff like "my performance sucks because I turned this config option on
once a year ago for 2 days".

SZEDER Gábor May 24, 2019, 11:41 a.m. UTC | #27

On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, May 24 2019, SZEDER Gábor wrote:
> 
> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> I don't think it's a performance problem to have an old commit-graph
> >> lying around. But if you turn on the commit-graph, run gc a bunch, then
> >> turn it off in config we'll have it lying around forever, even if you do
> >> subsequent gc's.
> >>
> >> So I think we should delete such things on the general principle that
> >> the end-state of a gc's shouldn't be the accumulation of the values of
> >> past configuration options if we can help it.
> >>
> >> Maybe that screws over other users who did a "commit-graph write"
> >> without setting gc.writeCommitGraph, but I think the only sane thing to
> >> do is to make "gc" fully 'own' such things if its turned on at all.
> >
> > Note that there is 'core.commitGraph' as well; as long as it's
> > enabled, no commit-graph files should be deleted.
> 
> Why? If we won't update it or write it if it's not there, why keep it
> around?

To read it, if 'core.commitGraph' says that is should be read.

> It means the commit-graph code and anything else (like bitmaps) needs to
> deal with stale data for the common and default gc --auto case.
> 
> You also can't have e.g. a global core.commitGraph=true config along
> with a per-repo gc.writeCommitGraph=true config do what you expect.
> 
> Now just because you wanted to write it for some you'll end up keeping
> it around forever because you'd also want to optimistically always use
> it if it's there.

This is exactly what I expect it to do.


> Note that I'm talking about the *default* gc semantics, they don't have
> to cover all advanced use-cases, just be good enough for most, and it's
> also important that they're as simple as possible, and don't result in
> stuff like "my performance sucks because I turned this config option on
> once a year ago for 2 days".

Ævar Arnfjörð Bjarmason May 24, 2019, 11:58 a.m. UTC | #28

On Fri, May 24 2019, SZEDER Gábor wrote:

> On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Fri, May 24 2019, SZEDER Gábor wrote:
>>
>> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> I don't think it's a performance problem to have an old commit-graph
>> >> lying around. But if you turn on the commit-graph, run gc a bunch, then
>> >> turn it off in config we'll have it lying around forever, even if you do
>> >> subsequent gc's.
>> >>
>> >> So I think we should delete such things on the general principle that
>> >> the end-state of a gc's shouldn't be the accumulation of the values of
>> >> past configuration options if we can help it.
>> >>
>> >> Maybe that screws over other users who did a "commit-graph write"
>> >> without setting gc.writeCommitGraph, but I think the only sane thing to
>> >> do is to make "gc" fully 'own' such things if its turned on at all.
>> >
>> > Note that there is 'core.commitGraph' as well; as long as it's
>> > enabled, no commit-graph files should be deleted.
>>
>> Why? If we won't update it or write it if it's not there, why keep it
>> around?
>
> To read it, if 'core.commitGraph' says that is should be read.
>
>> It means the commit-graph code and anything else (like bitmaps) needs to
>> deal with stale data for the common and default gc --auto case.
>>
>> You also can't have e.g. a global core.commitGraph=true config along
>> with a per-repo gc.writeCommitGraph=true config do what you expect.
>>
>> Now just because you wanted to write it for some you'll end up keeping
>> it around forever because you'd also want to optimistically always use
>> it if it's there.
>
> This is exactly what I expect it to do.

Do you also expect base packs with an associated bitmap to have an
implicit *.keep flag under gc with pack.writeBitmaps=false and
pack.useBitmaps=true?

>> Note that I'm talking about the *default* gc semantics, they don't have
>> to cover all advanced use-cases, just be good enough for most, and it's
>> also important that they're as simple as possible, and don't result in
>> stuff like "my performance sucks because I turned this config option on
>> once a year ago for 2 days".

SZEDER Gábor May 24, 2019, 12:34 p.m. UTC | #29

On Fri, May 24, 2019 at 01:58:03PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, May 24 2019, SZEDER Gábor wrote:
> 
> > On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Fri, May 24 2019, SZEDER Gábor wrote:
> >>
> >> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> >> I don't think it's a performance problem to have an old commit-graph
> >> >> lying around. But if you turn on the commit-graph, run gc a bunch, then
> >> >> turn it off in config we'll have it lying around forever, even if you do
> >> >> subsequent gc's.
> >> >>
> >> >> So I think we should delete such things on the general principle that
> >> >> the end-state of a gc's shouldn't be the accumulation of the values of
> >> >> past configuration options if we can help it.
> >> >>
> >> >> Maybe that screws over other users who did a "commit-graph write"
> >> >> without setting gc.writeCommitGraph, but I think the only sane thing to
> >> >> do is to make "gc" fully 'own' such things if its turned on at all.
> >> >
> >> > Note that there is 'core.commitGraph' as well; as long as it's
> >> > enabled, no commit-graph files should be deleted.
> >>
> >> Why? If we won't update it or write it if it's not there, why keep it
> >> around?
> >
> > To read it, if 'core.commitGraph' says that is should be read.
> >
> >> It means the commit-graph code and anything else (like bitmaps) needs to
> >> deal with stale data for the common and default gc --auto case.
> >>
> >> You also can't have e.g. a global core.commitGraph=true config along
> >> with a per-repo gc.writeCommitGraph=true config do what you expect.
> >>
> >> Now just because you wanted to write it for some you'll end up keeping
> >> it around forever because you'd also want to optimistically always use
> >> it if it's there.
> >
> > This is exactly what I expect it to do.
> 
> Do you also expect base packs with an associated bitmap to have an
> implicit *.keep flag under gc with pack.writeBitmaps=false and
> pack.useBitmaps=true?

I don't understand what an "implicit *.keep flag" is.  However, since
a reachability bitmap is always associated with a pack, but the
commit-graph is not, I don't think this is a valid comparison.

> >> Note that I'm talking about the *default* gc semantics, they don't have
> >> to cover all advanced use-cases, just be good enough for most, and it's
> >> also important that they're as simple as possible, and don't result in
> >> stuff like "my performance sucks because I turned this config option on
> >> once a year ago for 2 days".

Ævar Arnfjörð Bjarmason May 24, 2019, 1:41 p.m. UTC | #30

On Fri, May 24 2019, SZEDER Gábor wrote:

> On Fri, May 24, 2019 at 01:58:03PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Fri, May 24 2019, SZEDER Gábor wrote:
>>
>> > On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >>
>> >> On Fri, May 24 2019, SZEDER Gábor wrote:
>> >>
>> >> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> >> I don't think it's a performance problem to have an old commit-graph
>> >> >> lying around. But if you turn on the commit-graph, run gc a bunch, then
>> >> >> turn it off in config we'll have it lying around forever, even if you do
>> >> >> subsequent gc's.
>> >> >>
>> >> >> So I think we should delete such things on the general principle that
>> >> >> the end-state of a gc's shouldn't be the accumulation of the values of
>> >> >> past configuration options if we can help it.
>> >> >>
>> >> >> Maybe that screws over other users who did a "commit-graph write"
>> >> >> without setting gc.writeCommitGraph, but I think the only sane thing to
>> >> >> do is to make "gc" fully 'own' such things if its turned on at all.
>> >> >
>> >> > Note that there is 'core.commitGraph' as well; as long as it's
>> >> > enabled, no commit-graph files should be deleted.
>> >>
>> >> Why? If we won't update it or write it if it's not there, why keep it
>> >> around?
>> >
>> > To read it, if 'core.commitGraph' says that is should be read.
>> >
>> >> It means the commit-graph code and anything else (like bitmaps) needs to
>> >> deal with stale data for the common and default gc --auto case.
>> >>
>> >> You also can't have e.g. a global core.commitGraph=true config along
>> >> with a per-repo gc.writeCommitGraph=true config do what you expect.
>> >>
>> >> Now just because you wanted to write it for some you'll end up keeping
>> >> it around forever because you'd also want to optimistically always use
>> >> it if it's there.
>> >
>> > This is exactly what I expect it to do.
>>
>> Do you also expect base packs with an associated bitmap to have an
>> implicit *.keep flag under gc with pack.writeBitmaps=false and
>> pack.useBitmaps=true?
>
> I don't understand what an "implicit *.keep flag" is[...]

A .keep means we keep the pack, and e.g. gc.bigPackThreshold is
effectively an implicit *.keep flag on a pack matching some critera,
which in this case caused this issue of a stale *.bitmap file (since the
pack wasn't touched, neither was the bitmap).

> [...]However, since a reachability bitmap is always associated with a
> pack, but the commit-graph is not, I don't think this is a valid
> comparison.

I don't either, I'm just trying to understand where you're coming from,
and I still don't.

That bitmaps are associated with specific packs and the commit graph
isn't is an internal implementation detail. Users who care about that
distinction either don't use "git gc" or would be willing to tweak its
settings.

For the rest I think removing existing side-indexes is a good default
for the practical reasons mentioned upthread.

>> >> Note that I'm talking about the *default* gc semantics, they don't have
>> >> to cover all advanced use-cases, just be good enough for most, and it's
>> >> also important that they're as simple as possible, and don't result in
>> >> stuff like "my performance sucks because I turned this config option on
>> >> once a year ago for 2 days".

[v3] repack: enable bitmaps by default on bare repos

Commit Message

Comments

Patch