[0/7] rev-parse: implement object type filter

Message-ID: <cover.1614600555.git.ps@pks.im>
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: Christian Couder <christian.couder@gmail.com>
Date: Mon, 1 Mar 2021 13:20:26 +0100
Subject: [PATCH 0/7] rev-parse: implement object type filter

Patrick Steinhardt March 1, 2021, 12:20 p.m. UTC

Hi,

I recently had a use case where I needed to retrieve all blobs introduced
between two versions that are smaller than 200 bytes, in order to find
all potential candidates for LFS pointers. This is currently done with
`git rev-list --objects --filter=blob:limit=200 <newrev> ^<oldrev>`, but
this is inefficient: the resulting list is far too long, as it may also
include tags, commits and trees.

To be able to more efficiently answer this query, I've implemented
multiple things:

- A new object type filter `--filter=object:type=<type>` for
  git-rev-list(1), which is implemented both for normal graph walks and
  for the packfile bitmap index.

- Given that above usecase requires two filters (the object type
  and blob size filters), bitmap filters were extended to support
  combined filters.

- git-rev-list(1) doesn't filter user-provided objects and always prints
  them. I don't want the listed commits though and only their referenced
  potential LFS blobs. So I've added a new flag `--filter-provided`
  which marks all provided objects as not-user-provided such that they
  get filtered the same as all the other objects.

Altogether, this ends up with the following queries, both of which have
been executed in a well-packed linux.git repository:

    # Previous query which uses object names as a heuristic to filter
    # non-blob objects, which bars us from using bitmap indices because
    # they cannot print paths.
    $ time git rev-list --objects --filter=blob:limit=200 \
        --object-names --all | sed -r '/^.{,41}$/d' | wc -l
    4502300

    real 1m23.872s
    user 1m30.076s
    sys  0m6.002s

    # New query.
    $ time git rev-list --objects --filter-provided \
        --filter=object:type=blob --filter=blob:limit=200 \
        --use-bitmap-index --all | wc -l
    22585

    real 0m19.216s
    user 0m16.768s
    sys  0m2.450s

So with the new optimized query, we can both significantly reduce the
list of candidate LFS pointers and execution time.

Patrick

Patrick Steinhardt (7):
  revision: mark commit parents as NOT_USER_GIVEN
  list-objects: move tag processing into its own function
  list-objects: support filtering by tag and commit
  list-objects: implement object type filter
  pack-bitmap: implement object type filter
  pack-bitmap: implement combined filter
  rev-list: allow filtering of provided items

 Documentation/rev-list-options.txt  |   3 +
 builtin/rev-list.c                  |  14 ++++
 list-objects-filter-options.c       |  14 ++++
 list-objects-filter-options.h       |   8 ++
 list-objects-filter.c               | 116 ++++++++++++++++++++++++++++
 list-objects-filter.h               |   2 +
 list-objects.c                      |  32 +++++++-
 pack-bitmap.c                       |  71 +++++++++++++++--
 revision.c                          |   4 +-
 revision.h                          |   3 -
 t/t6112-rev-list-filters-objects.sh |  76 ++++++++++++++++++
 t/t6113-rev-list-bitmap-filters.sh  |  54 ++++++++++++-
 12 files changed, 380 insertions(+), 17 deletions(-)

Jeff King March 10, 2021, 9:39 p.m. UTC | #1

On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:

> Altogether, this ends up with the following queries, both of which have
> been executed in a well-packed linux.git repository:
> 
>     # Previous query which uses object names as a heuristic to filter
>     # non-blob objects, which bars us from using bitmap indices because
>     # they cannot print paths.
>     $ time git rev-list --objects --filter=blob:limit=200 \
>         --object-names --all | sed -r '/^.{,41}$/d' | wc -l
>     4502300
> 
>     real 1m23.872s
>     user 1m30.076s
>     sys  0m6.002s
> 
>     # New query.
>     $ time git rev-list --objects --filter-provided \
>         --filter=object:type=blob --filter=blob:limit=200 \
>         --use-bitmap-index --all | wc -l
>     22585
> 
>     real 0m19.216s
>     user 0m16.768s
>     sys  0m2.450s

Those produce very different answers. I guess because in the first one,
you still have a bunch of tree objects, too. You'd do much better to get
the actual types from cat-file, and filter on that. That also lets you
use bitmaps for the traversal portion. E.g.:

  $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
         git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
         perl -lne 'print $1 if /^blob (.*)/' | wc -l
  14966
  
  real	0m6.248s
  user	0m7.810s
  sys	0m0.440s

which is faster than what you showed above (this is on linux.git, but my
result is different; maybe you have more refs than me?). But we should
be able to do better purely internally, so I suspect my computer is just
faster (or maybe your extra refs just aren't well-covered by bitmaps).
Running with your patches I get:

  $ time git rev-list --objects --use-bitmap-index --all \
             --filter-provided --filter=object:type=blob \
             --filter=blob:limit=200 | wc -l
  16339

  real	0m1.309s
  user	0m1.234s
  sys	0m0.079s

which is indeed faster. It's quite curious that the answer is not the
same, though! I think yours has some bugs. If I sort and diff the
results, I see some commits mentioned in the output. Perhaps this is
--filter-provided not working, as they all seem to be ref tips.

> To be able to more efficiently answer this query, I've implemented
> multiple things:
> 
> - A new object type filter `--filter=object:type=<type>` for
>   git-rev-list(1), which is implemented both for normal graph walks and
>   for the packfile bitmap index.
> 
> - Given that above usecase requires two filters (the object type
>   and blob size filters), bitmap filters were extended to support
>   combined filters.

That's probably reasonable, especially because it lets us use bitmaps. I
do have a dream that we'll eventually be able to support more extensive
formatting via log/rev-list, which would allow:

  git rev-list --use-bitmap-index --objects --all \
               --format='%(objecttype) %(objectname)' |
  perl -lne 'print $1 if /^blob (.*)/'

That should be faster than the separate cat-file (which has to re-lookup
each object, in addition to the extra pipe overhead), but I expect the
--filter solution should always be faster still, as it can very quickly
eliminate the majority of the objects at the bitmap level.

> - git-rev-list(1) doesn't filter user-provided objects and always prints
>   them. I don't want the listed commits though and only their referenced
>   potential LFS blobs. So I've added a new flag `--filter-provided`
>   which marks all provided objects as not-user-provided such that they
>   get filtered the same as all the other objects.

Yeah, this "user-provided" behavior was quite a surprise to me when I
started implementing the bitmap versions of the existing filters. It's
nice to have the option to specify which you want.

-Peff

Taylor Blau March 10, 2021, 9:58 p.m. UTC | #2

On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:
> - A new object type filter `--filter=object:type=<type>` for
>   git-rev-list(1), which is implemented both for normal graph walks and
>   for the packfile bitmap index.

I understand what you're looking for here, but I worry that '--filter'
might be too leaky of an abstraction.

I was a little surprised to learn that you can clone a repository with
--filter=object:type=tree (excluding commits), but it does work. I'm
fine reusing a lot of the object filtering code if it makes this an
easier task, but I think it may be worthwhile to hide this new kind of
filter from upload-pack.

> - Given that above usecase requires two filters (the object type
>   and blob size filters), bitmap filters were extended to support
>   combined filters.

Nice. We didn't do this since the only previously supported filters were
blob:none and tree:0 (the latter implying the former), so there was no
need.

Thanks,
Taylor

Jeff King March 10, 2021, 10:19 p.m. UTC | #3

On Wed, Mar 10, 2021 at 04:58:16PM -0500, Taylor Blau wrote:

> On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:
> > - A new object type filter `--filter=object:type=<type>` for
> >   git-rev-list(1), which is implemented both for normal graph walks and
> >   for the packfile bitmap index.
> 
> I understand what you're looking for here, but I worry that '--filter'
> might be too leaky of an abstraction.
> 
> I was a little surprised to learn that you can clone a repository with
> --filter=object:type=tree (excluding commits), but it does work. I'm
> fine reusing a lot of the object filtering code if it makes this an
> easier task, but I think it may be worthwhile to hide this new kind of
> filter from upload-pack.

I had a similar thought, but wouldn't the existing uploadpackfilter
config take care of this?

I guess the catch-all "allow" option defaults to "true", so we'd support
any new filters that are added. Which seems like a poor choice in
general, but flipping it would mean that servers have to update their
config.

I do wonder if it's that bad for clients to be able to specify something
like this, though. Even though there's not that much use for it with a
regular partial clone, it could conceivably be used for some special cases.
I do think it would be more useful if you could OR together multiple
types. Asking for "commits|tags|trees" is really the same as the already
useful "blob:none". And "commits|tags" is the same as tree:depth=0.

-Peff

Patrick Steinhardt March 11, 2021, 2:38 p.m. UTC | #4

On Wed, Mar 10, 2021 at 04:39:22PM -0500, Jeff King wrote:
> On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:
> 
> > Altogether, this ends up with the following queries, both of which have
> > been executed in a well-packed linux.git repository:
> > 
> >     # Previous query which uses object names as a heuristic to filter
> >     # non-blob objects, which bars us from using bitmap indices because
> >     # they cannot print paths.
> >     $ time git rev-list --objects --filter=blob:limit=200 \
> >         --object-names --all | sed -r '/^.{,41}$/d' | wc -l
> >     4502300
> > 
> >     real 1m23.872s
> >     user 1m30.076s
> >     sys  0m6.002s
> > 
> >     # New query.
> >     $ time git rev-list --objects --filter-provided \
> >         --filter=object:type=blob --filter=blob:limit=200 \
> >         --use-bitmap-index --all | wc -l
> >     22585
> > 
> >     real 0m19.216s
> >     user 0m16.768s
> >     sys  0m2.450s
> 
> Those produce very different answers. I guess because in the first one,
> you still have a bunch of tree objects, too. You'd do much better to get
> the actual types from cat-file, and filter on that. That also lets you
> use bitmaps for the traversal portion. E.g.:

They do provide different answers, and you're right that `--batch-check`
would have helped to filter by type. Your idea doesn't really work in my
usecase though to identify LFS pointers, at least not without additional
tooling on top of what you've provided. There'd at least need to be two
git-cat-file(1) processes: one to do the `--batch-check` thing to
actually filter by object type, and one to then read the actual LFS
pointer candidates from disk in order to see whether they are LFS
pointers or not.

Actually, we currently are doing something similar to that at GitLab: we
list all potential candidates via git-rev-list(1), write the output into
`git-cat-file --batch-check`, and anything that is a blob then gets
forwarded into `git-cat-file --batch`.
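
For illustration, the two filtering steps of such a pipeline could be
sketched in shell as follows. This is only a sketch under stated
assumptions, not GitLab's actual implementation: the helper names
`only_blobs` and `is_lfs_pointer` are made up for this example, and the
pointer check looks only at the version line that Git LFS (spec v1)
pointer files start with.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither name is part of git.

# Read "<type> <oid>" lines, as printed by
#   git cat-file --batch-check='%(objecttype) %(objectname)'
# and print the object IDs of blobs only.
only_blobs() {
    awk '$1 == "blob" { print $2 }'
}

# Read one blob's content on stdin and succeed if its first line is the
# version line of a Git LFS (spec v1) pointer file.
is_lfs_pointer() {
    first_line=
    IFS= read -r first_line || :
    [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Possible use (spawns one cat-file per candidate, so it is slower than
# feeding all candidates to a single `git cat-file --batch`):
#
#   git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
#   git cat-file --batch-check='%(objecttype) %(objectname)' |
#   only_blobs |
#   while read -r oid; do
#       git cat-file blob "$oid" | is_lfs_pointer && echo "$oid"
#   done
```

The helpers only read stdin, so they can be tried on hand-written input
without a repository.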

>   $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
>          git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
> 	 perl -lne 'print $1 if /^blob (.*)/' | wc -l
>   14966
>   
>   real	0m6.248s
>   user	0m7.810s
>   sys	0m0.440s
> 
> which is faster than what you showed above (this is on linux.git, but my
> result is different; maybe you have more refs than me?). But we should
> be able to do better purely internally, so I suspect my computer is just
> faster (or maybe your extra refs just aren't well-covered by bitmaps).
> Running with your patches I get:

I've got quite a beefy machine with a Ryzen 3 5800X, and I did do a `git
repack -Adfb` right before doing benchmarks. I do have the stable kernel
repository added though, which accounts for quite a lot of additional
references (3938) and objects (9.3M).

>   $ time git rev-list --objects --use-bitmap-index --all \
>              --filter-provided --filter=object:type=blob \
> 	     --filter=blob:limit=200 | wc -l
>   16339
> 
>   real	0m1.309s
>   user	0m1.234s
>   sys	0m0.079s
> 
> which is indeed faster. It's quite curious that the answer is not the
> same, though! I think yours has some bugs. If I sort and diff the
> results, I see some commits mentioned in the output. Perhaps this is
> --filter-provided not working, as they all seem to be ref tips.

I noticed it, too, and couldn't yet find an answer why that is.
Honestly, I found the NOT_USER_GIVEN flag quite confusing and I'm not at
all sure whether I've got all cases covered correctly. The previous way
this was handled (`USER_GIVEN` instead of `NOT_USER_GIVEN`) would've
been easier to figure out for this specific usecase. But I guess it was
converted due to specific reasons.

I'll invest some more time to figure out what's happening here.

> > To be able to more efficiently answer this query, I've implemented
> > multiple things:
> > 
> > - A new object type filter `--filter=object:type=<type>` for
> >   git-rev-list(1), which is implemented both for normal graph walks and
> >   for the packfile bitmap index.
> > 
> > - Given that above usecase requires two filters (the object type
> >   and blob size filters), bitmap filters were extended to support
> >   combined filters.
> 
> That's probably reasonable, especially because it lets us use bitmaps. I
> do have a dream that we'll eventually be able to support more extensive
> formatting via log/rev-list, which would allow:
> 
>   git rev-list --use-bitmap-index --objects --all \
>                --format='%(objecttype) %(objectname)' |
>   perl -lne 'print $1 if /^blob (.*)/'
> 
> That should be faster than the separate cat-file (which has to re-lookup
> each object, in addition to the extra pipe overhead), but I expect the
> --filter solution should always be faster still, as it can very quickly
> eliminate the majority of the objects at the bitmap level.

That'd be nice, even though it wouldn't help in my particular usecase: I
need to read each candidate blob to see whether it's an LFS pointer or
not anyway.

> > - git-rev-list(1) doesn't filter user-provided objects and always prints
> >   them. I don't want the listed commits though and only their referenced
> >   potential LFS blobs. So I've added a new flag `--filter-provided`
> >   which marks all provided objects as not-user-provided such that they
> >   get filtered the same as all the other objects.
> 
> Yeah, this "user-provided" behavior was quite a surprise to me when I
> started implementing the bitmap versions of the existing filters. It's
> nice to have the option to specify which you want.
> 
> -Peff

Patrick

Patrick Steinhardt March 11, 2021, 2:43 p.m. UTC | #5

On Wed, Mar 10, 2021 at 05:19:44PM -0500, Jeff King wrote:
> On Wed, Mar 10, 2021 at 04:58:16PM -0500, Taylor Blau wrote:
> 
> > On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:
> > > - A new object type filter `--filter=object:type=<type>` for
> > >   git-rev-list(1), which is implemented both for normal graph walks and
> > >   for the packfile bitmap index.
> > 
> > I understand what you're looking for here, but I worry that '--filter'
> > might be too leaky of an abstraction.
> > 
> > I was a little surprised to learn that you can clone a repository with
> > --filter=object:type=tree (excluding commits), but it does work. I'm
> > fine reusing a lot of the object filtering code if it makes this an
> > easier task, but I think it may be worthwhile to hide this new kind of
> > filter from upload-pack.
> 
> I had a similar thought, but wouldn't the existing uploadpackfilter
> config take care of this?
> 
> I guess the catch-all "allow" option defaults to "true", so we'd support
> any new filters that are added. Which seems like a poor choice in
> general, but flipping it would mean that servers have to update their
> config.
> 
> I do wonder if it's that bad for clients to be able to specify something
> like this, though. Even though there's not that much use for it with a
> regular partial clone, it could conceivably be used for some special cases.
> I do think it would be more useful if you could OR together multiple
> types. Asking for "commits|tags|trees" is really the same as the already
> useful "blob:none". And "commits|tags" is the same as tree:depth=0.

I did waste a few thoughts on how this should be handled. I see two ways
of doing it:

    - We could just implement the new `object:type` filter such that it
      directly supports OR'ing. That's the easy way to do it, but it's
      inflexible.

    - We could extend combined filters to support OR-semantics in
      addition to the current AND-semantics. In the end, that'd be a
      much more flexible approach and potentially allow additional
      usecases.

I lean more towards the latter as it feels like the better design. But
it's more involved, and I'm not sure I want to do it as part of this
patch series.

Patrick
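
The difference between the two semantics discussed above can be
illustrated outside of git by post-filtering `git cat-file --batch-check`
output. This is a minimal sketch, assuming input lines of the form
`<type> <size> <oid>`; the function names and the comma-separated type
list are invented for this example and are not part of the proposed
`--filter` syntax.

```shell
#!/bin/sh
# Illustration only: these helpers post-filter "<type> <size> <oid>" lines,
# e.g. from
#   git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)'
# They do not reflect how git implements filters internally.

# AND semantics (what the thread calls the current behavior of combined
# filters): keep an object only if it is a blob AND smaller than the
# given size limit in bytes.
blob_and_limit() {
    awk -v limit="$1" '$1 == "blob" && ($2 + 0) < (limit + 0) { print $3 }'
}

# OR semantics: keep an object if its type matches ANY of the
# comma-separated types, e.g. "commit,tag,tree".
any_type() {
    awk -v types="$1" '
        BEGIN {
            n = split(types, list, ",")
            for (i = 1; i <= n; i++)
                wanted[list[i]] = 1
        }
        ($1 in wanted) { print $3 }'
}
```

With these helpers, `any_type commit,tag,tree` keeps the set of objects
that Jeff describes as equivalent to `blob:none`, while
`blob_and_limit 200` mirrors the combination of `object:type=blob` and
`blob:limit=200`.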

Jeff King March 11, 2021, 5:54 p.m. UTC | #6

On Thu, Mar 11, 2021 at 03:38:11PM +0100, Patrick Steinhardt wrote:

> > Those produce very different answers. I guess because in the first one,
> > you still have a bunch of tree objects, too. You'd do much better to get
> > the actual types from cat-file, and filter on that. That also lets you
> > use bitmaps for the traversal portion. E.g.:
> 
> They do provide different answers, and you're right that `--batch-check`
> would have helped to filter by type. Your idea doesn't really work in my
> usecase though to identify LFS pointers, at least not without additional
> tooling on top of what you've provided. There'd at least need to be two
> git-cat-file(1) processes: one to do the `--batch-check` thing to
> actually filter by object type, and one to then read the actual LFS
> pointer candidates from disk in order to see whether they are LFS
> pointers or not.
> 
> Actually, we currently are doing something similar to that at GitLab: we
> list all potential candidates via git-rev-list(1), write the output into
> `git-cat-file --batch-check`, and anything that is a blob then gets
> forwarded into `git-cat-file --batch`.

You'd need that final cat-file with your patch, too, though. So I think
it makes sense to think about "generate the list of blobs" as the
primary action.

You can of course do the type and content dump as a single cat-file, but
in my experience that is much slower (because we waste time dumping
object content that the caller ultimately won't care about).

Thinking in the opposite direction, if we are filtering by type via
cat-file, we could do the size filter there, too. So:

  git rev-list --use-bitmap-index --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)' |
  perl -lne 'print $2 if /^blob (\d+) (.*)/ && $1 < 200'

which produces the same answer as my earlier:

> >   $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
> >          git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
> > 	 perl -lne 'print $1 if /^blob (.*)/' | wc -l

but takes about twice as long. Which is really just a roundabout way of
saying that yes, shoving things into "rev-list" can provide substantial
speedups. :)

> > which is faster than what you showed above (this is on linux.git, but my
> > result is different; maybe you have more refs than me?). But we should
> > be able to do better purely internally, so I suspect my computer is just
> > faster (or maybe your extra refs just aren't well-covered by bitmaps).
> > Running with your patches I get:
> 
> I've got quite a beefy machine with a Ryzen 3 5800X, and I did do a `git
> repack -Adfb` right before doing benchmarks. I do have the stable kernel
> repository added though, which accounts for quite a lot of additional
> references (3938) and objects (9.3M).

Yeah, I wondered if it was something like that. Mine is just
torvalds/linux.git. Fetching stable/linux.git from kernel.org, running
"git repack -adb" on the result, and then repeating my timings gets me
numbers close to yours.

> > which is indeed faster. It's quite curious that the answer is not the
> > same, though! I think yours has some bugs. If I sort and diff the
> > results, I see some commits mentioned in the output. Perhaps this is
> > --filter-provided not working, as they all seem to be ref tips.
> 
> I noticed it, too, and couldn't yet find an answer why that is.
> Honestly, I found the NOT_USER_GIVEN flag quite confusing and I'm not at
> all sure whether I've got all cases covered correctly. The previous way
> this was handled (`USER_GIVEN` instead of `NOT_USER_GIVEN`) would've
> been easier to figure out for this specific usecase. But I guess it was
> converted due to specific reasons.
> 
> I'll invest some more time to figure out what's happening here.

Thanks. I also scratched my head at NOT_USER_GIVEN. I haven't looked at
this part of the filter code very much, but it seems like that is a
recipe for accidentally marking a commit as NOT_USER_GIVEN if we
traverse to it (even if it was originally _also_ given by the user).

-Peff

> > That's probably reasonable, especially because it lets us use bitmaps. I
> > do have a dream that we'll eventually be able to support more extensive
> > formatting via log/rev-list, which would allow:
> > 
> >   git rev-list --use-bitmap-index --objects --all \
> >                --format='%(objecttype) %(objectname)' |
> >   perl -lne 'print $1 if /^blob (.*)/'
> > 
> > That should be faster than the separate cat-file (which has to re-lookup
> > each object, in addition to the extra pipe overhead), but I expect the
> > --filter solution should always be faster still, as it can very quickly
> > eliminate the majority of the objects at the bitmap level.
> 
> That'd be nice, even though it wouldn't help in my particular usecase: I
> need to read each candidate blob to see whether it's an LFS pointer or
> not anyway.

I think it works out roughly the same as the --filter solution, in the
sense that both generate a list of candidate blobs that you'd read with
"cat-file --batch" (but of course it's still slower).

-Peff

Jeff King March 11, 2021, 5:56 p.m. UTC | #7

On Thu, Mar 11, 2021 at 03:43:39PM +0100, Patrick Steinhardt wrote:

> > I do wonder if it's that bad for clients to be able to specify something
> > like this, though. Even though there's not that much use for it with a
> > regular partial clone, it could conceivably be used for some special cases.
> > I do think it would be more useful if you could OR together multiple
> > types. Asking for "commits|tags|trees" is really the same as the already
> > useful "blob:none". And "commits|tags" is the same as tree:depth=0.
> 
> I did waste a few thoughts on how this should be handled. I see two ways
> of doing it:
> 
>     - We could just implement the new `object:type` filter such that it
>       directly supports OR'ing. That's the easy way to do it, but it's
>       inflexible.
> 
>     - We could extend combined filters to support OR-semantics in
>       addition to the current AND-semantics. In the end, that'd be a
>       much more flexible approach and potentially allow additional
>       usecases.
> 
> I lean more towards the latter as it feels like the better design. But
> it's more involved, and I'm not sure I want to do it as part of this
> patch series.

Yeah, I don't think that needs to be part of this series. The only thing
to consider for this series is whether it's a problem for clients to be
able to ask for type=blob from a server which has blindly turned on
uploadpack.allowFilter without restricting the types.

My gut says that it's fine. Even if we don't have a particular use, I don't
think it hurts (and in general, I think people running public servers
with bitmaps really ought to set uploadpackfilter.allow=false anyway,
because stuff like non-zero tree-depth filters are expensive).

-Peff
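
A server operator following this advice might configure a repository
roughly as below. This is a sketch, not a recommendation made in the
thread: the helper name is hypothetical, and re-allowing only `blob:none`
is an assumption chosen for the example. `uploadpack.allowFilter`,
`uploadpackfilter.allow` and `uploadpackfilter.<filter>.allow` are
existing git configuration keys.

```shell
#!/bin/sh
# Hypothetical helper: restrict which object filters upload-pack accepts
# for the repository at path "$1".
restrict_upload_filters() {
    repo=$1
    # Advertise filter support to clients at all.
    git -C "$repo" config uploadpack.allowFilter true
    # Reject every filter that is not explicitly allowed below.
    git -C "$repo" config uploadpackfilter.allow false
    # Assumption for this example: blob-less partial clones stay allowed.
    git -C "$repo" config uploadpackfilter.blob:none.allow true
}
```

With the catch-all set to false, any filter type added later stays
disabled until the operator opts in to it.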

Patrick Steinhardt March 15, 2021, 11:25 a.m. UTC | #8

On Wed, Mar 10, 2021 at 04:39:22PM -0500, Jeff King wrote:
> On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:
> 
> > Altogether, this ends up with the following queries, both of which have
> > been executed in a well-packed linux.git repository:
> > 
> >     # Previous query which uses object names as a heuristic to filter
> >     # non-blob objects, which bars us from using bitmap indices because
> >     # they cannot print paths.
> >     $ time git rev-list --objects --filter=blob:limit=200 \
> >         --object-names --all | sed -r '/^.{,41}$/d' | wc -l
> >     4502300
> > 
> >     real 1m23.872s
> >     user 1m30.076s
> >     sys  0m6.002s
> > 
> >     # New query.
> >     $ time git rev-list --objects --filter-provided \
> >         --filter=object:type=blob --filter=blob:limit=200 \
> >         --use-bitmap-index --all | wc -l
> >     22585
> > 
> >     real 0m19.216s
> >     user 0m16.768s
> >     sys  0m2.450s
> 
> Those produce very different answers. I guess because in the first one,
> you still have a bunch of tree objects, too. You'd do much better to get
> the actual types from cat-file, and filter on that. That also lets you
> use bitmaps for the traversal portion. E.g.:
> 
>   $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
>          git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
> 	 perl -lne 'print $1 if /^blob (.*)/' | wc -l
>   14966
>   
>   real	0m6.248s
>   user	0m7.810s
>   sys	0m0.440s
> 
> which is faster than what you showed above (this is on linux.git, but my
> result is different; maybe you have more refs than me?). But we should
> be able to do better purely internally, so I suspect my computer is just
> faster (or maybe your extra refs just aren't well-covered by bitmaps).
> Running with your patches I get:
> 
>   $ time git rev-list --objects --use-bitmap-index --all \
>              --filter-provided --filter=object:type=blob \
> 	     --filter=blob:limit=200 | wc -l
>   16339
> 
>   real	0m1.309s
>   user	0m1.234s
>   sys	0m0.079s
> 
> which is indeed faster. It's quite curious that the answer is not the
> same, though! I think yours has some bugs. If I sort and diff the
> results, I see some commits mentioned in the output. Perhaps this is
> --filter-provided not working, as they all seem to be ref tips.
[snip]

I've found the issue: when converting filters to a combined filter via
`transform_to_combine_type()`, we reset the top-level filter via a call
to `memset()`. So for combined filters, the option was cleared, and thus
had no effect, if and only if `--filter-provided` came before the second
filter on the command line.

Patrick
