mbox series

[0/6] Speed up mirror-fetches with many refs

Message ID cover.1629452412.git.ps@pks.im (mailing list archive)
Headers show
Series Speed up mirror-fetches with many refs | expand

Message

Patrick Steinhardt Aug. 20, 2021, 10:08 a.m. UTC
Hi,

I've taken another look at fetches in the context of repos with a huge
amount of refs. This time around, I've taken a look at mirror fetches:
in our notorious repo with 2.3M refs, these mirror fetches can take up
to several minutes of time even if there are no changes at all.

As it turns out, many of the issues are again caused by loading and
dereferencing refs. This patch series thus mostly focusses on optimizing
the patterns there, where the biggest win is to opportunistically load
refs via commit-graphs. The following numbers were all calculated for a
mirror-fetch of above 2.3M refs repo on the local disk:

    - Patch 1 speeds up the way we look up commits when appending to
      FETCH_HEAD via the commit-graph, resulting in a ~40% speedup.

    - Patch 2 optimizes the way we check for object existence for a 7%
      speedup.

    - Patch 3 is a cleanup patch which changes the iterator functions
      passed to our connectivity checks. I was hoping for a speedup
      given that we can now avoid copying objects (which could have an
      effect with 2.3M copied OIDs), but unfortunately it didn't. In any
      case, I still think that the end result is much cleaner.

    - Patch 4 optimizes git-fetch-pack(1) to use the commit-graph. This
      is a small win of about ~2%. It's debatable whether this patch is
      worth it.

    - Patch 5 is a preparatory commit which refactors `fetch_refs()` to
      be more readily extendable.

    - Patch 6 optimizes an edge case where we're doing two connectivity
      checks even if the first connectivity check noticed we already had
      all objects locally available, skipping the fetch. This brings a
      15% speedup.

In combination with my previous optimizations for git-fetch-pack(1) and
the connectivity check, this improves performance from 71s
(ps/fetch-pack-load-refs-optim), to 54s (ps/connectivity-optim) to 26s
(this series).

Note that this series depends on ps/connectivity-optim and thus only
applies on top of next.

Patrick

[1]: <08519b8ab6f395cffbcd5e530bfba6aaf64241a2.1628085347.git.ps@pks.im>


Patrick Steinhardt (6):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++--
 builtin/fetch.c        | 84 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++-----
 connected.c            | 15 ++++----
 connected.h            |  2 +-
 fetch-pack.c           | 14 ++++---
 6 files changed, 74 insertions(+), 66 deletions(-)

Comments

Derrick Stolee Aug. 20, 2021, 2:50 p.m. UTC | #1
On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
...
> As it turns out, many of the issues are again caused by loading and
> dereferencing refs. This patch series thus mostly focusses on optimizing
> the patterns there, where the biggest win is to opportunistically load
> refs via commit-graphs.

You caught my attention at "commit-graph" and I found your use of them
to be interesting. You strike a balance in checking the commit-graph
when it is likely to be helpful, and skip the commit-graph when it is
not. (For example, PATCH 2 is unlikely to benefit from checking the
commit-graph at that point, because we are looking for objects that
were just downloaded.)

I read all the patches and checked the full context of the functions
to see if there were any issues, but found none. My only comments are
about the case of many annotated tags (do we slow down?) and some
nitpicks.

Thanks,
-Stolee
Junio C Hamano Aug. 21, 2021, 12:09 a.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> I've taken another look at fetches in the context of repos with a huge
> amount of refs. This time around, I've taken a look at mirror fetches:
> in our notorious repo with 2.3M refs, these mirror fetches can take up
> to several minutes of time even if there are no changes at all.

I notice that 4/6 (and no other patch) is not signed-off and wonder
if there is a reason (e.g. you are not so comfortable with the idea
behind the step or the implementation) or a simple oversight?

> Note that this series depends on ps/connectivity-optim and thus only
> applies on top of next.

It seems that the dependency of this series is not just 'master'
plus 'ps/connectivity-optim' but some more stuff from 'next'.  I
expact that topics that have been cooking in 'next' during the
previous cycle will graduate to 'master' early next week, so perhaps
it is easier to handle this kind of "depends on some stuff in
'next'" topics after that happens.

Thanks.