fast-import's hash table is slow

[breaking thread, since this is really an independent topic]

On Mon, Mar 30, 2020 at 10:09:30AM -0400, Jeff King wrote:

> So I arrived at this fast-import solution, which was...not super fast.
> Profiling showed that we were spending 80% of the time inserting into
> our custom hashtable, which is fixed at 2^16 entries and then chains
> beyond that. Swapping it out for a khash proved much faster, but I'm not
> sure if the memory games are too gross (see the comment in find_object
> below).
> 
> I also didn't look into whether we could get rid of the extra allocating
> pool (and store the structs right in the hash), or if it's necessary for
> their pointers to be stable.

I briefly tried to get rid of the pool. I _think_ it should be possible,
but I did see some test failures. It's entirely possible I screwed it
up. However, I did generate a few interesting measurements showing how
the current hash table behaves on this test:

  git init repo
  cd repo
  perl -e '
      my $bits = shift;
      my $nr = 2**$bits;

      for (my $i = 0; $i < $nr; $i++) {
              print "blob\n";
              print "data 4\n";
              print pack("N", $i);
      }
  ' "$@" | git fast-import

Here are wall-clock timings for the current tip of master, versus with
the patch below applied:

nr_objects   master       patch
2^20         0m04.317s    0m5.109s
2^21         0m10.204s    0m9.702s
2^22         0m27.159s    0m17.911s
2^23         1m19.038s    0m35.080s
2^24         4m18.766s    1m10.233s

The curve on master is quadratic-ish (each line has double the number of
objects of the previous one; the times don't multiply by 4, but that's
because the hash table is only part of the work we're doing). With my
patch, it's pretty linear.

But I'm still disappointed that the smallest case is actually _slower_
with the patch. The existing hash table is so simple I can imagine using
khash has a little overhead. But I'm surprised it would be so much (or
that the existing hash table does OK at 2^20; it only has 2^16 buckets).

Maybe this email will nerd-snipe René into poking at it.

The patch I tested is below (it's slightly different than what I showed
before, in that it handles duplicate insertions). Maybe using hashmap.c
would be better?

---

Message ID	20200331094553.GB7274@coredump.intra.peff.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=DhGT=5Q=vger.kernel.org=git-owner@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1498B81 for <patchwork-git@patchwork.kernel.org>; Tue, 31 Mar 2020 09:45:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E8758207FF for <patchwork-git@patchwork.kernel.org>; Tue, 31 Mar 2020 09:45:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730333AbgCaJpy (ORCPT <rfc822;patchwork-git@patchwork.kernel.org>); Tue, 31 Mar 2020 05:45:54 -0400 Received: from cloud.peff.net ([104.130.231.41]:56670 "HELO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1729997AbgCaJpy (ORCPT <rfc822;git@vger.kernel.org>); Tue, 31 Mar 2020 05:45:54 -0400 Received: (qmail 23604 invoked by uid 109); 31 Mar 2020 09:45:54 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with SMTP; Tue, 31 Mar 2020 09:45:54 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 28199 invoked by uid 111); 31 Mar 2020 09:55:59 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Tue, 31 Mar 2020 05:55:59 -0400 Authentication-Results: peff.net; auth=none Date: Tue, 31 Mar 2020 05:45:53 -0400 From: Jeff King <peff@peff.net> To: git@vger.kernel.org Subject: fast-import's hash table is slow Message-ID: <20200331094553.GB7274@coredump.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: <git.vger.kernel.org> X-Mailing-List: git@vger.kernel.org
Series	fast-import's hash table is slow \| expand fast-import's hash table is slow

fast-import's hash table is slow

Commit Message

Comments

Patch