mbox series

[0/3] switch to tombstone-free khashl table

Message ID 20240325230704.262272-1-e@80x24.org (mailing list archive)
Headers show
Series switch to tombstone-free khashl table | expand

Message

Eric Wong March 25, 2024, 11:07 p.m. UTC
The memory improvement is minor, but any memory reduction at all
is welcome at this point.  Fortunately, this set of changes is
unintrusive.

I have some other ideas that I'll hopefully get to implement before
swapping kills all my SSDs (see bottom).

Eric Wong (3):
  list-objects-filter: use kh_size API
  treewide: switch to khashl for memory savings
  khashl: fix ensemble lookups on empty table

 builtin/fast-import.c       |   2 +-
 builtin/fsmonitor--daemon.c |   4 +-
 delta-islands.c             |   4 +-
 khash.h                     | 338 -----------------------
 khashl.h                    | 522 ++++++++++++++++++++++++++++++++++++
 list-objects-filter.c       |   2 +-
 object-store-ll.h           |   2 +-
 object-store.h              |   7 +-
 oidset.h                    |   2 +-
 pack-bitmap.h               |   2 +-
 10 files changed, 535 insertions(+), 350 deletions(-)
 delete mode 100644 khash.h
 create mode 100644 khashl.h

TODO (some ideas stolen from khashl):

* obj_hash can probably use a 0.75 load factor (instead of 0.5)
  to save memory and not slow down too much.  oid_* khash has
  always had 0.77 which was fine and now khashl has 0.75.
  0.75 is cheaper to compute via (shifts + ORs) than 0.77.

* obj_hash can use `i = (i + 1) & mask' like khashl does to
  avoid branches in linear probe loops

* obj_hash can use realloc and copy in-place resize logic khashl does.
  khashl uses the ->used bitmap for this, but it should be
  possible to do for obj_hash w/o additional allocations
  via pointer tagging

(not khashl inspired):

* keep an identity pool of OIDs separately (similar to how Perl5
  pools all hash keys) and only use tagged pointers for OIDs.
  Pointer tagging can be used to distinguish between 7 hash
  functions, leaving us room for 5 more in addition to SHA-(1|256).
  This change will be a large effort (with a hopefully large savings).
  If we ever need more than 7 hash functions, we can switch to
  storing hash type information in slabs that can be looked up
  using address ranges (AFAIK, jemalloc does this).

Comments

Junio C Hamano March 26, 2024, 5:40 p.m. UTC | #1
Eric Wong <e@80x24.org> writes:

> The memory improvement is minor, but any memory reduction at all
> is welcome at this point.  Fortunately, this set of changes is
> unintrusive.
>
> I have some other ideas that I'll hopefully get to implement before
> swapping kills all my SSDs (see bottom).

Please describe what this topic aims at to sell the topic better.
Are we trying to reduce memory footprint?  In other words, if this
topic were to hit a released version of Git, what would the short
paragraph description for the topic in the release notes look like?

 * The khash.h hashtable implementation has been replaced with
   khashl.h that is mostly API compatible with reduced memory
   consumption, simpler insertion and a bit slower deletion.

or somesuch.

A performance oriented topic would be helped to have benchmark
numbers to show how much improvement it makes and a memory reduction
topic would be helped to have some numbers in the cover letter.  It
is OK to summarize/duplicate what appears in the proposed log
message of some step; it does not need too much text to say 100MB
total allocations reduced by 10MB or something like that, for
example.

An API improvement topic would be helped to have an example rewrite
of a caller (or just a reference to a representative one, i.e., "see
how the caller in function X gets simplified in [PATCH 04/28]") in
the cover letter.

A bugfix topic would be helped to have an end-user visible effect in
the cover letter.

Thanks.
Junio C Hamano April 19, 2024, 9:31 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Eric Wong <e@80x24.org> writes:
>
>> The memory improvement is minor, but any memory reduction at all
>> is welcome at this point.  Fortunately, this set of changes is
>> unintrusive.
>>
>> I have some other ideas that I'll hopefully get to implement before
>> swapping kills all my SSDs (see bottom).
>
> Please describe what this topic aims at to sell the topic better.
> Are we trying to reduce memory footprint?  In other words, if this
> topic were to hit a released version of Git, what would the short
> paragraph description for the topic in the release notes look like?
> ...

So, did anything happened since this exchange?  I remember that we
caught and fixed a few minor sparse errors, but other than that, I
am not sure what to do with this topic.

Not that I want to merge loud tree-wide topics down during the
prerelease period...

Thanks.
Eric Wong April 19, 2024, 9:46 p.m. UTC | #3
Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> > Eric Wong <e@80x24.org> writes:
> >
> >> The memory improvement is minor, but any memory reduction at all
> >> is welcome at this point.  Fortunately, this set of changes is
> >> unintrusive.
> >>
> >> I have some other ideas that I'll hopefully get to implement before
> >> swapping kills all my SSDs (see bottom).
> >
> > Please describe what this topic aims at to sell the topic better.
> > Are we trying to reduce memory footprint?  In other words, if this
> > topic were to hit a released version of Git, what would the short
> > paragraph description for the topic in the release notes look like?
> > ...
> 
> So, did anything happened since this exchange?  I remember that we
> caught and fixed a few minor sparse errors, but other than that, I
> am not sure what to do with this topic.

Not really...  I've been thinking we can beat khashl for our
purposes usage by allowing explicit tombstones to be configured
and getting rid of the ->used bitmap entirely.

One goal is to eventually reuse the code with the open-coded
hash table in object.c   That said, I've been bogged down by
personal crap this year and don't know how/if I'll be able to
hack on it this year.

> Not that I want to merge loud tree-wide topics down during the
> prerelease period...

No worries, we can keep it out for now and work on it
incrementally or drop it entirely in favor of something else
down the line...


git's about to turn 20, and I really want to ensure it and the
tools around it continue to be usable on computers from 20-25
years ago.