mbox series

[0/5,RFC] introduce Roaring bitmaps to Git

Message ID pull.1357.git.1663609659.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series introduce Roaring bitmaps to Git | expand

Message

Jean-Noël Avila via GitGitGadget Sept. 19, 2022, 5:47 p.m. UTC
Git currently uses ewah bitmaps ( which are based on run-length encoding) to
compress bitmaps. Ewah bitmaps stores bitmaps in the form of run-length
words i.e. instead of storing each and every bit, it tries to find
consecutive bits (having same value) and replace them with the value bit and
the range upto which the bit is present. It is simple and efficient. But one
downside of this approach is that we have to decompress the whole bitmap in
order to find the bit of a certain position.

For small (or medium sized) bitmaps, this is not an issue. But it can be an
issue for large (or extra large) bitmaps. In that case roaring bitmaps are
generally more efficient[1] than ewah itself. Some benchmarks suggests that
roaring bitmaps give more performance benefits than ewah or any other
similar compression technique.

This patch series is currently in RFC state and it aims to let Git use
roaring bitmaps. As this is an RFC patch series (for now), the code are not
fully accurate (i.e. some tests are failing). But it is backward-compatible
(tests related to ewah bitmaps are passing). Some commit messages might need
more explanation and some commits may need a split (specially the one that
implement writing roaring bitmaps). Overall, the structure and code are near
to ready to make the series a formal patch series.

I am submitting it as an RFC (after discussions with mentors) because the
GSoC coding period is about to end. I will continue to work on the patch
series.

Abhradeep Chakraborty (5):
  reachability-bitmaps: add CRoaring library to Git
  roaring.[ch]: apply Git specific changes to the roaring API
  roaring: teach Git to write roaring bitmaps
  roaring: introduce a new config option for roaring bitmaps
  roaring: teach Git to read roaring bitmaps

 Makefile                   |     3 +
 bitmap.c                   |   225 +
 bitmap.h                   |    33 +
 builtin/diff.c             |    10 +-
 builtin/multi-pack-index.c |     5 +
 builtin/pack-objects.c     |    81 +-
 ewah/bitmap.c              |    61 +-
 ewah/ewok.h                |    37 +-
 midx.c                     |     7 +
 midx.h                     |     1 +
 pack-bitmap-write.c        |   326 +-
 pack-bitmap.c              |   969 +-
 pack-bitmap.h              |    27 +-
 roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
 roaring/roaring.h          |  1028 ++
 t/t5310-pack-bitmaps.sh    |    79 +-
 16 files changed, 22490 insertions(+), 449 deletions(-)
 create mode 100644 bitmap.c
 create mode 100644 bitmap.h
 create mode 100644 roaring/roaring.c
 create mode 100644 roaring/roaring.h


base-commit: d3fa443f97e3a8d75b51341e2d5bac380b7422df
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1357%2FAbhra303%2Froaring-bitmap-exp-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1357/Abhra303/roaring-bitmap-exp-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1357

Comments

Derrick Stolee Sept. 19, 2022, 6:18 p.m. UTC | #1
On 9/19/2022 1:47 PM, Abhradeep Chakraborty via GitGitGadget wrote:
> Git currently uses ewah bitmaps ( which are based on run-length encoding) to
> compress bitmaps. Ewah bitmaps stores bitmaps in the form of run-length
> words i.e. instead of storing each and every bit, it tries to find
> consecutive bits (having same value) and replace them with the value bit and
> the range upto which the bit is present. It is simple and efficient. But one
> downside of this approach is that we have to decompress the whole bitmap in
> order to find the bit of a certain position.
> 
> For small (or medium sized) bitmaps, this is not an issue. But it can be an
> issue for large (or extra large) bitmaps. In that case roaring bitmaps are
> generally more efficient[1] than ewah itself. Some benchmarks suggests that
> roaring bitmaps give more performance benefits than ewah or any other
> similar compression technique.
> 
> This patch series is currently in RFC state and it aims to let Git use
> roaring bitmaps. As this is an RFC patch series (for now), the code are not
> fully accurate (i.e. some tests are failing). But it is backward-compatible
> (tests related to ewah bitmaps are passing). Some commit messages might need
> more explanation and some commits may need a split (specially the one that
> implement writing roaring bitmaps). Overall, the structure and code are near
> to ready to make the series a formal patch series.
> 
> I am submitting it as an RFC (after discussions with mentors) because the
> GSoC coding period is about to end. I will continue to work on the patch
> series.

I look forward to your next version. I hope to see some information about
the performance characteristics across the two versions. Specifically:

1. How do various test in t/perf/ change between the two formats?
2. For certain test repos (git/git, torvalds/linux, etc.) how much does
   the .bitmap file change in size across the formats?
 
>  Makefile                   |     3 +
>  bitmap.c                   |   225 +
>  bitmap.h                   |    33 +
...
>  ewah/bitmap.c              |    61 +-
>  ewah/ewok.h                |    37 +-
...
>  roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
>  roaring/roaring.h          |  1028 ++

I wonder if there is value in modifying the structure of these files
into a bitmap/ directory and then perhaps ewah/ and roaring/ within
each? Just a thought.

Thanks,
-Stolee
Abhradeep Chakraborty Sept. 20, 2022, 2:05 p.m. UTC | #2
Hi Derrick,

On Mon, Sep 19, 2022 at 11:48 PM Derrick Stolee
<derrickstolee@github.com> wrote:
> I look forward to your next version. I hope to see some information about
> the performance characteristics across the two versions. Specifically:
>
> 1. How do various test in t/perf/ change between the two formats?
> 2. For certain test repos (git/git, torvalds/linux, etc.) how much does
>    the .bitmap file change in size across the formats?

Yeah, sure. I will be including the performance test result in the
next version :)

> >  Makefile                   |     3 +
> >  bitmap.c                   |   225 +
> >  bitmap.h                   |    33 +
> ...
> >  ewah/bitmap.c              |    61 +-
> >  ewah/ewok.h                |    37 +-
> ...
> >  roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
> >  roaring/roaring.h          |  1028 ++
>
> I wonder if there is value in modifying the structure of these files
> into a bitmap/ directory and then perhaps ewah/ and roaring/ within
> each? Just a thought.

Great idea! Thanks! Will change it in the next version..

Thanks :)
Taylor Blau Sept. 20, 2022, 9:59 p.m. UTC | #3
On Mon, Sep 19, 2022 at 05:47:34PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> This patch series is currently in RFC state and it aims to let Git use
> roaring bitmaps. As this is an RFC patch series (for now), the code are not
> fully accurate (i.e. some tests are failing). But it is backward-compatible
> (tests related to ewah bitmaps are passing). Some commit messages might need
> more explanation and some commits may need a split (specially the one that
> implement writing roaring bitmaps). Overall, the structure and code are near
> to ready to make the series a formal patch series.

Extremely exciting. Congratulations on all of your work so far. I'm
hopeful that you'll continue working on this after GSoC is over (for
those playing along at home, Abhradeep's coding period was extended by a
couple of weeks).

But even if you don't, this is a great artifact to leave around on the
list for somebody else who is interested in this area to pick up in the
future, and benefit from all of the work that you've done so far.

I am still working through my post-Git Merge backlog, but I'm looking
forward to reading these patches soon. I'm glad that other reviewers
have already started to dive in :-).

Well done!


Thanks,
Taylor
Abhradeep Chakraborty Sept. 21, 2022, 3:27 p.m. UTC | #4
On Wed, Sep 21, 2022 at 3:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Sep 19, 2022 at 05:47:34PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> > This patch series is currently in RFC state and it aims to let Git use
> > roaring bitmaps. As this is an RFC patch series (for now), the code are not
> > fully accurate (i.e. some tests are failing). But it is backward-compatible
> > (tests related to ewah bitmaps are passing). Some commit messages might need
> > more explanation and some commits may need a split (specially the one that
> > implement writing roaring bitmaps). Overall, the structure and code are near
> > to ready to make the series a formal patch series.
>
> Extremely exciting. Congratulations on all of your work so far. I'm
> hopeful that you'll continue working on this after GSoC is over (for
> those playing along at home, Abhradeep's coding period was extended by a
> couple of weeks).

Yeah, I will continue (or better to say I am continuing) my work. I
hope that I can submit the next version in the upcoming few days.

Thanks for supporting and guiding me throughout the GSoC period. I
have learned a lot of new things during this period.

> I am still working through my post-Git Merge backlog, but I'm looking
> forward to reading these patches soon. I'm glad that other reviewers
> have already started to dive in :-).

No problem, I am doing some improvements by this time.
By the way, I am very excited to see the Youtube Git-Merge recordings ;)

Thanks :)