[00/17] Refactor chunk-format into an API

Message ID	pull.848.git.1611676886.gitgitgadget@gmail.com (mailing list archive)
Headers

Return-Path: <git-owner@kernel.org>
Message-Id: <pull.848.git.1611676886.gitgitgadget@gmail.com>
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Date: Tue, 26 Jan 2021 16:01:09 +0000
Subject: [PATCH 00/17] Refactor chunk-format into an API
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de,
        szeder.dev@gmail.com, Derrick Stolee <derrickstolee@github.com>
Precedence: bulk

Message

Derrick Stolee via GitGitGadget Jan. 26, 2021, 4:01 p.m. UTC

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/
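
As a rough illustration of the "anonymous struct" idea described above, here
is a minimal sketch of an opaque type in C. It is not the code from this
series: the demo_ names, the fields, and the growth strategy are all
assumptions made up for the example. The point is only that callers see a
forward declaration and a handful of functions, while the full definition
would live in a single .c file.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Part 1: what a header could expose (hypothetical names). The type is
 * incomplete here, so callers can hold a pointer but cannot touch fields.
 */
struct demo_chunkfile;
struct demo_chunkfile *demo_chunkfile_new(void);
int demo_chunkfile_add(struct demo_chunkfile *cf, uint32_t id, uint64_t size);
size_t demo_chunkfile_count(const struct demo_chunkfile *cf);
void demo_chunkfile_free(struct demo_chunkfile *cf);

/* Part 2: what would live only in the .c file (assumed layout). */
struct demo_chunk {
	uint32_t id;
	uint64_t size;
};

struct demo_chunkfile {
	struct demo_chunk *chunks;
	size_t nr, alloc;
};

struct demo_chunkfile *demo_chunkfile_new(void)
{
	/* calloc zeroes the struct: no chunks and no allocation yet */
	return calloc(1, sizeof(struct demo_chunkfile));
}

int demo_chunkfile_add(struct demo_chunkfile *cf, uint32_t id, uint64_t size)
{
	if (cf->nr == cf->alloc) {
		/* grow the array geometrically; return -1 if out of memory */
		size_t alloc = cf->alloc ? 2 * cf->alloc : 4;
		struct demo_chunk *grown = realloc(cf->chunks,
						   alloc * sizeof(*grown));
		if (!grown)
			return -1;
		cf->chunks = grown;
		cf->alloc = alloc;
	}
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].size = size;
	cf->nr++;
	return 0;
}

size_t demo_chunkfile_count(const struct demo_chunkfile *cf)
{
	return cf->nr;
}

void demo_chunkfile_free(struct demo_chunkfile *cf)
{
	if (!cf)
		return;
	free(cf->chunks);
	free(cf);
}
```

If Part 1 were placed in a header and Part 2 in the matching .c file, any
consumer that tried to write cf->nr directly would fail to compile, which is
the kind of deterrent the paragraph above is after.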

This combined API is beneficial to reduce duplicated logic. Or rather, to
ensure that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.
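
To make the two guards named above concrete, here is a small sketch of a
table-of-contents check. It is an assumption-laden example rather than the
code in chunk-format.c: the in-memory entry layout, the demo_ names, and the
status codes are invented, and parsing the on-disk table is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one table-of-contents entry. */
struct demo_toc_entry {
	uint32_t id;      /* chunk identifier */
	uint64_t offset;  /* where the chunk starts, from the start of the file */
};

enum demo_toc_status {
	DEMO_TOC_OK = 0,
	DEMO_TOC_BAD_OFFSET,  /* a chunk starts past the end of the file */
	DEMO_TOC_DUPLICATE    /* the same chunk id appears twice */
};

enum demo_toc_status demo_check_toc(const struct demo_toc_entry *toc,
				    size_t nr, uint64_t file_size)
{
	for (size_t i = 0; i < nr; i++) {
		if (toc[i].offset > file_size)
			return DEMO_TOC_BAD_OFFSET;

		/* compare against every earlier entry; tables are short */
		for (size_t j = 0; j < i; j++)
			if (toc[j].id == toc[i].id)
				return DEMO_TOC_DUPLICATE;
	}
	return DEMO_TOC_OK;
}
```

The quadratic duplicate scan is a deliberate simplification here; it assumes
a file has only a handful of chunks, so a hash or sorted lookup would add
little.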

Here are some stats for the end-to-end change:

 * 638 insertions(+), 456 deletions(-).
 * commit-graph.c: 171 insertions(+), 192 deletions(-)
 * midx.c: 196 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get a
bit smaller. Boilerplate things like abstracting methods to match
chunk_write_fn and chunk_read_fn make up a lot of these insertions. The
"interesting" code gets a lot smaller and cleaner.
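
For readers unfamiliar with the callback boilerplate mentioned above, the
following sketch shows the general shape of adapting a writer to a generic
function-pointer type that takes an untyped data pointer. The signature of
demo_write_fn, the output buffer, and the context struct are assumptions for
illustration; the real chunk_write_fn in the series may look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size output sink standing in for the real file. */
struct demo_out {
	unsigned char buf[64];
	size_t len;
};

/* Generic callback type: the writer only learns its context via void *. */
typedef int (*demo_write_fn)(struct demo_out *out, void *data);

/* Consumer-side context holding everything one chunk writer needs. */
struct demo_write_context {
	const uint32_t *values;
	size_t nr;
};

/* Adapter: recover the typed context, then emit big-endian 32-bit values. */
int demo_write_values(struct demo_out *out, void *data)
{
	struct demo_write_context *ctx = data;

	for (size_t i = 0; i < ctx->nr; i++) {
		uint32_t v = ctx->values[i];

		if (sizeof(out->buf) - out->len < 4)
			return -1; /* not enough room left in the sink */
		out->buf[out->len++] = (unsigned char)(v >> 24);
		out->buf[out->len++] = (unsigned char)(v >> 16);
		out->buf[out->len++] = (unsigned char)(v >> 8);
		out->buf[out->len++] = (unsigned char)v;
	}
	return 0;
}

/* The shared code calls every writer the same way, whatever its context. */
int demo_run_writer(demo_write_fn fn, struct demo_out *out, void *data)
{
	return fn(out, data);
}
```

Each consumer pays for one small adapter and one context struct per chunk,
which is where insertions of this kind tend to come from, while the shared
driver no longer needs to know about any consumer's types.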

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      |  54 +++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 165 +++++++
 chunk-format.h                                |  41 ++
 commit-graph.c                                | 363 +++++++-------
 midx.c                                        | 456 ++++++++----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 638 insertions(+), 456 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Comments

Junio C Hamano Jan. 26, 2021, 10:37 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is a restart on the topic previously submitted [1] but dropped because
> ak/corrected-commit-date was still in progress. This version is based on
> that branch.

Nice to see that we have an endorsement on ak/corrected-commit-date
topic ;-)

I've scanned this round of the topic and the patches were a pleasant read.

I may have other comments after a more careful reading, but so far,
I am happy with what I see here.

Thanks.

Taylor Blau Jan. 27, 2021, 2:29 a.m. UTC | #2

On Tue, Jan 26, 2021 at 04:01:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> This version also changes the approach to use a more dynamic interaction
> with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
> but I started again from scratch. I also go further to make struct chunkfile
> anonymous to API consumers. It is defined only in chunk-format.c, which
> should hopefully deter future users from interacting with that data
> directly.
>
> [2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

Great; I am very happy that you found my patch to be useful. I'm glad
that you decided to start from scratch, too, since as I recall there
were some unresolved test issues that I punted on in case you decided to
abandon the topic altogether.

> This combined API is beneficial to reduce duplicated logic. Or rather, to
> ensure that similar file formats have similar protections against bad data.
> The multi-pack-index code did not have as many guards as the commit-graph
> code did, but now they both share a common base that checks for things like
> duplicate chunks or offsets outside the size of the file.

Definitely good.

> Here are some stats for the end-to-end change:
>
>  * 638 insertions(+), 456 deletions(-).
>  * commit-graph.c: 171 insertions(+), 192 deletions(-)
>  * midx.c: 196 insertions(+), 260 deletions(-)
>
> While there is an overall increase to the code size, the consumers do get a
> bit smaller. Boilerplate things like abstracting methods to match
> chunk_write_fn and chunk_read_fn make up a lot of these insertions. The
> "interesting" code gets a lot smaller and cleaner.

Like I said in [1], I don't think a net +182 line diff is reason alone
not to pursue this topic. I don't think that a chunked index v3 will
come as part of my work on the on-disk revindex format, but I do think
that it's something brian may be interested in. So, I'm feeling rather
certain that we'll eventually have new callers, at which point this will
reduce duplication overall.

[1]: https://lore.kernel.org/git/X8%2FK1dUgUmwp8ZOv@nand.local/

Thanks,
Taylor