mbox series

[00/21] Introduce `USE_THE_REPOSITORY_VARIABLE` macro

Message ID cover.1718106284.git.ps@pks.im (mailing list archive)
Headers show
Series Introduce `USE_THE_REPOSITORY_VARIABLE` macro | expand

Message

Patrick Steinhardt June 11, 2024, 11:57 a.m. UTC
Hi,

use of the `the_repository` variable is nowadays considered to be
deprecated, and over time we want to convert our codebase to stop using
it in favor of explicitly passing down the repository to functions via
parameters. This effort faces some important problems though.

  - It is hard to prove that a certain code unit does not use
    `the_repository` anymore when sending patch series. The reviewer has
    no way to verify that it's not used anymore without reading through
    the code itself.

  - It is easy to sneak in new usages of `the_repository` by accident
    into a code unit that is already `the_repository`-clean.

  - There are many functions which implicitly use `the_repository`,
    which is really hard to spot.

This patch series aims to address those problems by introducing a new
`USE_THE_REPOSITORY_VARIABLE` macro. When unset, then the declarations
of `the_repository`, `the_hash_algo` and some functions that implicitly
depend on them will be hidden away. This makes it trivial to demonstrate
that a code unit is `the_repository`-free by removing the definition of
any such macro.

The patch series doesn't aim to be perfect. There are still many
functions that implicitly depend on `the_repository` which are not
hidden away. Those can be addressed over time, either by guarding them
as required or, even better, removing such functions altogether.

The series is structured as follows:

  - Patches 1 to 8 refactor "hash.h" such that its functions do not
    depend on `the_repository` anymore. As these are used almost
    everywhere, the whole patch series would be rather moot without such
    a refactoring as everything would depend on `the_repository`.

  - Patch 9 introduces `USE_THE_REPOSITORY_VARIABLE`, guarding the
    declaration of `the_repository` and `the_hash_algo`.

  - Patch 11 merges "hash-ll.h" back into "hash.h" as the split is now
    mostly pointless.

  - Patches 12 to 20 refactor various users of "hex.h" to not use
    functions depending on `the_repository` anymore.

  - Patch 21 guards declarations of functions that depend on
    `the_repository` in "hex.h".

The series depends on ps/ref-sotrage-migration at 25a0023f28
(builtin/refs: new command to migrate ref storage formats, 2024-06-06).
This is moslty due to patch 10, which fixes a cyclic include that breaks
the build once we merge "hash-ll.h" back into "hash.h".

There are some minor textual and semantic conflicts with "seen" that can
be fixed as shown below.

Thanks!

Patrick

    diff --cc bloom.c
    index bbf146fc30,d080a1b616..0000000000
    --- a/bloom.c
    +++ b/bloom.c
    @@@ -6,7 -6,9 +6,10 @@@
      #include "commit-graph.h"
      #include "commit.h"
      #include "commit-slab.h"
     +#include "repository.h"
    + #include "tree.h"
    + #include "tree-walk.h"
    + #include "config.h"
      
      define_commit_slab(bloom_filter_slab, struct bloom_filter);
      
    diff --cc midx.c
    index 3992b05465,5aa7e2a6e6..0000000000
    --- a/midx.c
    +++ b/midx.c
    @@@ -303,11 -530,12 +532,13 @@@ struct object_id *nth_midxed_object_oid
                        struct multi_pack_index *m,
                        uint32_t n)
      {
    - 	if (n >= m->num_objects)
    + 	if (n >= m->num_objects + m->num_objects_in_base)
            return NULL;
      
    + 	n = midx_for_object(&m, n);
    + 
     -	oidread(oid, m->chunk_oid_lookup + st_mult(m->hash_len, n));
     +	oidread(oid, m->chunk_oid_lookup + st_mult(m->hash_len, n),
     +		the_repository->hash_algo);
        return oid;
      }
      
    diff --cc pack-bitmap-write.c
    index 37a8ad0fb3,6e8060f8a0..0000000000
    --- a/pack-bitmap-write.c
    +++ b/pack-bitmap-write.c
    @@@ -817,8 -1022,8 +1024,8 @@@ void bitmap_writer_finish(struct bitmap
        memcpy(header.magic, BITMAP_IDX_SIGNATURE, sizeof(BITMAP_IDX_SIGNATURE));
        header.version = htons(default_version);
        header.options = htons(flags | options);
    - 	header.entry_count = htonl(writer->selected_nr);
    + 	header.entry_count = htonl(bitmap_writer_nr_selected_commits(writer));
     -	hashcpy(header.checksum, writer->pack_checksum);
     +	hashcpy(header.checksum, writer->pack_checksum, the_repository->hash_algo);
      
        hashwrite(f, &header, sizeof(header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz);
        dump_bitmap(f, writer->commits);
    diff --git a/pseudo-merge.c b/pseudo-merge.c
    index e3e0393f11..f0fde13c47 100644
    --- a/pseudo-merge.c
    +++ b/pseudo-merge.c
    @@ -1,3 +1,5 @@
    +#define USE_THE_REPOSITORY_VARIABLE
    +
     #include "git-compat-util.h"
     #include "pseudo-merge.h"
     #include "date.h"
    diff --git a/t/unit-tests/lib-oid.h b/t/unit-tests/lib-oid.h
    index bfde639190..8d2acca768 100644
    --- a/t/unit-tests/lib-oid.h
    +++ b/t/unit-tests/lib-oid.h
    @@ -1,7 +1,7 @@
     #ifndef LIB_OID_H
     #define LIB_OID_H
     
    -#include "hash-ll.h"
    +#include "hash.h"
     
     /*
      * Convert arbitrary hex string to object_id.
    diff --git a/t/unit-tests/t-example-decorate.c b/t/unit-tests/t-example-decorate.c
    index 3c856a8cf2..a4a75db735 100644
    --- a/t/unit-tests/t-example-decorate.c
    +++ b/t/unit-tests/t-example-decorate.c
    @@ -1,3 +1,5 @@
    +#define USE_THE_REPOSITORY_VARIABLE
    +
     #include "test-lib.h"
     #include "object.h"
     #include "decorate.h"

Patrick Steinhardt (21):
  hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
  hash: require hash algorithm in `hasheq()`, `hashcmp()` and
    `hashclr()`
  hash: require hash algorithm in `oidread()` and `oidclr()`
  global: ensure that object IDs are always padded
  hash: convert `oidcmp()` and `oideq()` to compare whole hash
  hash: make `is_null_oid()` independent of `the_repository`
  hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
  hash: require hash algorithm in `empty_tree_oid_hex()`
  global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
  refs: avoid include cycle with "repository.h"
  hash-ll: merge with "hash.h"
  http-fetch: don't crash when parsing packfile without a repo
  oidset: pass hash algorithm when parsing file
  protocol-caps: use hash algorithm from passed-in repository
  replace-object: use hash algorithm from passed-in repository
  compat/fsmonitor: remove dependency on `the_repository` in Darwin IPC
  t/helper: use correct object hash in partial-clone helper
  t/helper: fix segfault in "oid-array" command without repository
  t/helper: remove dependency on `the_repository` in "oidtree"
  t/helper: remove dependency on `the_repository` in "proc-receive"
  hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`

 add-interactive.c                            |   4 +-
 add-patch.c                                  |   4 +-
 apply.c                                      |   4 +-
 apply.h                                      |   2 +-
 archive-tar.c                                |   3 +
 archive-zip.c                                |   3 +
 archive.c                                    |   2 +
 attr.c                                       |   2 +
 bisect.c                                     |   2 +
 blame.c                                      |   4 +-
 bloom.c                                      |   1 +
 branch.c                                     |   2 +
 builtin.h                                    |   8 +
 builtin/am.c                                 |   8 +-
 builtin/blame.c                              |   3 +-
 builtin/fast-export.c                        |   2 +-
 builtin/fast-import.c                        |  43 ++-
 builtin/fetch-pack.c                         |   4 +-
 builtin/index-pack.c                         |  11 +-
 builtin/log.c                                |   4 +-
 builtin/merge.c                              |   7 +-
 builtin/notes.c                              |   2 +-
 builtin/pack-objects.c                       |   3 +-
 builtin/pack-redundant.c                     |  10 +-
 builtin/patch-id.c                           |   6 +-
 builtin/pull.c                               |   6 +-
 builtin/receive-pack.c                       |   4 +-
 builtin/replace.c                            |   2 +-
 builtin/rm.c                                 |   2 +-
 builtin/tag.c                                |   2 +-
 builtin/unpack-objects.c                     |   9 +-
 builtin/update-ref.c                         |   8 +-
 bulk-checkin.c                               |   3 +
 bundle-uri.c                                 |   2 +
 bundle.c                                     |   2 +
 cache-tree.c                                 |   7 +-
 checkout.c                                   |   2 +
 checkout.h                                   |   2 +-
 chunk-format.c                               |   2 +
 chunk-format.h                               |   2 +-
 combine-diff.c                               |   2 +
 commit-graph.c                               |  22 +-
 commit-graph.h                               |   2 +
 commit-reach.c                               |   2 +
 commit.c                                     |   2 +
 common-main.c                                |   2 +
 compat/fsmonitor/fsm-ipc-darwin.c            |   5 +-
 compat/sha1-chunked.c                        |   2 +-
 compat/win32/trace2_win32_process_info.c     |   2 +
 config.c                                     |   3 +
 connected.c                                  |   2 +
 convert.c                                    |   2 +
 convert.h                                    |   2 +-
 csum-file.c                                  |   9 +-
 csum-file.h                                  |   2 +-
 delta-islands.c                              |   2 +
 diagnose.c                                   |   2 +
 diff-lib.c                                   |   7 +-
 diff.c                                       |   9 +-
 diff.h                                       |   2 +-
 diffcore-break.c                             |   3 +
 diffcore-rename.c                            |   7 +-
 diffcore.h                                   |   2 +-
 dir.c                                        |   9 +-
 dir.h                                        |   2 +-
 entry.c                                      |   2 +
 environment.c                                |   3 +
 fetch-pack.c                                 |   2 +
 fmt-merge-msg.c                              |   2 +
 fsck.c                                       |   5 +-
 fsmonitor-ipc.c                              |   2 +
 git.c                                        |   2 +
 hash-ll.h                                    | 310 ----------------
 hash-lookup.c                                |   5 +-
 hash.h                                       | 366 ++++++++++++++++---
 help.c                                       |   2 +
 hex.c                                        |   8 +-
 hex.h                                        |  28 +-
 http-backend.c                               |   2 +
 http-fetch.c                                 |   8 +-
 http-push.c                                  |   5 +-
 http-walker.c                                |   6 +-
 http.c                                       |   2 +
 list-objects-filter-options.c                |   2 +
 list-objects-filter.c                        |   2 +
 list-objects.c                               |   2 +
 log-tree.c                                   |   2 +
 loose.c                                      |   2 +
 loose.h                                      |   2 +
 ls-refs.c                                    |   2 +
 mailmap.c                                    |   2 +
 match-trees.c                                |   6 +-
 merge-blobs.c                                |   2 +
 merge-ort.c                                  |   2 +
 merge-ort.h                                  |   2 +-
 merge-recursive.c                            |   3 +
 merge.c                                      |   2 +
 midx-write.c                                 |   2 +
 midx.c                                       |   5 +-
 negotiator/default.c                         |   2 +
 negotiator/skipping.c                        |   2 +
 notes-cache.c                                |   2 +
 notes-merge.c                                |   8 +-
 notes-utils.c                                |   2 +
 notes.c                                      |  14 +-
 object-file-convert.c                        |   6 +-
 object-file.c                                |  19 +-
 object-name.c                                |   2 +
 object.c                                     |   2 +
 object.h                                     |   2 +-
 oid-array.c                                  |   2 +
 oidmap.h                                     |   2 +-
 oidset.c                                     |   8 +-
 oidset.h                                     |   4 +-
 oidtree.c                                    |   4 +-
 oidtree.h                                    |   2 +-
 oss-fuzz/fuzz-commit-graph.c                 |   2 +
 pack-bitmap-write.c                          |   6 +-
 pack-bitmap.c                                |   5 +-
 pack-check.c                                 |   7 +-
 pack-revindex.c                              |   2 +
 pack-write.c                                 |   5 +-
 packfile.c                                   |  20 +-
 packfile.h                                   |   2 +
 parallel-checkout.c                          |   8 +-
 parse-options-cb.c                           |   2 +
 path.c                                       |   3 +
 pathspec.c                                   |   2 +
 pretty.c                                     |   2 +
 progress.c                                   |   2 +
 promisor-remote.c                            |   2 +
 protocol-caps.c                              |   5 +-
 range-diff.c                                 |   2 +
 reachable.c                                  |   2 +
 read-cache-ll.h                              |   2 +-
 read-cache.c                                 |  21 +-
 rebase-interactive.c                         |   2 +
 ref-filter.c                                 |   2 +
 reflog-walk.c                                |   2 +
 reflog.c                                     |   2 +
 refs.c                                       |   8 +-
 refs.h                                       |   8 +-
 refs/files-backend.c                         |   8 +-
 refs/packed-backend.c                        |   8 +-
 refs/ref-cache.h                             |   2 +-
 refs/reftable-backend.c                      |  39 +-
 refspec.c                                    |   2 +
 reftable/dump.c                              |   2 +-
 reftable/reftable-record.h                   |   2 +-
 reftable/system.h                            |   2 +-
 remote-curl.c                                |   2 +
 remote.c                                     |  10 +-
 remote.h                                     |   2 +-
 replace-object.c                             |   2 +-
 repository.h                                 |   9 +-
 rerere.c                                     |   2 +
 reset.c                                      |   2 +
 reset.h                                      |   2 +-
 resolve-undo.c                               |   5 +-
 resolve-undo.h                               |   2 +-
 revision.c                                   |   2 +
 run-command.c                                |   2 +
 scalar.c                                     |   2 +
 send-pack.c                                  |   2 +
 sequencer.c                                  |   8 +-
 serve.c                                      |   4 +-
 server-info.c                                |   2 +
 setup.c                                      |   2 +
 shallow.c                                    |   2 +
 split-index.c                                |   4 +-
 split-index.h                                |   2 +-
 streaming.c                                  |   3 +
 submodule-config.c                           |   4 +-
 submodule.c                                  |   8 +-
 t/helper/test-bitmap.c                       |   2 +
 t/helper/test-bloom.c                        |   2 +
 t/helper/test-cache-tree.c                   |   2 +
 t/helper/test-dump-cache-tree.c              |   2 +
 t/helper/test-dump-fsmonitor.c               |   2 +
 t/helper/test-dump-split-index.c             |   2 +
 t/helper/test-dump-untracked-cache.c         |   2 +
 t/helper/test-example-decorate.c             |   2 +
 t/helper/test-find-pack.c                    |   2 +
 t/helper/test-fsmonitor-client.c             |   2 +
 t/helper/test-hash-speed.c                   |   2 +-
 t/helper/test-lazy-init-name-hash.c          |   2 +
 t/helper/test-match-trees.c                  |   2 +
 t/helper/test-oid-array.c                    |   4 +
 t/helper/test-oidmap.c                       |   2 +
 t/helper/test-oidtree.c                      |   5 +-
 t/helper/test-pack-mtimes.c                  |   2 +
 t/helper/test-partial-clone.c                |   2 +-
 t/helper/test-proc-receive.c                 |   9 +-
 t/helper/test-reach.c                        |   2 +
 t/helper/test-read-cache.c                   |   2 +
 t/helper/test-read-graph.c                   |   2 +
 t/helper/test-read-midx.c                    |   2 +
 t/helper/test-ref-store.c                    |   2 +
 t/helper/test-repository.c                   |   2 +
 t/helper/test-revision-walking.c             |   2 +
 t/helper/test-scrap-cache-tree.c             |   2 +
 t/helper/test-sha1.c                         |   2 +-
 t/helper/test-sha256.c                       |   2 +-
 t/helper/test-submodule-config.c             |   4 +-
 t/helper/test-submodule-nested-repo-config.c |   2 +
 t/helper/test-submodule.c                    |   2 +
 t/helper/test-trace2.c                       |   2 +
 t/helper/test-write-cache.c                  |   2 +
 t/t0064-oid-array.sh                         |  18 +
 t/t5550-http-fetch-dumb.sh                   |   6 +
 t/test-lib-functions.sh                      |   5 +
 tag.c                                        |   2 +
 tmp-objdir.c                                 |   2 +
 transport-helper.c                           |   2 +
 transport.c                                  |   2 +
 tree-diff.c                                  |   1 +
 tree-walk.c                                  |   6 +-
 tree-walk.h                                  |   2 +-
 tree.c                                       |   2 +
 unpack-trees.c                               |   2 +
 upload-pack.c                                |   2 +
 walker.c                                     |   2 +
 worktree.c                                   |   2 +
 wt-status.c                                  |   6 +-
 xdiff-interface.c                            |   2 +
 xdiff-interface.h                            |   2 +-
 226 files changed, 993 insertions(+), 619 deletions(-)
 delete mode 100644 hash-ll.h

Comments

brian m. carlson June 11, 2024, 11:24 p.m. UTC | #1
On 2024-06-11 at 11:57:33, Patrick Steinhardt wrote:
> Hi,
> 
> use of the `the_repository` variable is nowadays considered to be
> deprecated, and over time we want to convert our codebase to stop using
> it in favor of explicitly passing down the repository to functions via
> parameters. This effort faces some important problems though.
> 
>   - It is hard to prove that a certain code unit does not use
>     `the_repository` anymore when sending patch series. The reviewer has
>     no way to verify that it's not used anymore without reading through
>     the code itself.
> 
>   - It is easy to sneak in new usages of `the_repository` by accident
>     into a code unit that is already `the_repository`-clean.
> 
>   - There are many functions which implicitly use `the_repository`,
>     which is really hard to spot.
> 
> This patch series aims to address those problems by introducing a new
> `USE_THE_REPOSITORY_VARIABLE` macro. When unset, then the declarations
> of `the_repository`, `the_hash_algo` and some functions that implicitly
> depend on them will be hidden away. This makes it trivial to demonstrate
> that a code unit is `the_repository`-free by removing the definition of
> any such macro.

Overall, I left a few comments, but I think this definitely moves us in
the right direction and I'm glad to see it.  This obviously improves the
experience with libification and unit testing in a lot of ways, which is
good.

My only caution is that using the *_any functions will cause us a world
of pain if we ever adopt another 256-bit hash function, since it will be
ambiguous which algorithm is to be used.  That's why, traditionally, we
haven't assumed a hash algorithm based on the object ID length.  I don't
think the amount of uses we have is excessive, even with your changes,
but we'll need to be mindful of that going forward.
Patrick Steinhardt June 12, 2024, 7:37 a.m. UTC | #2
On Tue, Jun 11, 2024 at 11:24:30PM +0000, brian m. carlson wrote:
> On 2024-06-11 at 11:57:33, Patrick Steinhardt wrote:
> > Hi,
> > 
> > use of the `the_repository` variable is nowadays considered to be
> > deprecated, and over time we want to convert our codebase to stop using
> > it in favor of explicitly passing down the repository to functions via
> > parameters. This effort faces some important problems though.
> > 
> >   - It is hard to prove that a certain code unit does not use
> >     `the_repository` anymore when sending patch series. The reviewer has
> >     no way to verify that it's not used anymore without reading through
> >     the code itself.
> > 
> >   - It is easy to sneak in new usages of `the_repository` by accident
> >     into a code unit that is already `the_repository`-clean.
> > 
> >   - There are many functions which implicitly use `the_repository`,
> >     which is really hard to spot.
> > 
> > This patch series aims to address those problems by introducing a new
> > `USE_THE_REPOSITORY_VARIABLE` macro. When unset, then the declarations
> > of `the_repository`, `the_hash_algo` and some functions that implicitly
> > depend on them will be hidden away. This makes it trivial to demonstrate
> > that a code unit is `the_repository`-free by removing the definition of
> > any such macro.
> 
> Overall, I left a few comments, but I think this definitely moves us in
> the right direction and I'm glad to see it.  This obviously improves the
> experience with libification and unit testing in a lot of ways, which is
> good.
> 
> My only caution is that using the *_any functions will cause us a world
> of pain if we ever adopt another 256-bit hash function, since it will be
> ambiguous which algorithm is to be used.  That's why, traditionally, we
> haven't assumed a hash algorithm based on the object ID length.  I don't
> think the amount of uses we have is excessive, even with your changes,
> but we'll need to be mindful of that going forward.

The only cases where I add new calls to `_any()` are in test helpers:

  - "t/helper/test-oidtree.c". This one is getting converted to a unit
    test by Ghanshyam, so I'll leave it to him to improve this.

  - "t/helper/test-proc-receive.c". Here we don't care about the actual
    algorithm, the only thing we care about is that we can correctly
    parse them and then eventually emit them via `oid_to_hex()` again.
    So even if we introduce a second hash function with the same length
    this code would continue to work alright.

So I think it should be fine in the context of this series. But the
remark is certainly valid and something we should be cautious about
going forward.

Thanks!

Patrick
Ghanshyam Thakkar June 12, 2024, 10:12 a.m. UTC | #3
On Wed, 12 Jun 2024, Patrick Steinhardt <ps@pks.im> wrote:
> On Tue, Jun 11, 2024 at 11:24:30PM +0000, brian m. carlson wrote:
> > On 2024-06-11 at 11:57:33, Patrick Steinhardt wrote:
> > > Hi,
> > > 
> > > use of the `the_repository` variable is nowadays considered to be
> > > deprecated, and over time we want to convert our codebase to stop using
> > > it in favor of explicitly passing down the repository to functions via
> > > parameters. This effort faces some important problems though.
> > > 
> > >   - It is hard to prove that a certain code unit does not use
> > >     `the_repository` anymore when sending patch series. The reviewer has
> > >     no way to verify that it's not used anymore without reading through
> > >     the code itself.
> > > 
> > >   - It is easy to sneak in new usages of `the_repository` by accident
> > >     into a code unit that is already `the_repository`-clean.
> > > 
> > >   - There are many functions which implicitly use `the_repository`,
> > >     which is really hard to spot.
> > > 
> > > This patch series aims to address those problems by introducing a new
> > > `USE_THE_REPOSITORY_VARIABLE` macro. When unset, then the declarations
> > > of `the_repository`, `the_hash_algo` and some functions that implicitly
> > > depend on them will be hidden away. This makes it trivial to demonstrate
> > > that a code unit is `the_repository`-free by removing the definition of
> > > any such macro.
> > 
> > Overall, I left a few comments, but I think this definitely moves us in
> > the right direction and I'm glad to see it.  This obviously improves the
> > experience with libification and unit testing in a lot of ways, which is
> > good.
> > 
> > My only caution is that using the *_any functions will cause us a world
> > of pain if we ever adopt another 256-bit hash function, since it will be
> > ambiguous which algorithm is to be used.  That's why, traditionally, we
> > haven't assumed a hash algorithm based on the object ID length.  I don't
> > think the amount of uses we have is excessive, even with your changes,
> > but we'll need to be mindful of that going forward.
> 
> The only cases where I add new calls to `_any()` are in test helpers:
> 
>   - "t/helper/test-oidtree.c". This one is getting converted to a unit
>     test by Ghanshyam, so I'll leave it to him to improve this.

Yeah, I don't use '_any()', and explicitly give algo using '_algop()'.

link:https://lore.kernel.org/git/20240608165731.29467-1-shyamthakkar001@gmail.com/

Thanks.

>   - "t/helper/test-proc-receive.c". Here we don't care about the actual
>     algorithm, the only thing we care about is that we can correctly
>     parse them and then eventually emit them via `oid_to_hex()` again.
>     So even if we introduce a second hash function with the same length
>     this code would continue to work alright.
> 
> So I think it should be fine in the context of this series. But the
> remark is certainly valid and something we should be cautious about
> going forward.
> 
> Thanks!
> 
> Patrick