mbox series

[00/20] packfile: avoid using the 'the_repository' global variable

Message ID cover.1729504640.git.karthik.188@gmail.com (mailing list archive)
Headers show
Series packfile: avoid using the 'the_repository' global variable | expand

Message

karthik nayak Oct. 21, 2024, 9:57 a.m. UTC
The `packfile.c` file uses the global variable 'the_repository' extensively
throughout the code. Let's remove all usecases of this, by modifying the
required functions to accept a 'struct repository' instead. This is to clean up
usage of global state.

The first 18 patches are mostly passing a `struct repository` to each of the
functions within `packfile.c` from other files. The last two patches move some
global config variables and make them local. I'm not too well versed with this
section of the code, so would be nice to get some eyes here.

Karthik Nayak (20):
  packfile: pass down repository to `odb_pack_name`
  packfile: pass down repository to `unuse_one_window`
  packfile: pass down repository to `close_one_pack`
  packfile: pass down repository to `add_packed_git`
  packfile: pass down repository to `unpack_object_header`
  packfile: pass down repository to `get_delta_base`
  packfile: use provided repository in `packed_object_info`
  packfile: pass down repository to `unpack_compressed_entry`
  packfile: pass down repository to `nth_packed_object_id`
  packfile: pass down repository to `find_pack_entry_one`
  packfile: pass down repository to `fill_pack_entry`
  packfile: pass down repository to `has_object[_kept]_pack`
  packfile: pass down repository to `for_each_packed_object`
  packfile: pass down repository to `is_promisor_object`
  object-store: pass down repository to `each_packed_object_fn`
  packfile: pass down repository to `open_pack_index`
  packfile: stop using 'the_hash_algo'
  packfile: pass down repository to `nth_packed_object_offset`
  config: make `delta_base_cache_limit` a non-global variable
  config: make `packed_git_(limit|window_size)` non-global variables

 builtin/cat-file.c          |  17 +-
 builtin/count-objects.c     |   4 +-
 builtin/fast-import.c       |  18 +-
 builtin/fsck.c              |  30 +--
 builtin/gc.c                |   5 +-
 builtin/index-pack.c        |  22 ++-
 builtin/pack-objects.c      |  67 ++++---
 builtin/pack-redundant.c    |   4 +-
 builtin/repack.c            |   9 +-
 builtin/rev-list.c          |   2 +-
 commit-graph.c              |  15 +-
 config.c                    |  22 ---
 connected.c                 |   7 +-
 diff.c                      |   3 +-
 environment.c               |   3 -
 environment.h               |   1 -
 fsck.c                      |   2 +-
 http-push.c                 |   5 +-
 http-walker.c               |   2 +-
 http.c                      |  15 +-
 list-objects.c              |   7 +-
 midx-write.c                |  16 +-
 midx.c                      |   8 +-
 object-name.c               |  16 +-
 object-store-ll.h           |  10 +-
 pack-bitmap.c               |  23 ++-
 pack-check.c                |  17 +-
 pack-mtimes.c               |   4 +-
 pack-mtimes.h               |   3 +-
 pack-objects.h              |   3 +-
 pack-revindex.c             |   7 +-
 pack-write.c                |   1 +
 pack.h                      |   1 +
 packfile.c                  | 376 +++++++++++++++++++++---------------
 packfile.h                  |  63 +++---
 promisor-remote.c           |   2 +-
 prune-packed.c              |   2 +-
 reachable.c                 |  14 +-
 revision.c                  |  15 +-
 streaming.c                 |   6 +-
 t/helper/test-find-pack.c   |   2 +-
 t/helper/test-pack-mtimes.c |   4 +-
 tag.c                       |   2 +-
 43 files changed, 482 insertions(+), 373 deletions(-)

Comments

Taylor Blau Oct. 21, 2024, 9:03 p.m. UTC | #1
On Mon, Oct 21, 2024 at 11:57:43AM +0200, Karthik Nayak wrote:
> The `packfile.c` file uses the global variable 'the_repository' extensively
> throughout the code. Let's remove all usecases of this, by modifying the
> required functions to accept a 'struct repository' instead. This is to clean up
> usage of global state.
>
> The first 18 patches are mostly passing a `struct repository` to each of the
> functions within `packfile.c` from other files. The last two patches move some
> global config variables and make them local. I'm not too well versed with this
> section of the code, so would be nice to get some eyes here.

I agree with the goal of this series, but I worry that as written it
will be quite disruptive to other topics on the list.

The standard way to avoid this disruption is to, for e.g. the first
change, do the following:

  - Introduce a new function repo_odb_pack_name() that takes in a
    'struct repository *', and rewrite odb_pack_name() in terms of it
    (passing 'the_repository' in as the argument).

  - Write a Coccinelle rule to replace all calls to odb_pack_name()
    with calls to repo_odb_pack_name().

  - Submit those patches without adjusting any non-obvious callers or
    ones that are not contained to a single compilation unit that you
    are already touching.

  - Wait until a new development cycle has begun, run spatch on the new
    rule to replace all other calls. Then optionally rename
    repo_odb_pack_name() to odb_pack_name().

I think Patrick (CC'd) has done one of these transitions recently, so
I'll defer to him in case I got any of the details wrong.

In the meantime, I'm going to hold this one out of seen as it may be
disruptive in the current state.

Thanks,
Taylor
karthik nayak Oct. 27, 2024, 9:23 p.m. UTC | #2
Taylor Blau <me@ttaylorr.com> writes:

> On Mon, Oct 21, 2024 at 11:57:43AM +0200, Karthik Nayak wrote:
>> The `packfile.c` file uses the global variable 'the_repository' extensively
>> throughout the code. Let's remove all usecases of this, by modifying the
>> required functions to accept a 'struct repository' instead. This is to clean up
>> usage of global state.
>>
>> The first 18 patches are mostly passing a `struct repository` to each of the
>> functions within `packfile.c` from other files. The last two patches move some
>> global config variables and make them local. I'm not too well versed with this
>> section of the code, so would be nice to get some eyes here.
>
> I agree with the goal of this series, but I worry that as written it
> will be quite disruptive to other topics on the list.
>

I agree, that as it currently sits, this is very disruptive.

> The standard way to avoid this disruption is to, for e.g. the first
> change, do the following:
>
>   - Introduce a new function repo_odb_pack_name() that takes in a
>     'struct repository *', and rewrite odb_pack_name() in terms of it
>     (passing 'the_repository' in as the argument).
>
>   - Write a Coccinelle rule to replace all calls to odb_pack_name()
>     with calls to repo_odb_pack_name().
>
>   - Submit those patches without adjusting any non-obvious callers or
>     ones that are not contained to a single compilation unit that you
>     are already touching.
>
>   - Wait until a new development cycle has begun, run spatch on the new
>     rule to replace all other calls. Then optionally rename
>     repo_odb_pack_name() to odb_pack_name().
>
> I think Patrick (CC'd) has done one of these transitions recently, so
> I'll defer to him in case I got any of the details wrong.
>
> In the meantime, I'm going to hold this one out of seen as it may be
> disruptive in the current state.
>
> Thanks,
> Taylor

While thinking about this over the last few days and also getting some
advice from Patrick, I realized that we don't need to be this disruptive
by simply adding the 'repository' variable to the already existing
'packed_git' struct. This allows us to leverage this information more
easily, since most of the functions already have access to the
'packed_git' struct.

This, plus the series by Jeff 'jk/dumb-http-finalize' which also removes
some existing functions. We reduce the impact to only 3 functions being
modified.

I think with such low impact, it might make more sense to not go with
the Coccinelle approach, since it is a lot simpler without it.

I'll post a new version tomorrow showcasing this approach, but I'll
leave the final decision to you whether it is still disruptive, and if
the approach you mentioned would be better.

Thanks
Taylor Blau Oct. 27, 2024, 11:54 p.m. UTC | #3
On Sun, Oct 27, 2024 at 05:23:24PM -0400, karthik nayak wrote:
> While thinking about this over the last few days and also getting some
> advice from Patrick, I realized that we don't need to be this disruptive
> by simply adding the 'repository' variable to the already existing
> 'packed_git' struct. This allows us to leverage this information more
> easily, since most of the functions already have access to the
> 'packed_git' struct.

Great idea!

> This, plus the series by Jeff 'jk/dumb-http-finalize' which also removes
> some existing functions. We reduce the impact to only 3 functions being
> modified.
>
> I think with such low impact, it might make more sense to not go with
> the Coccinelle approach, since it is a lot simpler without it.
>
> I'll post a new version tomorrow showcasing this approach, but I'll
> leave the final decision to you whether it is still disruptive, and if
> the approach you mentioned would be better.

I'll have to see the end result to know for sure, but it sounds like
this would be a good way to move it forward without being too
disruptive.

I wonder how it interacts with alternates, though...

Thanks,
Taylor
Jeff King Oct. 28, 2024, 5:31 a.m. UTC | #4
On Sun, Oct 27, 2024 at 05:23:24PM -0400, karthik nayak wrote:

> While thinking about this over the last few days and also getting some
> advice from Patrick, I realized that we don't need to be this disruptive
> by simply adding the 'repository' variable to the already existing
> 'packed_git' struct. This allows us to leverage this information more
> easily, since most of the functions already have access to the
> 'packed_git' struct.
> 
> This, plus the series by Jeff 'jk/dumb-http-finalize' which also removes
> some existing functions. We reduce the impact to only 3 functions being
> modified.

Yeah, I noticed while working on that topic that we were dropping some
uses of the_repository. And FWIW I had the same notion, that packed_git
should perhaps refer to the repository struct in which it resides.

As Taylor noted this is a tiny bit weird with respect to alternates,
which could exist in another repository (but don't have to! It could be
a bare objects/ directory). But I think from the perspective of a
particular process, we only have one repository struct that covers all
of its alternates for the duration of this process. So it would be OK in
practice. You might be able to get away with just storing a hash_algo
pointer in packed_git, which would be less weird (and is enough for the
bits I looked at, but perhaps not in the more general case).

Looking at odb_pack_name(), it will still need to take a repository
struct, since we sometimes form it before having a packed_git. But for
most calls, I suspect you could have an alternate function that takes a
packed_git and uses both its "hash" member and the algorithm.

Anyway, just my two cents having worked in the area recently.

-Peff
karthik nayak Oct. 28, 2024, 1:36 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

> On Sun, Oct 27, 2024 at 05:23:24PM -0400, karthik nayak wrote:
>
>> While thinking about this over the last few days and also getting some
>> advice from Patrick, I realized that we don't need to be this disruptive
>> by simply adding the 'repository' variable to the already existing
>> 'packed_git' struct. This allows us to leverage this information more
>> easily, since most of the functions already have access to the
>> 'packed_git' struct.
>>
>> This, plus the series by Jeff 'jk/dumb-http-finalize' which also removes
>> some existing functions. We reduce the impact to only 3 functions being
>> modified.
>
> Yeah, I noticed while working on that topic that we were dropping some
> uses of the_repository. And FWIW I had the same notion, that packed_git
> should perhaps refer to the repository struct in which it resides.
>
> As Taylor noted this is a tiny bit weird with respect to alternates,
> which could exist in another repository (but don't have to! It could be
> a bare objects/ directory). But I think from the perspective of a
> particular process, we only have one repository struct that covers all
> of its alternates for the duration of this process. So it would be OK in
> practice. You might be able to get away with just storing a hash_algo
> pointer in packed_git, which would be less weird (and is enough for the
> bits I looked at, but perhaps not in the more general case).
>

This was my thought as well regarding alternates. Also it should be
noted that currently we're using the_repository anyways, so we will be
in the same state as before.

In a general case it seems more necessary to add the repo and not just
the hash_algo. Mostly because there are parts which require access to
the repository and also because some of my patches add config changes
which also require access to the repository.

> Looking at odb_pack_name(), it will still need to take a repository
> struct, since we sometimes form it before having a packed_git. But for
> most calls, I suspect you could have an alternate function that takes a
> packed_git and uses both its "hash" member and the algorithm.
>

I have four functions which would still need to take in a repository:
1. for_each_packed_object
2. has_object_pack
3. has_object_kept_pack
4. obd_pack_name

> Anyway, just my two cents having worked in the area recently.
>
> -Peff

Thanks for your input!
Taylor Blau Oct. 28, 2024, 3:21 p.m. UTC | #6
On Mon, Oct 28, 2024 at 08:36:50AM -0500, karthik nayak wrote:
> Jeff King <peff@peff.net> writes:
>
> > On Sun, Oct 27, 2024 at 05:23:24PM -0400, karthik nayak wrote:
> >
> >> While thinking about this over the last few days and also getting some
> >> advice from Patrick, I realized that we don't need to be this disruptive
> >> by simply adding the 'repository' variable to the already existing
> >> 'packed_git' struct. This allows us to leverage this information more
> >> easily, since most of the functions already have access to the
> >> 'packed_git' struct.
> >>
> >> This, plus the series by Jeff 'jk/dumb-http-finalize' which also removes
> >> some existing functions. We reduce the impact to only 3 functions being
> >> modified.
> >
> > Yeah, I noticed while working on that topic that we were dropping some
> > uses of the_repository. And FWIW I had the same notion, that packed_git
> > should perhaps refer to the repository struct in which it resides.
> >
> > As Taylor noted this is a tiny bit weird with respect to alternates,
> > which could exist in another repository (but don't have to! It could be
> > a bare objects/ directory). But I think from the perspective of a
> > particular process, we only have one repository struct that covers all
> > of its alternates for the duration of this process. So it would be OK in
> > practice. You might be able to get away with just storing a hash_algo
> > pointer in packed_git, which would be less weird (and is enough for the
> > bits I looked at, but perhaps not in the more general case).
>
> This was my thought as well regarding alternates. Also it should be
> noted that currently we're using the_repository anyways, so we will be
> in the same state as before.

Makes sense. Thanks, both, for thinking through it together.

> In a general case it seems more necessary to add the repo and not just
> the hash_algo. Mostly because there are parts which require access to
> the repository and also because some of my patches add config changes
> which also require access to the repository.

I could believe that ;-). I see that you posted some patches lower down
in the thread, which I figure probably uncover some cases where we need
more than just a pointer to the_hash_algo.

But let's read on and see exactly what shakes out.

> > Looking at odb_pack_name(), it will still need to take a repository
> > struct, since we sometimes form it before having a packed_git. But for
> > most calls, I suspect you could have an alternate function that takes a
> > packed_git and uses both its "hash" member and the algorithm.
>
> I have four functions which would still need to take in a repository:
> 1. for_each_packed_object
> 2. has_object_pack
> 3. has_object_kept_pack
> 4. obd_pack_name

That matches my own thinking, but perhaps there are others that neither
of us are coming up with off the tops of our heads. I think that as long
as for_each_packed_object() continues to call prepare_packed_git (which
sets up all of our alternates) and we continue to consult the
packed_git_mru cache, we should be OK.

> > Anyway, just my two cents having worked in the area recently.
> >
> > -Peff
>
> Thanks for your input!

Indeed. Thanks, both.

Thanks,
Taylor