mbox series

[0/4] Check .gitmodules when using packfile URIs

Message ID cover.1611455251.git.jonathantanmy@google.com (mailing list archive)
Headers show
Series Check .gitmodules when using packfile URIs | expand

Message

Jonathan Tan Jan. 24, 2021, 2:34 a.m. UTC
This patch set resolves the .gitmodules-and-tree-in-separate-packfiles
issue I mentioned in [1] by having index-pack print out all dangling
.gitmodules (instead of returning with an error code) and then teaching
fetch-pack to read those and run its own fsck checks after all
index-pack invocations are complete.

As part of this, index-pack has to output (1) the hash that goes into
the name of the .pack/.idx file and (2) the hashes of all dangling
.gitmodules. I just had (2) come after (1). If anyone has a better idea,
I'm interested.

I also discovered a bug in that different index-pack arguments were used
when processing the inline packfile and when processing the ones
referenced by URIs. Patch 1-3 fixes that bug by passing the arguments to
use as a space-separated URL-encoded list. (URL-encoded so that we can
have spaces in the arguments.) Again, if anyone has a better idea, I'm
interested. It is only in patch 4 that we have the dangling .gitmodules
fix.

[1] https://lore.kernel.org/git/20210115234300.350442-1-jonathantanmy@google.com/

Jonathan Tan (4):
  http: allow custom index-pack args
  http-fetch: allow custom index-pack args
  fetch-pack: with packfile URIs, use index-pack arg
  fetch-pack: print and use dangling .gitmodules

 Documentation/git-http-fetch.txt |   9 ++-
 Documentation/git-index-pack.txt |   7 +-
 builtin/index-pack.c             |   9 ++-
 builtin/receive-pack.c           |   2 +-
 fetch-pack.c                     | 106 ++++++++++++++++++++++++++-----
 fsck.c                           |  16 +++--
 fsck.h                           |   8 +++
 http-fetch.c                     |  35 +++++++++-
 http.c                           |  15 +++--
 http.h                           |  10 +--
 pack-write.c                     |   8 ++-
 pack.h                           |   2 +-
 t/t5550-http-fetch-dumb.sh       |   3 +-
 t/t5702-protocol-v2.sh           |  47 ++++++++++++++
 14 files changed, 232 insertions(+), 45 deletions(-)

Comments

Junio C Hamano Jan. 24, 2021, 6:29 a.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> As part of this, index-pack has to output (1) the hash that goes into
> the name of the .pack/.idx file and (2) the hashes of all dangling
> .gitmodules. I just had (2) come after (1). If anyone has a better idea,
> I'm interested.

I have this feeling that the "blobs that need to be validated across
packs" will *not* be the last enhancement we'd need to make to the
output from index-pack to allow richer communication between it and
its invoker.  While there is no reason to change how the first line
of the output looks like, we'd probably want to make sure that the
future versions of Git can easily tell "list of blobs that require
further validation" from other additional information.

I am not comfortable to recommend "ok, then let's add a delimiter
line '---\n' if/when we need to have something after the list of
blobs and append more stuff in future versions of Git", because we
may find need to emit new kinds of info before the list of blobs
that needs further validation, for example, in future versions of
Git.

Having said all that, the internal communication between the
index-pack and its caller do not need as much care about
compatibility across versions as output visible to end-users, so
when a future version of Git needs to send different kinds of
information in different order from what you created here, we can do
so pretty much freely, I would guess.

Thanks.
Jonathan Tan Jan. 28, 2021, 12:35 a.m. UTC | #2
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > As part of this, index-pack has to output (1) the hash that goes into
> > the name of the .pack/.idx file and (2) the hashes of all dangling
> > .gitmodules. I just had (2) come after (1). If anyone has a better idea,
> > I'm interested.
> 
> I have this feeling that the "blobs that need to be validated across
> packs" will *not* be the last enhancement we'd need to make to the
> output from index-pack to allow richer communication between it and
> its invoker.  While there is no reason to change how the first line
> of the output looks like, we'd probably want to make sure that the
> future versions of Git can easily tell "list of blobs that require
> further validation" from other additional information.
> 
> I am not comfortable to recommend "ok, then let's add a delimiter
> line '---\n' if/when we need to have something after the list of
> blobs and append more stuff in future versions of Git", because we
> may find need to emit new kinds of info before the list of blobs
> that needs further validation, for example, in future versions of
> Git.
> 
> Having said all that, the internal communication between the
> index-pack and its caller do not need as much care about
> compatibility across versions as output visible to end-users, so
> when a future version of Git needs to send different kinds of
> information in different order from what you created here, we can do
> so pretty much freely, I would guess.

Yeah, that's what I thought too - since this is an internal interface,
we can evolve them in lockstep. If we're really worried about the Git
binaries (on a user's system) getting out of sync, we could just make
sure that subsequent updates to this protocol are
non-backwards-compatible (e.g. have index-pack emit "foo <hash>", where
"foo" is a string that describes the new check, so that current
fetch-pack will reject "foo" since it is not a hash).
Ævar Arnfjörð Bjarmason Feb. 18, 2021, 11:31 a.m. UTC | #3
On Thu, Jan 28 2021, Jonathan Tan wrote:

>> Jonathan Tan <jonathantanmy@google.com> writes:
>> 
>> > As part of this, index-pack has to output (1) the hash that goes into
>> > the name of the .pack/.idx file and (2) the hashes of all dangling
>> > .gitmodules. I just had (2) come after (1). If anyone has a better idea,
>> > I'm interested.
>> 
>> I have this feeling that the "blobs that need to be validated across
>> packs" will *not* be the last enhancement we'd need to make to the
>> output from index-pack to allow richer communication between it and
>> its invoker.  While there is no reason to change how the first line
>> of the output looks like, we'd probably want to make sure that the
>> future versions of Git can easily tell "list of blobs that require
>> further validation" from other additional information.
>> 
>> I am not comfortable to recommend "ok, then let's add a delimiter
>> line '---\n' if/when we need to have something after the list of
>> blobs and append more stuff in future versions of Git", because we
>> may find need to emit new kinds of info before the list of blobs
>> that needs further validation, for example, in future versions of
>> Git.
>> 
>> Having said all that, the internal communication between the
>> index-pack and its caller do not need as much care about
>> compatibility across versions as output visible to end-users, so
>> when a future version of Git needs to send different kinds of
>> information in different order from what you created here, we can do
>> so pretty much freely, I would guess.
>
> Yeah, that's what I thought too - since this is an internal interface,
> we can evolve them in lockstep. If we're really worried about the Git
> binaries (on a user's system) getting out of sync, 

I'm thinking in reading "getting out of sync" that you may be missing an
aspect of the issue here.

We're not talking about some abnormal error in some packaging system,
but how we'd expect all installations of git to behave if you update
them with *.rpm, *.deb etc, e.g. when your binaries are in
/usr/libexec/git-core. I suppose NixOS or something where there's
hash-based paths may be exempt from this.

On those systems if you've got a server serving concurrent traffic and
update the "git" package you could expect failure if any git process
invoked by another is incompatible during such an upgrade.

If you browse some of the recent GIT_CONFIG_PARAMETERS discussion this
was discussed there. I.e. even if GIT_CONFIG_PARAMETERS is internal-only
we bent over backwards not to change it in such a way as to have process
A invoking process B and the two not understanding each other because of
such an upgrade.

That's exactly because of this case, where receive-pack may be started
on version A, someone runs "apt install git" in the background
concurrently, and now a version A of that program is talking to a
version B index-pack.

> we could just make sure that subsequent updates to this protocol are
> non-backwards-compatible (e.g. have index-pack emit "foo <hash>",
> where "foo" is a string that describes the new check, so that current
> fetch-pack will reject "foo" since it is not a hash).

And then presumably index-pack would die and receive-pack would die on
the push or whatever, so the push fails for the end user.
Junio C Hamano Feb. 18, 2021, 11:34 p.m. UTC | #4
Jonathan Tan <jonathantanmy@google.com> writes:

> This patch set resolves the .gitmodules-and-tree-in-separate-packfiles
> issue I mentioned in [1] by having index-pack print out all dangling
> .gitmodules (instead of returning with an error code) and then teaching
> fetch-pack to read those and run its own fsck checks after all
> index-pack invocations are complete.
>
> As part of this, index-pack has to output (1) the hash that goes into
> the name of the .pack/.idx file and (2) the hashes of all dangling
> .gitmodules. I just had (2) come after (1). If anyone has a better idea,
> I'm interested.
>
> I also discovered a bug in that different index-pack arguments were used
> when processing the inline packfile and when processing the ones
> referenced by URIs. Patch 1-3 fixes that bug by passing the arguments to
> use as a space-separated URL-encoded list. (URL-encoded so that we can
> have spaces in the arguments.) Again, if anyone has a better idea, I'm
> interested. It is only in patch 4 that we have the dangling .gitmodules
> fix.

This seems to have been stalled but I think it would be a better
approach to use a custom callback for error reporting, suggested by
Ævar, which would be where his fsck API clean-up topic would lead
to.

If it is not ultra-urgent, perhaps you can retract the ones that are
queued right now, work with Ævar to finish the error-callback work
and rebuild this topic on top of it?  Thanks.
Jonathan Tan Feb. 19, 2021, 12:46 a.m. UTC | #5
> This seems to have been stalled but I think it would be a better
> approach to use a custom callback for error reporting, suggested by
> Ævar, which would be where his fsck API clean-up topic would lead
> to.
> 
> If it is not ultra-urgent, perhaps you can retract the ones that are
> queued right now, work with Ævar to finish the error-callback work
> and rebuild this topic on top of it?  Thanks.

OK - that works. My original idea was to rewrite it using an
error-callback but using starts_with() instead of the ID that Ævar's
work will provide, but seeing that at least one other contributor (Peff)
seems OK with the patches, rebasing mine on top of his works too. I'll
also take a look at his patches.
Ævar Arnfjörð Bjarmason Feb. 19, 2021, 1:08 a.m. UTC | #6
On Fri, Feb 19 2021, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> This patch set resolves the .gitmodules-and-tree-in-separate-packfiles
>> issue I mentioned in [1] by having index-pack print out all dangling
>> .gitmodules (instead of returning with an error code) and then teaching
>> fetch-pack to read those and run its own fsck checks after all
>> index-pack invocations are complete.
>>
>> As part of this, index-pack has to output (1) the hash that goes into
>> the name of the .pack/.idx file and (2) the hashes of all dangling
>> .gitmodules. I just had (2) come after (1). If anyone has a better idea,
>> I'm interested.
>>
>> I also discovered a bug in that different index-pack arguments were used
>> when processing the inline packfile and when processing the ones
>> referenced by URIs. Patch 1-3 fixes that bug by passing the arguments to
>> use as a space-separated URL-encoded list. (URL-encoded so that we can
>> have spaces in the arguments.) Again, if anyone has a better idea, I'm
>> interested. It is only in patch 4 that we have the dangling .gitmodules
>> fix.
>
> This seems to have been stalled but I think it would be a better
> approach to use a custom callback for error reporting, suggested by
> Ævar, which would be where his fsck API clean-up topic would lead
> to.
>
> If it is not ultra-urgent, perhaps you can retract the ones that are
> queued right now, work with Ævar to finish the error-callback work
> and rebuild this topic on top of it?  Thanks.

If my vote counts for something I think it makes sense to have
Jonathan's series go first and just ignore my fsck API improvement
patches (well, the part of my v1[1] which conflicts with his work).

I'm also happy to help him queue his on top of a v1 version of my
series.

But the end result of doing so (shown after the "--" in [1]) is just a
small re-arrangement of code to get a cleaner fsck API use, it doesn't
actually matter to anyone using git.

Whereas his patches actually do, we have in-the-wild server/repo/clone
setups that are getting on-clone errors, and the window for 2.31 is
getting closer.

We can always do the small API use refactoring later. My interest in
barking up that tree was just that I've been poking at that part of the
fsck API and have some follow-up work that hasn't made it onto the list
yet that makes other use of the fsck API.

So in the longer term I wanted us to think about not needing N special
cases like "print_dangling_gitmodules" if we could help it, but in the
shorter term having it is a non-issue.

1. https://lore.kernel.org/git/20210217194246.25342-1-avarab@gmail.com/
Junio C Hamano Feb. 20, 2021, 3:29 a.m. UTC | #7
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Fri, Feb 19 2021, Junio C Hamano wrote:
>
>> This seems to have been stalled but I think it would be a better
>> approach to use a custom callback for error reporting, suggested by
>> Ævar, which would be where his fsck API clean-up topic would lead
>> to.
>>
>> If it is not ultra-urgent, perhaps you can retract the ones that are
>> queued right now, work with Ævar to finish the error-callback work
>> and rebuild this topic on top of it?  Thanks.
>
> If my vote counts for something I think it makes sense to have
> Jonathan's series go first and just ignore my fsck API improvement
> patches (well, the part of my v1[1] which conflicts with his work).
>
> I'm also happy to help him queue his on top of a v1 version of my
> series.

Either would work for us, I would think.

Thanks.
Junio C Hamano Feb. 20, 2021, 3:31 a.m. UTC | #8
Jonathan Tan <jonathantanmy@google.com> writes:

>> This seems to have been stalled but I think it would be a better
>> approach to use a custom callback for error reporting, suggested by
>> Ævar, which would be where his fsck API clean-up topic would lead
>> to.
>> 
>> If it is not ultra-urgent, perhaps you can retract the ones that are
>> queued right now, work with Ævar to finish the error-callback work
>> and rebuild this topic on top of it?  Thanks.
>
> OK - that works. My original idea was to rewrite it using an
> error-callback but using starts_with() instead of the ID that Ævar's
> work will provide, but seeing that at least one other contributor (Peff)
> seems OK with the patches, rebasing mine on top of his works too. I'll
> also take a look at his patches.

Thanks, either way would work for me, but if the suggested route
forces you review Ævar's code and work together, that would be a
good bonus point ;-)