[v2,1/3] Move init_skiplist() outside of fsck

Message ID	20190117202919.157326-2-brho@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Thu, 17 Jan 2019 15:29:17 -0500 In-Reply-To: <20190117202919.157326-1-brho@google.com> Message-Id: <20190117202919.157326-2-brho@google.com> Mime-Version: 1.0 References: <20190107213013.231514-1-brho@google.com> <20190117202919.157326-1-brho@google.com> Subject: [PATCH v2 1/3] Move init_skiplist() outside of fsck From: Barret Rhoden <brho@google.com> To: git@vger.kernel.org Cc: " =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= " <avarab@gmail.com>, David Kastrup <dak@gnu.org>, Jeff King <peff@peff.net>, Jeff Smith <whydoubt@gmail.com>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, Junio C Hamano <gitster@pobox.com>, " =?utf-8?q?Ren=C3=A9_Scharfe?= " <l.s.r@web.de>, Stefan Beller <stefanbeller@gmail.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	blame: add the ability to ignore commits \| expand [v2,0/3] blame: add the ability to ignore commits [v2,1/3] Move init_skiplist() outside of fsck [v2,2/3] blame: add the ability to ignore commits and their changes [v2,3/3] blame: add a config option to mark ignored lines

Barret Rhoden Jan. 17, 2019, 8:29 p.m. UTC

init_skiplist() took a file consisting of SHA-1s and comments and added
the objects to an oidset.  This functionality is useful for other
commands.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 fsck.c   | 37 +------------------------------------
 oidset.c | 35 +++++++++++++++++++++++++++++++++++
 oidset.h |  7 +++++++
 3 files changed, 43 insertions(+), 36 deletions(-)

Johannes Schindelin Jan. 18, 2019, 9:25 a.m. UTC | #1

Hi Barret,

On Thu, 17 Jan 2019, Barret Rhoden wrote:

> init_skiplist() took a file consisting of SHA-1s and comments and added
> the objects to an oidset.  This functionality is useful for other
> commands.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>

This patch looks good, I have just one small suggestion: SHA-1's days are
counted. We already know the roadmap, that we want to use SHA-256 instead
at some stage. Why not talk about "object hashes" instead of "SHA-1s"?

Thanks,
Johannes

> ---
>  fsck.c   | 37 +------------------------------------
>  oidset.c | 35 +++++++++++++++++++++++++++++++++++
>  oidset.h |  7 +++++++
>  3 files changed, 43 insertions(+), 36 deletions(-)
> 
> diff --git a/fsck.c b/fsck.c
> index 68502ce85b11..80b53e6f4968 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -181,41 +181,6 @@ static int fsck_msg_type(enum fsck_msg_id msg_id,
>  	return msg_type;
>  }
>  
> -static void init_skiplist(struct fsck_options *options, const char *path)
> -{
> -	FILE *fp;
> -	struct strbuf sb = STRBUF_INIT;
> -	struct object_id oid;
> -
> -	fp = fopen(path, "r");
> -	if (!fp)
> -		die("Could not open skip list: %s", path);
> -	while (!strbuf_getline(&sb, fp)) {
> -		const char *p;
> -		const char *hash;
> -
> -		/*
> -		 * Allow trailing comments, leading whitespace
> -		 * (including before commits), and empty or whitespace
> -		 * only lines.
> -		 */
> -		hash = strchr(sb.buf, '#');
> -		if (hash)
> -			strbuf_setlen(&sb, hash - sb.buf);
> -		strbuf_trim(&sb);
> -		if (!sb.len)
> -			continue;
> -
> -		if (parse_oid_hex(sb.buf, &oid, &p) || *p != '\0')
> -			die("Invalid SHA-1: %s", sb.buf);
> -		oidset_insert(&options->skiplist, &oid);
> -	}
> -	if (ferror(fp))
> -		die_errno("Could not read '%s'", path);
> -	fclose(fp);
> -	strbuf_release(&sb);
> -}
> -
>  static int parse_msg_type(const char *str)
>  {
>  	if (!strcmp(str, "error"))
> @@ -284,7 +249,7 @@ void fsck_set_msg_types(struct fsck_options *options, const char *values)
>  		if (!strcmp(buf, "skiplist")) {
>  			if (equal == len)
>  				die("skiplist requires a path");
> -			init_skiplist(options, buf + equal + 1);
> +			oidset_parse_file(&options->skiplist, buf + equal + 1);
>  			buf += len + 1;
>  			continue;
>  		}
> diff --git a/oidset.c b/oidset.c
> index fe4eb921df81..a4f38a040320 100644
> --- a/oidset.c
> +++ b/oidset.c
> @@ -35,3 +35,38 @@ void oidset_clear(struct oidset *set)
>  	kh_release_oid(&set->set);
>  	oidset_init(set, 0);
>  }
> +
> +void oidset_parse_file(struct oidset *set, const char *path)
> +{
> +	FILE *fp;
> +	struct strbuf sb = STRBUF_INIT;
> +	struct object_id oid;
> +
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		die("Could not open skip list: %s", path);
> +	while (!strbuf_getline(&sb, fp)) {
> +		const char *p;
> +		const char *hash;
> +
> +		/*
> +		 * Allow trailing comments, leading whitespace
> +		 * (including before commits), and empty or whitespace
> +		 * only lines.
> +		 */
> +		hash = strchr(sb.buf, '#');
> +		if (hash)
> +			strbuf_setlen(&sb, hash - sb.buf);
> +		strbuf_trim(&sb);
> +		if (!sb.len)
> +			continue;
> +
> +		if (parse_oid_hex(sb.buf, &oid, &p) || *p != '\0')
> +			die("Invalid SHA-1: %s", sb.buf);
> +		oidset_insert(set, &oid);
> +	}
> +	if (ferror(fp))
> +		die_errno("Could not read '%s'", path);
> +	fclose(fp);
> +	strbuf_release(&sb);
> +}
> diff --git a/oidset.h b/oidset.h
> index c9d0f6d3cc8b..a3452eb7de84 100644
> --- a/oidset.h
> +++ b/oidset.h
> @@ -73,6 +73,13 @@ int oidset_remove(struct oidset *set, const struct object_id *oid);
>   */
>  void oidset_clear(struct oidset *set);
>  
> +/**
> + * Add the contents of the file 'path' to an initialized oidset.  Each line is
> + * an unabbreviated SHA-1.  Comments begin with '#', and trailing comments are
> + * allowed.  Leading whitespace and empty or white-space only lines are ignored.
> + */
> +void oidset_parse_file(struct oidset *set, const char *path);
> +
>  struct oidset_iter {
>  	kh_oid_t *set;
>  	khiter_t iter;
> -- 
> 2.20.1.321.g9e740568ce-goog
> 
>

Ævar Arnfjörð Bjarmason Jan. 18, 2019, 9:45 a.m. UTC | #2

On Thu, Jan 17 2019, Barret Rhoden wrote:

> -		die("Could not open skip list: %s", path);
> [...]
> +		die("Could not open skip list: %s", path);

You're just moving this around, but now that this has two uses let's say
"Could not open SHA-1 list; %s" or something like that.

> +			die("Invalid SHA-1: %s", sb.buf);

Unlike Johannes I think it's fine to leave this. This file-format is
SHA-1 only now. We can cross the bridge of making it (and others)
SHA-256 somehow when we come to that, whether that'll be allowing
variable width or a different file.

Junio C Hamano Jan. 18, 2019, 5:36 p.m. UTC | #3

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Jan 17 2019, Barret Rhoden wrote:
>
>> -		die("Could not open skip list: %s", path);
>> [...]
>> +		die("Could not open skip list: %s", path);
>
> You're just moving this around, but now that this has two uses let's say
> "Could not open SHA-1 list; %s" or something like that.
>
>> +			die("Invalid SHA-1: %s", sb.buf);
>
> Unlike Johannes I think it's fine to leave this. This file-format is
> SHA-1 only now. We can cross the bridge of making it (and others)
> SHA-256 somehow when we come to that, whether that'll be allowing
> variable width or a different file.

I tend to agree.  The Documentation/glossary-contents.txt makes it
clear that "object name" is the most formal term to use here, with
synonyms like "object identifier" and much less formal "hash".  For
now, "SHA-1" is good enough, even though "object name" is acceptable
if we really want to future-proof.  But I would suspect that people
would colloquially keep saying Shaah-one even when we start using
different hash function(s), so such a future-proofing may not be
worth it ;-)

Johannes Schindelin Jan. 18, 2019, 8:59 p.m. UTC | #4

Hi,

On Fri, 18 Jan 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > On Thu, Jan 17 2019, Barret Rhoden wrote:
> >
> >> -		die("Could not open skip list: %s", path);
> >> [...]
> >> +		die("Could not open skip list: %s", path);
> >
> > You're just moving this around, but now that this has two uses let's say
> > "Could not open SHA-1 list; %s" or something like that.
> >
> >> +			die("Invalid SHA-1: %s", sb.buf);
> >
> > Unlike Johannes I think it's fine to leave this. This file-format is
> > SHA-1 only now. We can cross the bridge of making it (and others)
> > SHA-256 somehow when we come to that, whether that'll be allowing
> > variable width or a different file.
> 
> I tend to agree.  The Documentation/glossary-contents.txt makes it
> clear that "object name" is the most formal term to use here, with
> synonyms like "object identifier" and much less formal "hash".  For
> now, "SHA-1" is good enough, even though "object name" is acceptable
> if we really want to future-proof.  But I would suspect that people
> would colloquially keep saying Shaah-one even when we start using
> different hash function(s), so such a future-proofing may not be
> worth it ;-)

By that reasoning all the preparatory work for switching to SHA-256 and
making the references in the Git code base less tied to SHA-1 would be
irrelevant now, "because we can cross that bridge when we reach it".

You are suggesting to incur technical debt here. Let's be smarter about
this. We do not *have* to incur said technical debt. Nothing (except
mental laziness) makes use do that.

Instead, we can make our load "when we reach that bridge" a lot lighter
by already doing the right thing.

BTW I totally disagree that the skip list is bound to be SHA-1. It is
bound to be a list of object names, that's what its purpose is, and just
because we happen to not yet support other hash algorithms but SHA-1 does
not mean that the skip list is fixed to SHA-1. It'll always be whatever
hash algorithm is used in the current repository.

So no, introducing mentions of "SHA-1" *now* is not a smart thing to do.

Ciao,
Johannes

Jeff King Jan. 18, 2019, 9:30 p.m. UTC | #5

On Fri, Jan 18, 2019 at 09:59:21PM +0100, Johannes Schindelin wrote:

> By that reasoning all the preparatory work for switching to SHA-256 and
> making the references in the Git code base less tied to SHA-1 would be
> irrelevant now, "because we can cross that bridge when we reach it".
> 
> You are suggesting to incur technical debt here. Let's be smarter about
> this. We do not *have* to incur said technical debt. Nothing (except
> mental laziness) makes use do that.
> 
> Instead, we can make our load "when we reach that bridge" a lot lighter
> by already doing the right thing.
> 
> BTW I totally disagree that the skip list is bound to be SHA-1. It is
> bound to be a list of object names, that's what its purpose is, and just
> because we happen to not yet support other hash algorithms but SHA-1 does
> not mean that the skip list is fixed to SHA-1. It'll always be whatever
> hash algorithm is used in the current repository.

Yeah, I agree with this. In particular, the code has already been
modified to use "struct object_id" and parse_oid_hex(). So it is not
even like somebody will have to come through later and fix the
implementation here, and while they're at it change the "SHA-1" in the
message. It has literally already been fixed, and is just waiting on
parse_oid_hex() to learn about the new hashes behind the scenes.

IMHO the conversion to object_id probably would have been the time to
fix that message so we would not even have to be revisiting the
discussion now. But that conversion was such a monumental pain it is
hard to fault the authors for not picking up every scrap at that moment. ;)

That is no excuse not to do it now, though.

-Peff

Ævar Arnfjörð Bjarmason Jan. 18, 2019, 10:26 p.m. UTC | #6

On Fri, Jan 18 2019, Jeff King wrote:

> On Fri, Jan 18, 2019 at 09:59:21PM +0100, Johannes Schindelin wrote:
>
>> By that reasoning all the preparatory work for switching to SHA-256 and
>> making the references in the Git code base less tied to SHA-1 would be
>> irrelevant now, "because we can cross that bridge when we reach it".
>>
>> You are suggesting to incur technical debt here. Let's be smarter about
>> this. We do not *have* to incur said technical debt. Nothing (except
>> mental laziness) makes use do that.
>>
>> Instead, we can make our load "when we reach that bridge" a lot lighter
>> by already doing the right thing.
>>
>> BTW I totally disagree that the skip list is bound to be SHA-1. It is
>> bound to be a list of object names, that's what its purpose is, and just
>> because we happen to not yet support other hash algorithms but SHA-1 does
>> not mean that the skip list is fixed to SHA-1. It'll always be whatever
>> hash algorithm is used in the current repository.
>
> Yeah, I agree with this. In particular, the code has already been
> modified to use "struct object_id" and parse_oid_hex(). So it is not
> even like somebody will have to come through later and fix the
> implementation here, and while they're at it change the "SHA-1" in the
> message. It has literally already been fixed, and is just waiting on
> parse_oid_hex() to learn about the new hashes behind the scenes.
>
> IMHO the conversion to object_id probably would have been the time to
> fix that message so we would not even have to be revisiting the
> discussion now. But that conversion was such a monumental pain it is
> hard to fault the authors for not picking up every scrap at that moment. ;)
>
> That is no excuse not to do it now, though.

I stand corrected, I thought these still needed to be updated to parse
anything that wasn't 40 chars, since I hadn't seen anything about these
formats in the hash transition document.

So fair enough, let's change that while we're at it, but this seems like
something that needs to be planned for in more detail / documented in
the hash transition doc.

I.e. many (e.g. me) maintain some system-wide skiplist for strict fsck
cloning of legacy repos. So I can see there being some need for a
SHA1<->SHA256 map in this case, but since these files might stretch
across repo boundaries and not be checked into the repo itself this is a
new use-case that needs thinking about.

But now that I think about it this sort of thing would be a good
use-case for just fixing these various historical fsck issues while
we're at it when possible, e.g. "missing space before email" (probably
not all could be unambiguously fixed). So instead of sha256<->sha1
fn(sha256)<->fn(sha1)[1]?

1. https://public-inbox.org/git/87ftyyedqd.fsf@evledraar.gmail.com/

Jeff King Jan. 22, 2019, 7:12 a.m. UTC | #7

On Fri, Jan 18, 2019 at 11:26:29PM +0100, Ævar Arnfjörð Bjarmason wrote:

> I stand corrected, I thought these still needed to be updated to parse
> anything that wasn't 40 chars, since I hadn't seen anything about these
> formats in the hash transition document.
> 
> So fair enough, let's change that while we're at it, but this seems like
> something that needs to be planned for in more detail / documented in
> the hash transition doc.
> 
> I.e. many (e.g. me) maintain some system-wide skiplist for strict fsck
> cloning of legacy repos. So I can see there being some need for a
> SHA1<->SHA256 map in this case, but since these files might stretch
> across repo boundaries and not be checked into the repo itself this is a
> new use-case that needs thinking about.

My assumption had been that changing your local repository would be a
(local) flag day, and you'd update any ancillary files like skiplists,
mailmap.blob, etc at the same time. I'm not opposed to making those
features more clever, though.

> But now that I think about it this sort of thing would be a good
> use-case for just fixing these various historical fsck issues while
> we're at it when possible, e.g. "missing space before email" (probably
> not all could be unambiguously fixed). So instead of sha256<->sha1
> fn(sha256)<->fn(sha1)[1]?

That is a very tempting thing to do, but I think it comes with its own
complications. We do not want to do fn(sha1), I don't think; the reason
we care about sha1 at all is that those hashes are already set in stone.

There could be a "clean up the data as we convert to sha256" operation,
but:

  - it needs to be set in stone from day 1, I'd think. The last thing we
    want is to modify it after conversions are in the wild

  - I think we need to be bi-directional. So it must be a mapping that
    can be undone to retrieve the original bytes, so we can compute
    their "real" sha1.

At which point, I think it might be simpler to just make git more
permissive with respect to those minor data errors (and in fact, we are
already pretty permissive for the most part in non-fsck operations).

-Peff

Ævar Arnfjörð Bjarmason Jan. 22, 2019, 9:46 a.m. UTC | #8

On Tue, Jan 22 2019, Jeff King wrote:

> On Fri, Jan 18, 2019 at 11:26:29PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> I stand corrected, I thought these still needed to be updated to parse
>> anything that wasn't 40 chars, since I hadn't seen anything about these
>> formats in the hash transition document.
>>
>> So fair enough, let's change that while we're at it, but this seems like
>> something that needs to be planned for in more detail / documented in
>> the hash transition doc.
>>
>> I.e. many (e.g. me) maintain some system-wide skiplist for strict fsck
>> cloning of legacy repos. So I can see there being some need for a
>> SHA1<->SHA256 map in this case, but since these files might stretch
>> across repo boundaries and not be checked into the repo itself this is a
>> new use-case that needs thinking about.
>
> My assumption had been that changing your local repository would be a
> (local) flag day, and you'd update any ancillary files like skiplists,
> mailmap.blob, etc at the same time. I'm not opposed to making those
> features more clever, though.
>
>> But now that I think about it this sort of thing would be a good
>> use-case for just fixing these various historical fsck issues while
>> we're at it when possible, e.g. "missing space before email" (probably
>> not all could be unambiguously fixed). So instead of sha256<->sha1
>> fn(sha256)<->fn(sha1)[1]?
>
> That is a very tempting thing to do, but I think it comes with its own
> complications. We do not want to do fn(sha1), I don't think; the reason
> we care about sha1 at all is that those hashes are already set in stone.
>
> There could be a "clean up the data as we convert to sha256" operation,
> but:
>
>   - it needs to be set in stone from day 1, I'd think. The last thing we
>     want is to modify it after conversions are in the wild
>
>   - I think we need to be bi-directional. So it must be a mapping that
>     can be undone to retrieve the original bytes, so we can compute
>     their "real" sha1.

It needing to be bidirectional is a very good point, and I think that
makes my suggestion a non-starter. Thanks.

> At which point, I think it might be simpler to just make git more
> permissive with respect to those minor data errors (and in fact, we are
> already pretty permissive for the most part in non-fsck operations).

Yeah it's probably better to make some of these "errors" softer
warnings.

The X-Y issue I have is that I turned on transfer.fsckObjects, so then I
can't clone repos with various minor historical issues in commit headers
etc., so I maintain a big skip list. But what I was actually after was
fsck checks like the .gitmodules security check.

Of course I could chase them all down and turn them into
warn/error/ignore individually, but it would be better if we e.g. had
some way to say "serious things error, minor things warn", maybe with
the option of only having the looser version on fetch but not recieve
with the principle that we should be loose in what we accept from
existing data but strict with new data #leftoverbits

Junio C Hamano Jan. 22, 2019, 5:54 p.m. UTC | #9

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> It needing to be bidirectional is a very good point, and I think that
> makes my suggestion a non-starter. Thanks.

Yes, it is a bit sad that we need to carry the mistakes forward
while moving to the new hash, but bidi convertibility is a must
for the transition to work smoothly, I think.

Thanks for a good discussion.  FWIW, on the original issue that
brought it up, I think using "object name" from the glossary to
move away from saying "SHA-1" would be good.

Jeff King Jan. 22, 2019, 6:28 p.m. UTC | #10

On Tue, Jan 22, 2019 at 10:46:56AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > At which point, I think it might be simpler to just make git more
> > permissive with respect to those minor data errors (and in fact, we are
> > already pretty permissive for the most part in non-fsck operations).
> 
> Yeah it's probably better to make some of these "errors" softer
> warnings.
> 
> The X-Y issue I have is that I turned on transfer.fsckObjects, so then I
> can't clone repos with various minor historical issues in commit headers
> etc., so I maintain a big skip list. But what I was actually after was
> fsck checks like the .gitmodules security check.
>
> Of course I could chase them all down and turn them into
> warn/error/ignore individually, but it would be better if we e.g. had
> some way to say "serious things error, minor things warn", maybe with
> the option of only having the looser version on fetch but not recieve
> with the principle that we should be loose in what we accept from
> existing data but strict with new data #leftoverbits

Yeah, I think the current state here is rather unfortunate. The worst
part is that many of the things _are_ marked as warnings, but we reject
transfers even for warnings. So now we have "info" as well, which is
really just silly.

I think the big blocker to simply loosening "warning" is that the
current severities are pretty arbitrary. MISSING_NAME_BEFORE_EMAIL
probably ought to be warning, but it's an warning. Whereas HAS_DOTGIT is
a warning, but has pretty serious security implications.

So that does not save you from chasing them all down, but if you do, at
least the work could benefit everybody. ;)

-Peff

[v2,1/3] Move init_skiplist() outside of fsck

Commit Message

Comments

Patch