diff mbox series

Re* [PATCH 8/9] fast-export: respect the possibly-overridden default branch name

Message ID xmqqpna5bq2l.fsf_-_@gitster.c.googlers.com (mailing list archive)
State New, archived
Headers show
Series Re* [PATCH 8/9] fast-export: respect the possibly-overridden default branch name | expand

Commit Message

Junio C Hamano June 11, 2020, 3:05 p.m. UTC
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> It is a good argument.  I also heard a rumor that often branch names
>> contain codewords given to pre-released hardware that are highly
>> confidential in certain circles, and heard that it is one of the
>> reasons why Gerrit has server side ACL that lets you hide some
>> branches from authenticated users that can access other branches.
>
> Yes, branch names in general _can_ contain information users may prefer to
> keep private.
>
> However, we're not talking about branch names in general. We are talking
> about the default name of the main branch, to be picked in _all_ of your
> new repositories.

No, we are talking about the name of the branch, chosen to be the
primary one, in one particular repository whose contents are
exported via fast-export with explicit request from the user to
anonymize end-user data.

> Yes. And you're unlikely to configure the default name to be used for all
> of your future `git init` operations to be something non-generic.
>
> Now, if you suggest that `git fast-export --anonymize` should either not
> special-case the main branch, or at least have a configurable set of names
> it skips from protecting, then I will be much more in favor of those
> suggestions. However, those suggestions are quite a bit orthogonal to the
> patch series at hand, so I would want to discuss them in their own code
> contribution instead of here.


I think after writing the message about your "two variable"
approach, you would retract the "something non-generic" part in the
above sentence.  The original "we redact branch names but 'master'
is used by and known by everybody so there is no need to redact"
would have been a good argument.  Perhaps there is a value to keep
the primary branch identifiable even in an export stream that has
all the refnames and payload anonymized, and leaving 'master' intact
would have been a viable approach for solving that issue.

That trick NO LONGER applies once you allow the name of the primary
branch customizable, and the end user has used a name that is not to
be exposed.  Yes, "we want to ensure that readers of the export
stream can identify which ref is the primary branch of the
repository" is orthogonal from "how do we make primary branch
configurable in a live repository?" and "how do we make the default
name used for the primary branch in repositories newly created?".
But because the old solution would not work in the new world order
this topic created, a new solution needs to be found when you move
the world to the new order.

An easy solution would be to reserve "ref0" for the primary branch
in the repository and anonymize other refs "ref1", "ref2", ...

That can be done as a preparatory step regardless of the "'master'
may not be in the name of the primary branch in this repository"
topic.

-- >8 --
Subject: [PATCH] fast-export: do anonymize the primary branch name

In a fast-export stream with --anonymize option, all the end-user
data including refnames are munged to prevent exposure, but the
'master' branch is left intact.

There is a comment that explains why it is OK to leave 'master'
unanonymized (because everybody calls the primary branch 'master'
and it is no secret), but that does not justify why it is bad to
anonymize 'master' and make it undistinguishable from other
branches.  Assuming there _is_ a need to allow the readers of the
output to tell where the tip of the primary branch is, let's keep
the special casing of 'master', but still anonymize it to "ref0".
Because all other branches will be given ref+N where N is a positive
integer, this will keep the primary branch identifiable in the
output stream, without exposing what the name of the primary branch
is in the repository the export stream was taken from.

This is in preparation for introducing a mechanism to affect the
name of the primary branch used in the repository.  Once the
mechanism is in use, the name of the primary branch won't be
'master', and may not be allowed to be exposed.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/fast-export.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Junio C Hamano June 11, 2020, 4:44 p.m. UTC | #1
Junio C Hamano <gitster@pobox.com> writes:

> But because the old solution would not work in the new world order
> this topic created, a new solution needs to be found when you move
> the world to the new order.
>
> An easy solution would be to reserve "ref0" for the primary branch
> in the repository and anonymize other refs "ref1", "ref2", ...
>
> That can be done as a preparatory step regardless of the "'master'
> may not be in the name of the primary branch in this repository"
> topic.
> ...
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 85868162ee..a306a60d25 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -497,7 +497,7 @@ static void *anonymize_ref_component(const void *old, size_t *len)
>  {
>  	static int counter;
>  	struct strbuf out = STRBUF_INIT;
> -	strbuf_addf(&out, "ref%d", counter++);
> +	strbuf_addf(&out, "ref%d", ++counter);
>  	return strbuf_detach(&out, len);
>  }
>  
> @@ -522,7 +522,7 @@ static const char *anonymize_refname(const char *refname)
>  	 * anything interesting.
>  	 */
>  	if (!strcmp(refname, "refs/heads/master"))
> -		return refname;
> +		return "ref0";

This is obviously wrong.  It should return "refs/heads/ref0".

But another thing we could do, which is probably more backward
compatible, is to return "refs/heads/master" from here.  That way,
consumers of "fast-export" stream that expect 'master' to be the
primary branch would not get upset when the data source runs a newer
version of Git that allows the primary branch name to be customized.

Which means that, before such a change to allow the primary branch
name to be customized happens, there is no need for such a
preparatory patch, because the status quo is just fine.  So, I'm OK
with retracting the above.  "ref0" is not special, so there is no
need to have the first hunk above, either.

However, when the customization being discussed is implemented via
the "get_default_branch_name()" and "get_primary_branch_name()"
functions, we should update these lines like so:

-	if (!strcmp(refname, "refs/heads/master"))
-		return "refs/heads/master";
+	if (!strcmp(refname, get_primary_branch_name(DO_NOT_ABBREV)))
+		return get_default_branch_name(DO_NOT_ABBREV);

That is, the name of the "primary" branch used at the data source is
replaced by the more generic "this is the default" branch name used
in a random "git init" repository.

Imagine that there is a project that has an integration branch per
each device type, named after the confidential device name, owned by
a company.  An employee of the company works on one device type in
his own clone of the repository, and the primary branch in the
repository is set to that confidential device's name.

The employee can create an anonymized output, replacing the
confidential name with a generic "default" name that is not
confidential, like 'main' in the new world order or 'master' in the
backward compatible world order, with such an updated code.

After having thought about it a bit longer, I actually do prefer to
use the "ref0" approach, as it is possible for the employee in the
above example to have the "default" branch name tied to the primary
branch for the hardware type the emploee works on in the ~/.gitconfig
so the "alternative" I suggested in this message will reveal the
confidential name.

So, I guess we should just fix the patch I am responding to to
return "refs/heads/ref0" instead of "ref0", and queue it as one of
the preparatory steps.
Junio C Hamano June 11, 2020, 6:18 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
>> index 85868162ee..a306a60d25 100644
>> --- a/builtin/fast-export.c
>> +++ b/builtin/fast-export.c
>> @@ -497,7 +497,7 @@ static void *anonymize_ref_component(const void *old, size_t *len)
>>  {
>>  	static int counter;
>>  	struct strbuf out = STRBUF_INIT;
>> -	strbuf_addf(&out, "ref%d", counter++);
>> +	strbuf_addf(&out, "ref%d", ++counter);
>>  	return strbuf_detach(&out, len);
>>  }
>>  
>> @@ -522,7 +522,7 @@ static const char *anonymize_refname(const char *refname)
>>  	 * anything interesting.
>>  	 */
>>  	if (!strcmp(refname, "refs/heads/master"))
>> -		return refname;
>> +		return "ref0";
>
> This is obviously wrong.  It should return "refs/heads/ref0".
> ...
> So, I guess we should just fix the patch I am responding to to
> return "refs/heads/ref0" instead of "ref0", and queue it as one of
> the preparatory steps.

... and the follow-up step to become part of the series you are
working on to allow customing what the primary branch is called
would turn the second hunk to

	-	if (!strcmp(refname, "refs/heads/master"))
	+	if (!strcmp(refname, get_primary_branch_name(0)))
			return "refs/heads/ref0";

By the way, with your "two variables" approach to make both the
"default" (for 'init' and 'clone') and the "primary" (for
'fmt-merge-msg' and 'fast-export') configurable, we'd need accessor
function(s) for the primary branch name for the given repository.
The get_primary_branch_name() helper function might want to take a
"struct repository *" argument in addition to "please give me an
abbreviated refname" boolean, given the recent push to pass the
struct to everybody.
Johannes Schindelin June 12, 2020, 12:03 p.m. UTC | #3
Hi Junio,

On Thu, 11 Jun 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> It is a good argument.  I also heard a rumor that often branch names
> >> contain codewords given to pre-released hardware that are highly
> >> confidential in certain circles, and heard that it is one of the
> >> reasons why Gerrit has server side ACL that lets you hide some
> >> branches from authenticated users that can access other branches.
> >
> > Yes, branch names in general _can_ contain information users may prefer to
> > keep private.
> >
> > However, we're not talking about branch names in general. We are talking
> > about the default name of the main branch, to be picked in _all_ of your
> > new repositories.
>
> No, we are talking about the name of the branch, chosen to be the
> primary one, in one particular repository whose contents are
> exported via fast-export with explicit request from the user to
> anonymize end-user data.
>
> > Yes. And you're unlikely to configure the default name to be used for all
> > of your future `git init` operations to be something non-generic.
> >
> > Now, if you suggest that `git fast-export --anonymize` should either not
> > special-case the main branch, or at least have a configurable set of names
> > it skips from protecting, then I will be much more in favor of those
> > suggestions. However, those suggestions are quite a bit orthogonal to the
> > patch series at hand, so I would want to discuss them in their own code
> > contribution instead of here.
>
>
> I think after writing the message about your "two variable"
> approach, you would retract the "something non-generic" part in the
> above sentence.

You are absolutely correct!

> [...]
> -- >8 --
> Subject: [PATCH] fast-export: do anonymize the primary branch name

I like this approach a lot. Do you want me to integrate it into this patch
series, or rather have it as a stand-alone patch?

Ciao,
Dscho
Johannes Schindelin June 12, 2020, 12:07 p.m. UTC | #4
Hi Junio,

On Thu, 11 Jun 2020, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> >> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> >> index 85868162ee..a306a60d25 100644
> >> --- a/builtin/fast-export.c
> >> +++ b/builtin/fast-export.c
> >> @@ -497,7 +497,7 @@ static void *anonymize_ref_component(const void *old, size_t *len)
> >>  {
> >>  	static int counter;
> >>  	struct strbuf out = STRBUF_INIT;
> >> -	strbuf_addf(&out, "ref%d", counter++);
> >> +	strbuf_addf(&out, "ref%d", ++counter);
> >>  	return strbuf_detach(&out, len);
> >>  }
> >>
> >> @@ -522,7 +522,7 @@ static const char *anonymize_refname(const char *refname)
> >>  	 * anything interesting.
> >>  	 */
> >>  	if (!strcmp(refname, "refs/heads/master"))
> >> -		return refname;
> >> +		return "ref0";
> >
> > This is obviously wrong.  It should return "refs/heads/ref0".
> > ...
> > So, I guess we should just fix the patch I am responding to to
> > return "refs/heads/ref0" instead of "ref0", and queue it as one of
> > the preparatory steps.
>
> ... and the follow-up step to become part of the series you are
> working on to allow customing what the primary branch is called
> would turn the second hunk to
>
> 	-	if (!strcmp(refname, "refs/heads/master"))
> 	+	if (!strcmp(refname, get_primary_branch_name(0)))
> 			return "refs/heads/ref0";

Right.

> By the way, with your "two variables" approach to make both the
> "default" (for 'init' and 'clone') and the "primary" (for
> 'fmt-merge-msg' and 'fast-export') configurable, we'd need accessor
> function(s) for the primary branch name for the given repository.
> The get_primary_branch_name() helper function might want to take a
> "struct repository *" argument in addition to "please give me an
> abbreviated refname" boolean, given the recent push to pass the
> struct to everybody.

My current state defines a `repo_main_branch_name(flags)` function where
the flags can be `MAIN_BRANCH_SHORT_NAME` and `MAIN_BRANCH_FOR_INIT`.

So I think we're on the same page.

BTW I heard from a couple sides that "primary" would imply that there is
also a "secondary" branch, and potentially an ordering of all branches,
which is why I did not really consider "primary" as candidate. Besides, it
is so much more awkward to type than "main" (especially when you're as
tired as I am right now...).

That's why I try to stay with "main" for the moment.

Ciao,
Dscho
Junio C Hamano June 12, 2020, 12:32 p.m. UTC | #5
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> BTW I heard from a couple sides that "primary" would imply that there is
> also a "secondary" branch, and potentially an ordering of all branches,
> which is why I did not really consider "primary" as candidate. Besides, it
> is so much more awkward to type than "main" (especially when you're as
> tired as I am right now...).
>
> That's why I try to stay with "main" for the moment.

You are reading too much into it.  I used "primary" because I needed
a word that clearly conveys the concept of being a special among
others, a word that is about the concept which is different from the
concept of "default", and that is clearly different from any of the
candidates floated as concrete words to replace 'master'.  Because I
wanted to say things like:

    In the context of fast-export, you'd want to special case the
    "primary" branch name, because unlike the "default" branch name
    (which we envision to be 'main' for most people in the new world
    order, so it may be less worth anonymizing because of the same
    reason the current code keeps 'master' as-is), its name can be
    sensitive.

without becoming unnecessarily ambiguous (replace 'primary' with
'main' in the above).

It was not because I do not want us to pick 'main' as the
replacement word for 'master' as the default branch name.

And a set of things, among which there is a concept of "one thing
that is special and different from all the others", does not
necessarily have to be a totally ordered set.  When cloning, we say
that the (often bare) repository at the other end indicates which
branch is the primary branch of the project by pointing at it with
its HEAD (which is the reason why we try to fork it to our local
branch namespace and check it out after cloning).  There is no
"secondary" in such a use case, and there is no need to have one.
Junio C Hamano June 12, 2020, 12:50 p.m. UTC | #6
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Subject: [PATCH] fast-export: do anonymize the primary branch name
>
> I like this approach a lot. Do you want me to integrate it into this patch
> series, or rather have it as a stand-alone patch?

I do not see any need for an off-series preparation step.  Without
configurable primary branch, keeping 'master' as 'master' used in
the current code is an OK way to anonymize refs and still keep the
primary line of development in the output identifyable.  The "ref0"
trick becomes necessary only when we introduce configurable primary;
even though it would not hurt to switch to the "ref0" approach to
anonymize-but-still-the-primary-is-identifiable early, it just is
not necessary to do so.

I'd say that we should take <xmqqpna5bq2l.fsf_-_@gitster.c.googlers.com>
with two changes,

 - see if the "refname" matches the fully
   spelled primary branch name by asking repo_main_branch_name(repo,
   0), instead of comparing with "refs/heads/master", and

 - if matches return "refs/heads/ref0" instead of "ref0".

and replace your [PATCH 8/9] (or whichever the counterpart patch is
in the rerolled series) with it.

Thanks.
Johannes Schindelin June 12, 2020, 12:53 p.m. UTC | #7
Hi Junio,

On Thu, 11 Jun 2020, Junio C Hamano wrote:

> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 85868162ee..a306a60d25 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -522,7 +522,7 @@ static const char *anonymize_refname(const char *refname)
>  	 * anything interesting.
>  	 */
>  	if (!strcmp(refname, "refs/heads/master"))
> -		return refname;
> +		return "ref0";

I just realized that the comment above reads:

        /*
         * We also leave "master" as a special case, since it does not reveal
         * anything interesting.
         */


Obviously, we need to change that comment here because we do not leave the
name unchanged. How about this?

        /*
         * We special-case the main branch, anonymizing it to `ref0`.
         */

Ciao,
Dscho

>
>  	strbuf_reset(&anon);
>  	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
>
Johannes Schindelin June 12, 2020, 1:18 p.m. UTC | #8
Hi Junio,

On Fri, 12 Jun 2020, Johannes Schindelin wrote:

> On Thu, 11 Jun 2020, Junio C Hamano wrote:
>
> > diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> > index 85868162ee..a306a60d25 100644
> > --- a/builtin/fast-export.c
> > +++ b/builtin/fast-export.c
> > @@ -522,7 +522,7 @@ static const char *anonymize_refname(const char *refname)
> >  	 * anything interesting.
> >  	 */
> >  	if (!strcmp(refname, "refs/heads/master"))
> > -		return refname;
> > +		return "ref0";
>
> I just realized that the comment above reads:
>
>         /*
>          * We also leave "master" as a special case, since it does not reveal
>          * anything interesting.
>          */
>
>
> Obviously, we need to change that comment here because we do not leave the
> name unchanged. How about this?
>
>         /*
>          * We special-case the main branch, anonymizing it to `ref0`.
>          */

Also, t9351 obviously needs to be adjusted. This one works for me:

-- snipsnap --
From: Junio C Hamano <gitster@pobox.com>
Date: Thu, 11 Jun 2020 08:05:38 -0700
Subject: [PATCH] fast-export: do anonymize the primary branch name

In a fast-export stream with --anonymize option, all the end-user
data including refnames are munged to prevent exposure, but the
'master' branch is left intact.

There is a comment that explains why it is OK to leave 'master'
unanonymized (because everybody calls the primary branch 'master'
and it is no secret), but that does not justify why it is bad to
anonymize 'master' and make it undistinguishable from other
branches.  Assuming there _is_ a need to allow the readers of the
output to tell where the tip of the primary branch is, let's keep
the special casing of 'master', but still anonymize it to "ref0".
Because all other branches will be given ref+N where N is a positive
integer, this will keep the primary branch identifiable in the
output stream, without exposing what the name of the primary branch
is in the repository the export stream was taken from.

This is in preparation for introducing a mechanism to affect the
name of the primary branch used in the repository.  Once the
mechanism is in use, the name of the primary branch won't be
'master', and may not be allowed to be exposed.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 builtin/fast-export.c            | 7 +++----
 t/t9351-fast-export-anonymize.sh | 9 +++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162eec..f10e3b35e5b 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -497,7 +497,7 @@ static void *anonymize_ref_component(const void *old, size_t *len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
-	strbuf_addf(&out, "ref%d", counter++);
+	strbuf_addf(&out, "ref%d", ++counter);
 	return strbuf_detach(&out, len);
 }

@@ -518,11 +518,10 @@ static const char *anonymize_refname(const char *refname)
 	int i;

 	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
+	 * We special-case the main branch, anonymizing it to `ref0`.
 	 */
 	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
+		return "refs/heads/ref0";

 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 897dc509075..2415f0ec213 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -26,8 +26,9 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '

-test_expect_success 'stream allows master as refname' '
-	grep master stream
+test_expect_success 'stream translates master to ref0' '
+	grep refs/heads/ref0 stream &&
+	! grep master stream
 '

 test_expect_success 'stream omits other refnames' '
@@ -57,7 +58,7 @@ test_expect_success 'import stream to new repository' '
 test_expect_success 'result has two branches' '
 	git for-each-ref --format="%(refname)" refs/heads >branches &&
 	test_line_count = 2 branches &&
-	other_branch=$(grep -v refs/heads/master branches)
+	other_branch=$(grep -v refs/heads/ref0 branches)
 '

 test_expect_success 'repo has original shape and timestamps' '
@@ -65,7 +66,7 @@ test_expect_success 'repo has original shape and timestamps' '
 		git log --format="%m %ct" --left-right --boundary "$@"
 	} &&
 	(cd .. && shape master...other) >expect &&
-	shape master...$other_branch >actual &&
+	shape ref0...$other_branch >actual &&
 	test_cmp expect actual
 '

--
2.26.0.windows.1
Junio C Hamano June 12, 2020, 3:14 p.m. UTC | #9
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I just realized that the comment above reads:
>
>         /*
>          * We also leave "master" as a special case, since it does not reveal
>          * anything interesting.
>          */
>
>
> Obviously, we need to change that comment here because we do not leave the
> name unchanged. How about this?
>
>         /*
>          * We special-case the main branch, anonymizing it to `ref0`.
>          */

If you are going to update it, why not make it useful?

I complained number of times during the discussion that the original
comment explains why leaving 'master' as-is does not reveal anything
useful to adversaries but does not justify what the code attempts to
achieve by special casing 'master' in the first place.  

It is not an improvement to literally adjust that inadequate comment
to the new world order to just parrot what the code already says
without explaining why it does so.

	/*
	 * Anonymize the name used for the primary branch in this
	 * repository, but reserve `ref0` for it, so that it can
	 * be identified among other refs in the output.
	 */

is the minimum I would expect before calling it an improvement.  We
could add

	It is often `main` for new repositories (and `master` for
	aged ones) and such well-known names may not need
	anonymizing, but it could be configured to use a secret word
	that the user may not want to reveal.

at the end to explain the motivation behind anonymizing even more,
if we wanted to.

Now, "so that ..." part is totally a fabrication based on my best
guess.  I do not know what the original author was thinking when the
decision to leave the master as-is was made.

Thanks.
Junio C Hamano June 12, 2020, 3:19 p.m. UTC | #10
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Also, t9351 obviously needs to be adjusted. This one works for me:
>
> -- snipsnap --
> From: Junio C Hamano <gitster@pobox.com>
> Date: Thu, 11 Jun 2020 08:05:38 -0700
> Subject: [PATCH] fast-export: do anonymize the primary branch name
>
> In a fast-export stream with --anonymize option, all the end-user
> data including refnames are munged to prevent exposure, but the
> 'master' branch is left intact.
> ...
> This is in preparation for introducing a mechanism to affect the
> name of the primary branch used in the repository.  Once the
> mechanism is in use, the name of the primary branch won't be
> 'master', and may not be allowed to be exposed.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  builtin/fast-export.c            | 7 +++----
>  t/t9351-fast-export-anonymize.sh | 9 +++++----
>  2 files changed, 8 insertions(+), 8 deletions(-)
> ...
>  	if (!strcmp(refname, "refs/heads/master"))
> -		return refname;
> +		return "refs/heads/ref0";

As I said already, I personally do not think that this needs to be a
preparatory patch to anonymize 'master' that cannot be configured to
something else into 'ref0'.  This will become necessary when we make
the primary branch configurable, so I think it is easier to replace
the counterpart to your [PATCH 8/9] in the original series with it
in the v2 series.

Regarding the update to the comment before this "special case", I
would suggest to explain "why" not just "what".

Thanks.
Junio C Hamano June 12, 2020, 3:22 p.m. UTC | #11
Junio C Hamano <gitster@pobox.com> writes:

> As I said already, I personally do not think that this needs to be a
> preparatory patch to anonymize 'master' that cannot be configured to
> something else into 'ref0'.  This will become necessary when we make
> the primary branch configurable, so I think it is easier to replace
> the counterpart to your [PATCH 8/9] in the original series with it
> in the v2 series.

Ah, I forgot to say, if you think it is easier to manage the main
set of patches for the topic to eject as much preparatory changes as
possible, I do not at all mind treating this as one of the
preparatory step and queue it separately, making the main series
depend on it.  I just wanted to say that it is not necessary, even
though it does not hurt.



> Regarding the update to the comment before this "special case", I
> would suggest to explain "why" not just "what".
>
> Thanks.
Johannes Schindelin June 13, 2020, 5 a.m. UTC | #12
Hi Junio,

On Fri, 12 Jun 2020, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> > As I said already, I personally do not think that this needs to be a
> > preparatory patch to anonymize 'master' that cannot be configured to
> > something else into 'ref0'.  This will become necessary when we make
> > the primary branch configurable, so I think it is easier to replace
> > the counterpart to your [PATCH 8/9] in the original series with it
> > in the v2 series.
>
> Ah, I forgot to say, if you think it is easier to manage the main
> set of patches for the topic to eject as much preparatory changes as
> possible, I do not at all mind treating this as one of the
> preparatory step and queue it separately, making the main series
> depend on it.  I just wanted to say that it is not necessary, even
> though it does not hurt.

Due to the need to adjust t9351, I think it is clearer if it is a two-part
change: one to introduce the "main branch -> ref0" change, and another one
to respect `core.mainBranch`. Those are separate concerns in my mind.

I moved the first one to the beginning of the patch series so that you're
still at liberty to take it early vs keeping it within the topic branch.

Thanks,
Dscho
Johannes Sixt June 13, 2020, 11:49 a.m. UTC | #13
Am 12.06.20 um 17:14 schrieb Junio C Hamano:
> 	/*
> 	 * Anonymize the name used for the primary branch in this
> 	 * repository, but reserve `ref0` for it, so that it can
> 	 * be identified among other refs in the output.
> 	 */
> 
> is the minimum I would expect before calling it an improvement.  We
> could add
> 
> 	It is often `main` for new repositories (and `master` for
> 	aged ones) and such well-known names may not need
> 	anonymizing, but it could be configured to use a secret word
> 	that the user may not want to reveal.
> 
> at the end to explain the motivation behind anonymizing even more,
> if we wanted to.

IMO, making the primary branch identifiable is a reasonable
justification to treat it specially. But then, why does it have to be
renamed to 'ref0'? Couldn't it just be renamed to Git's default primary
branch name, be it 'master' today or 'main' or whatever in the future?
After the anonymization, nobody will know whether that was the real name
of the primary branch or not. Leaving it at 'master'/'main' reduces the
mental burden of the recipient of the anonymous repo.

-- Hannes
Johannes Schindelin June 13, 2020, 2:44 p.m. UTC | #14
Hi Junio,

On Fri, 12 Jun 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > I just realized that the comment above reads:
> >
> >         /*
> >          * We also leave "master" as a special case, since it does not reveal
> >          * anything interesting.
> >          */
> >
> >
> > Obviously, we need to change that comment here because we do not leave the
> > name unchanged. How about this?
> >
> >         /*
> >          * We special-case the main branch, anonymizing it to `ref0`.
> >          */
>
> If you are going to update it, why not make it useful?
>
> I complained number of times during the discussion that the original
> comment explains why leaving 'master' as-is does not reveal anything
> useful to adversaries but does not justify what the code attempts to
> achieve by special casing 'master' in the first place.

True. Sorry about forgetting that when adjusting the code comment.

In my defense, I am/was much more worried about transmogrifying the patch
series to reflect the separation between `init.defaultBranch` and
`core.mainBranch` and the associated fall-out (I highly doubt that the
range-diff between v1 and v2 will be useful...).

> It is not an improvement to literally adjust that inadequate comment
> to the new world order to just parrot what the code already says
> without explaining why it does so.
>
> 	/*
> 	 * Anonymize the name used for the primary branch in this
> 	 * repository, but reserve `ref0` for it, so that it can
> 	 * be identified among other refs in the output.
> 	 */

That is indeed an improvement, thank you so much.

> is the minimum I would expect before calling it an improvement.  We
> could add
>
> 	It is often `main` for new repositories (and `master` for
> 	aged ones) and such well-known names may not need
> 	anonymizing, but it could be configured to use a secret word
> 	that the user may not want to reveal.
>
> at the end to explain the motivation behind anonymizing even more,
> if we wanted to.

Maybe we add that to the comment in the patch that teaches `fast-export`
about `core.mainBranch`? Yeah, I think I like that idea best.

> Now, "so that ..." part is totally a fabrication based on my best
> guess.  I do not know what the original author was thinking when the
> decision to leave the master as-is was made.

This comment comes from a8722750985 (teach fast-export an --anonymize
option, 2014-08-27), and I agree that there is no explicit explanation why
the main branch is special-cased.

However, I think that your guess is a good one: it might be an interesting
aspect to identify the commits from the main branch, without necessarily
needing to know the actual name of said branch, e.g. to reproduce a
reported issue.

Ciao,
Dscho
Johannes Schindelin June 13, 2020, 2:47 p.m. UTC | #15
Hi Junio & Hannes,

On Sat, 13 Jun 2020, Junio C Hamano wrote:

> Johannes Sixt <j6t@kdbg.org> writes:
>
> > Am 12.06.20 um 17:14 schrieb Junio C Hamano:
> >> 	/*
> >> 	 * Anonymize the name used for the primary branch in this
> >> 	 * repository, but reserve `ref0` for it, so that it can
> >> 	 * be identified among other refs in the output.
> >> 	 */
> >>
> >> is the minimum I would expect before calling it an improvement.  We
> >> could add
> >>
> >> 	It is often `main` for new repositories (and `master` for
> >> 	aged ones) and such well-known names may not need
> >> 	anonymizing, but it could be configured to use a secret word
> >> 	that the user may not want to reveal.
> >>
> >> at the end to explain the motivation behind anonymizing even more,
> >> if we wanted to.
> >
> > IMO, making the primary branch identifiable is a reasonable
> > justification to treat it specially. But then, why does it have to be
> > renamed to 'ref0'? Couldn't it just be renamed to Git's default primary
> > branch name, be it 'master' today or 'main' or whatever in the future?
>
> That comes from https://lore.kernel.org/git/xmqqtuzha6xn.fsf@gitster.c.googlers.com/
>
> But I agree with you 100% if you literally mean 'master' (or 'main')
> hardcoded without any end-user customization.  What I rejected and
> replaced with the vanilla "ref0" was to return the configured name
> that will be used for the primary branch in new repositories.  The
> above proposal suggested a faulty:
>
> -	if (!strcmp(refname, "refs/heads/master"))
> -		return refname;
> +	if (!strcmp(refname, get_primary_branch_name(DO_NOT_ABBREV)))
> +		return get_default_branch_name(DO_NOT_ABBREV);
>
> A corrected code should return a hardwired constant 'main' (it
> probably gets behind a C preprocessor macro, but the point is that
> we do not want end-user customization) for the reason stated in that
> message.

I like `ref0` better, for two reasons:

- it is more consistent to just have all anonymized branches be named
  `ref<N>`, and

- using `main` both for an original `main` and an original `master` can be
  a bit confusing, as the reader might assume that this branch name (as it
  does not follow the `ref<N>` convention) was _not_ anonymized, when it
  very well might have been.

Ciao,
Dscho
Junio C Hamano June 13, 2020, 4:25 p.m. UTC | #16
Johannes Sixt <j6t@kdbg.org> writes:

> Am 12.06.20 um 17:14 schrieb Junio C Hamano:
>> 	/*
>> 	 * Anonymize the name used for the primary branch in this
>> 	 * repository, but reserve `ref0` for it, so that it can
>> 	 * be identified among other refs in the output.
>> 	 */
>> 
>> is the minimum I would expect before calling it an improvement.  We
>> could add
>> 
>> 	It is often `main` for new repositories (and `master` for
>> 	aged ones) and such well-known names may not need
>> 	anonymizing, but it could be configured to use a secret word
>> 	that the user may not want to reveal.
>> 
>> at the end to explain the motivation behind anonymizing even more,
>> if we wanted to.
>
> IMO, making the primary branch identifiable is a reasonable
> justification to treat it specially. But then, why does it have to be
> renamed to 'ref0'? Couldn't it just be renamed to Git's default primary
> branch name, be it 'master' today or 'main' or whatever in the future?

That comes from https://lore.kernel.org/git/xmqqtuzha6xn.fsf@gitster.c.googlers.com/

But I agree with you 100% if you literally mean 'master' (or 'main')
hardcoded without any end-user customization.  What I rejected and
replaced with the vanilla "ref0" was to return the configured name
that will be used for the primary branch in new repositories.  The
above proposal suggested a faulty:

-	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
+	if (!strcmp(refname, get_primary_branch_name(DO_NOT_ABBREV)))
+		return get_default_branch_name(DO_NOT_ABBREV);

A corrected code should return a hardwired constant 'main' (it
probably gets behind a C preprocessor macro, but the point is that
we do not want end-user customization) for the reason stated in that
message.

Thanks.
Junio C Hamano June 13, 2020, 6:49 p.m. UTC | #17
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> A corrected code should return a hardwired constant 'main' (it
>> probably gets behind a C preprocessor macro, but the point is that
>> we do not want end-user customization) for the reason stated in that
>> message.
>
> I like `ref0` better, for two reasons:
>
> - it is more consistent to just have all anonymized branches be named
>   `ref<N>`, and
>
> - using `main` both for an original `main` and an original `master` can be
>   a bit confusing, as the reader might assume that this branch name (as it
>   does not follow the `ref<N>` convention) was _not_ anonymized, when it
>   very well might have been.

A pro for keeping a hardcoded 'master' is that it is compatible with
the current world order, and flipping it to hardcoded 'main' upon
transition is just to use the moral equivalent, so we do not need to
immediately have to change anything.  The _new_ consistency across
ref<N> does feel attractive, but because it is new, there always is
a pushback not to "fix" what is not broken.

I am personally OK either way.  

By the way, we'd need to devise a transition plan for switching the
default branch name (i.e. the name used for the primary branch in a
newly created repository unless the user configures it to some other
value) to 'main' (oh, I just found one reason why I will not want to
use that name in my project(s)---it is too close to 'maint').  

It might roughly go like:

 1. We introduce core.defaultBranchName; when it is not set, its
    value defaults to 'master' in the 1st phase of the transition.
    "git init" and "git clone" however issue a warning that says
    "unless you configure core.defaultBranchName, we use 'master'
    for now for backward compatibility but we will start using
    'main' in three major releases of Git in the future".  These
    commands use the default branch name when creating a new
    repository in the 1st phase, and set core.primaryBranchName to
    that name in the resulting repository.

    This is to encourage early adopters to set it to 'maint'^W'main'
    (eek, see, I again made that typo), while allowing those who
    have toolset that depends more heavily on the current default
    branch name than other people to set it to 'master' for
    stability.

    In the 1st phase, a few commands that care about what the
    primary branch is in a repository (i.e. fmt-merge-msg and
    fast-export are the two we have identified so far) pay attention
    to the core.primaryBranchName configuration, and default to
    'master' if the configuration does not exist.  

    These commands issue a warning that says "unless you configure
    core.primaryBranchName in the repository, we use 'master' for
    now but we will start using 'main' in three major releases of
    Git in the future".

    The above two warning messages will be squelched once the user
    sets respective configuration variable.

 2. We flip the default for the two variables from 'master' to
    'main' in three major releases of Git (i.e. 24-30 weeks from the
    1st phase).  The two warning messages added for the 1st phase
    will be reworded for the updated default.  We no longer need to
    say "in three major releases" in there.

 3. After long time passes, remove the warning.
Johannes Schindelin June 14, 2020, 8:55 a.m. UTC | #18
Hi Junio,

On Sat, 13 Jun 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> A corrected code should return a hardwired constant 'main' (it
> >> probably gets behind a C preprocessor macro, but the point is that
> >> we do not want end-user customization) for the reason stated in that
> >> message.
> >
> > I like `ref0` better, for two reasons:
> >
> > - it is more consistent to just have all anonymized branches be named
> >   `ref<N>`, and
> >
> > - using `main` both for an original `main` and an original `master` can be
> >   a bit confusing, as the reader might assume that this branch name (as it
> >   does not follow the `ref<N>` convention) was _not_ anonymized, when it
> >   very well might have been.
>
> A pro for keeping a hardcoded 'master' is that it is compatible with
> the current world order, and flipping it to hardcoded 'main' upon
> transition is just to use the moral equivalent, so we do not need to
> immediately have to change anything.  The _new_ consistency across
> ref<N> does feel attractive, but because it is new, there always is
> a pushback not to "fix" what is not broken.
>
> I am personally OK either way.
>
> By the way, we'd need to devise a transition plan for switching the
> default branch name (i.e. the name used for the primary branch in a
> newly created repository unless the user configures it to some other
> value) to 'main' (oh, I just found one reason why I will not want to
> use that name in my project(s)---it is too close to 'maint').

Yes, the trouble with `maint` did cross my mind, but I try not to
"overfit" to git/git. :-)

> It might roughly go like:
>
>  1. We introduce core.defaultBranchName; when it is not set, its
>     value defaults to 'master' in the 1st phase of the transition.
>     "git init" and "git clone" however issue a warning that says
>     "unless you configure core.defaultBranchName, we use 'master'
>     for now for backward compatibility but we will start using
>     'main' in three major releases of Git in the future".  These
>     commands use the default branch name when creating a new
>     repository in the 1st phase, and set core.primaryBranchName to
>     that name in the resulting repository.
>
>     This is to encourage early adopters to set it to 'maint'^W'main'
>     (eek, see, I again made that typo), while allowing those who
>     have toolset that depends more heavily on the current default
>     branch name than other people to set it to 'master' for
>     stability.
>
>     In the 1st phase, a few commands that care about what the
>     primary branch is in a repository (i.e. fmt-merge-msg and
>     fast-export are the two we have identified so far) pay attention
>     to the core.primaryBranchName configuration, and default to
>     'master' if the configuration does not exist.
>
>     These commands issue a warning that says "unless you configure
>     core.primaryBranchName in the repository, we use 'master' for
>     now but we will start using 'main' in three major releases of
>     Git in the future".
>
>     The above two warning messages will be squelched once the user
>     sets respective configuration variable.
>
>  2. We flip the default for the two variables from 'master' to
>     'main' in three major releases of Git (i.e. 24-30 weeks from the
>     1st phase).  The two warning messages added for the 1st phase
>     will be reworded for the updated default.  We no longer need to
>     say "in three major releases" in there.
>
>  3. After long time passes, remove the warning.

Yes, that's what I had in my mind, too (modulo the concrete part about the
three major versions, which is something I would have asked about at some
stage, thank you for answering that question already!).

Thank you,
Dscho
Junio C Hamano June 17, 2020, 8:06 p.m. UTC | #19
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Yes, the trouble with `maint` did cross my mind, but I try not to
> "overfit" to git/git. :-)

I do not think it is overfitting; if the solution cannot even
support the originating project well, there is something wrong.

Most likely, I'd be tempted to rename it myself away from any name
that is too similar to 'maint'; perhaps to 'stable' (or 'devo', h/t
tla ;-).
Johannes Schindelin June 23, 2020, 9:11 p.m. UTC | #20
Hi Junio,

On Wed, 17 Jun 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Yes, the trouble with `maint` did cross my mind, but I try not to
> > "overfit" to git/git. :-)
>
> I do not think it is overfitting; if the solution cannot even
> support the originating project well, there is something wrong.
>
> Most likely, I'd be tempted to rename it myself away from any name
> that is too similar to 'maint'; perhaps to 'stable' (or 'devo', h/t
> tla ;-).

You could also use `next` instead of `master`, which would make intuitive
sense because the commits that make it into that branch are slated to be
part of the next major Git version.

And a relatively obvious name for the current `next` might be `cooking`.

I refrained from proposing this earlier, thinking that this would be too
disruptive, but since `pu` was renamed to `seen`...

:-)

Ciao,
Dscho
Junio C Hamano June 23, 2020, 9:32 p.m. UTC | #21
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> You could also use `next` instead of `master`, which would make intuitive
> sense because the commits that make it into that branch are slated to be
> part of the next major Git version.
>
> And a relatively obvious name for the current `next` might be `cooking`.
>
> I refrained from proposing this earlier, thinking that this would be too
> disruptive, but since `pu` was renamed to `seen`...

Renaming 'pu' away from two-letter has a positive technical and
social effect.  Using 'next' for anything but what it currently
means does not have any such upside and only the downside of
confusing people.

As to 'cooking', I am not sure.  Personally I consider that the
topics that are in 'next' plus those that are soon to be in 'next'
are all 'cooking'.  But I do not think anybody's dying to rename
'next', so...
diff mbox series

Patch

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162ee..a306a60d25 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -497,7 +497,7 @@  static void *anonymize_ref_component(const void *old, size_t *len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
-	strbuf_addf(&out, "ref%d", counter++);
+	strbuf_addf(&out, "ref%d", ++counter);
 	return strbuf_detach(&out, len);
 }
 
@@ -522,7 +522,7 @@  static const char *anonymize_refname(const char *refname)
 	 * anything interesting.
 	 */
 	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
+		return "ref0";
 
 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {