Message ID | 20200619132546.GA2540774@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fast-export: allow dumping anonymization mappings | expand |
On Fri, Jun 19, 2020 at 9:25 AM Jeff King <peff@peff.net> wrote: > [...] > Let's make it possible to dump the mapping separate from the output > stream. This can be used by a bug reporter to modify their reproduction > recipe without revealing the original names (see the example in the > documentation). > [...] > Signed-off-by: Jeff King <peff@peff.net> > --- > diff --git a/builtin/fast-export.c b/builtin/fast-export.c > @@ -118,6 +120,32 @@ static int has_unshown_parent(struct commit *commit) > +static void maybe_dump_anon(FILE *out, struct seen_set *seen, > + const char *orig, const char *anon) > +{ > + if (!out) > + return; > + if (!check_and_mark_seen(seen, orig)) > + fprintf(out, "%s %s\n", orig, anon); > +} Nit: For a function which has only a single caller in this patch and one more caller added in 3/3, I wonder if the, um, convenience of this function short-circuiting as the very first thing it does outweighs the loss of clarity of having the callers make the check themselves. That is, have the caller do this instead: if (anonymized_refnames_handle) dump_anon(anonymized_refnames_handle, ...); > @@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) > + if (anonymized_refnames_file) > + anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w"); For completeness, do you want to close this file at some point? > diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh > @@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' ' > +test_expect_success 'refname mapping can be dumped' ' > + git fast-export --anonymize --all \ > + --dump-anonymized-refnames=refs.out >/dev/null && > + # we make no guarantees of the exact anonymized names, > + # so just check that we have the right number and > + # that a sample line looks sane. > + # Note that master is not anonymized, and so not included > + # in the mapping. > + test_line_count = 6 refs.out && This hard-coded "6" seems rather fragile, causing the test to break if other refs are ever added or removed. Perhaps just count the refs dynamically? git show-ref >refs && nrefs=$(wc -l refs) && # Note that master is not anonymized, and so not included # in the mapping. nrefs=$((nrefs - 1)) test_line_count = $nrefs refs.out && and then drop the 'nrefs=$((nrefs - 1))' line and associated comment in patch 2/3 which removes the "master" special case. > + grep "^refs/heads/other refs/heads/" refs.out > +'
On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote: > > +static void maybe_dump_anon(FILE *out, struct seen_set *seen, > > + const char *orig, const char *anon) > > +{ > > + if (!out) > > + return; > > + if (!check_and_mark_seen(seen, orig)) > > + fprintf(out, "%s %s\n", orig, anon); > > +} > > Nit: For a function which has only a single caller in this patch and > one more caller added in 3/3, I wonder if the, um, convenience of this > function short-circuiting as the very first thing it does outweighs > the loss of clarity of having the callers make the check themselves. > That is, have the caller do this instead: > > if (anonymized_refnames_handle) > dump_anon(anonymized_refnames_handle, ...); <shrug> The names were long enough that I thought it was more clear not to repeat myself. I actually wrote originally: if (anonymized_refnames_handle && check_and_mark_seen(&seen, full_refname)) fprintf(anonymized_refnames_handle, "%s %s\n", full_refname, anon.buf)); but I wasn't sure if it was better not to duplicate the output format. OTOH, we _could_ be using quote_path() for the path dumping, in which case the formats would start to differ. > > @@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) > > + if (anonymized_refnames_file) > > + anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w"); > > For completeness, do you want to close this file at some point? My thought was to just let it auto-close at exit() time, since it's effectively process-length. > > diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh > > @@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' ' > > +test_expect_success 'refname mapping can be dumped' ' > > + git fast-export --anonymize --all \ > > + --dump-anonymized-refnames=refs.out >/dev/null && > > + # we make no guarantees of the exact anonymized names, > > + # so just check that we have the right number and > > + # that a sample line looks sane. > > + # Note that master is not anonymized, and so not included > > + # in the mapping. > > + test_line_count = 6 refs.out && > > This hard-coded "6" seems rather fragile, causing the test to break if > other refs are ever added or removed. Perhaps just count the refs > dynamically? > > git show-ref >refs && > nrefs=$(wc -l refs) && > # Note that master is not anonymized, and so not included > # in the mapping. > nrefs=$((nrefs - 1)) > test_line_count = $nrefs refs.out && > > and then drop the 'nrefs=$((nrefs - 1))' line and associated comment > in patch 2/3 which removes the "master" special case. That's exactly what I wrote originally, but it failed on macos due to extra spaces in the "wc" output. There are other tests nearby that also count the refs (e.g., exactly 2 branches), so I didn't think it was worth trying to deal with the portability issue. -Peff
On Fri, Jun 19, 2020 at 12:01 PM Jeff King <peff@peff.net> wrote: > On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote: > > That is, have the caller do this instead: > > > > if (anonymized_refnames_handle) > > dump_anon(anonymized_refnames_handle, ...); > > <shrug> The names were long enough that I thought it was more clear not > to repeat myself. [...] Indeed, it's a minor point and subjective. Certainly not worth a re-roll or even worrying about it. > > This hard-coded "6" seems rather fragile, causing the test to break if > > other refs are ever added or removed. Perhaps just count the refs > > dynamically? > > > > git show-ref >refs && > > nrefs=$(wc -l refs) && > > # Note that master is not anonymized, and so not included > > # in the mapping. > > nrefs=$((nrefs - 1)) > > test_line_count = $nrefs refs.out && > > > That's exactly what I wrote originally, but it failed on macos due to > extra spaces in the "wc" output. Hmph, that shouldn't have failed. Did you quote the $(wc -l refs) invocation? Quoting it would cause it to fail. I just applied this atop your patch and it worked fine on Mac OS: diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 0c5dd2a4fb..849192b1b0 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -52,9 +52,12 @@ test_expect_success 'refname mapping can be dumped' ' # we make no guarantees of the exact anonymized names, # so just check that we have the right number and # that a sample line looks sane. + git show-ref >refs && + nrefs=$(wc -l <refs) && # Note that master is not anonymized, and so not included # in the mapping. - test_line_count = 6 refs.out && + nrefs=$((nrefs - 1)) && + test_line_count = $nrefs refs.out && grep "^refs/heads/other refs/heads/" refs.out '
On Fri, Jun 19, 2020 at 12:18:16PM -0400, Eric Sunshine wrote: > > That's exactly what I wrote originally, but it failed on macos due to > > extra spaces in the "wc" output. > > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs) > invocation? Quoting it would cause it to fail. Nope (and indeed, I was wary of the issue and made sure I didn't use quotes). My original was: test_expect_success 'refname mapping can be dumped' ' git fast-export --anonymize --all \ --dump-anonymized-refnames=refs.out >/dev/null && # we make no guarantees of the exact anonymized names, # so just check that we have the right number and # that a sample line looks sane. expected_count=$(git for-each-ref | wc -l) && # Note that master is not anonymized, and so not included # in the mapping. expected_count=$((expected_count - 1)) && test_line_count = "$expected_count" refs.out && grep "^refs/heads/other refs/heads/" refs.out ' So I guess I did quote the variable later. It works fine on Linux, but one of the osx ci jobs failed: https://github.com/peff/git/runs/787911270 The relevant log is: ++ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out +++ git for-each-ref +++ wc -l ++ expected_count=' 7' ++ test_line_count = ' 7' refs.out ++ test 3 '!=' 3 +++ wc -l ++ test 7 = ' 7' ++ echo 'test_line_count: line count for refs.out != 7' test_line_count: line count for refs.out != 7 so the whitespace is eaten not when "wc" is run, but rather when the variable is expanded. -Peff
On Fri, Jun 19, 2020 at 1:45 PM Jeff King <peff@peff.net> wrote: > On Fri, Jun 19, 2020 at 12:18:16PM -0400, Eric Sunshine wrote: > > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs) > > invocation? Quoting it would cause it to fail. > > Nope (and indeed, I was wary of the issue and made sure I didn't use > quotes). My original was: > > test_expect_success 'refname mapping can be dumped' ' > [...] > expected_count=$(git for-each-ref | wc -l) && > expected_count=$((expected_count - 1)) && > test_line_count = "$expected_count" refs.out && > > So I guess I did quote the variable later. It works fine on Linux, but > one of the osx ci jobs failed: > > ++ expected_count=' 7' > ++ test_line_count = ' 7' refs.out > ++ test 7 = ' 7' > ++ echo 'test_line_count: line count for refs.out != 7' > > so the whitespace is eaten not when "wc" is run, but rather when the > variable is expanded. Not something that should be done by this series (more a left-over-bitty thing, perhaps), but this almost suggests that test_line_count() deserves a tweak to make it more robust against that sort of thing: test_line_count () { if test $# != 3 then BUG "not 3 parameters to test_line_count" elif ! test $(wc -l <"$3") "$1" "$2" then echo "test_line_count: line count for $3 !$1 $2" cat "$3" return 1 fi } If we drop the quotes around $2 from the 'test': elif ! test $(wc -l <"$3") "$1" $2 then your code would have worked as expected. My only worry about that is that a poorly written caller would get a weird and unhelpful error message: test_line_count = 4 4 --> sh: test: too many arguments
Eric Sunshine <sunshine@sunshineco.com> writes: > On Fri, Jun 19, 2020 at 12:01 PM Jeff King <peff@peff.net> wrote: >> On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote: >> > That is, have the caller do this instead: >> > >> > if (anonymized_refnames_handle) >> > dump_anon(anonymized_refnames_handle, ...); >> >> <shrug> The names were long enough that I thought it was more clear not >> to repeat myself. [...] > > Indeed, it's a minor point and subjective. Certainly not worth a > re-roll or even worrying about it. It probably is subjective but fwiw I too find yours easier to follow. >> > This hard-coded "6" seems rather fragile, causing the test to break if >> > other refs are ever added or removed. Perhaps just count the refs >> > dynamically? >> > >> > git show-ref >refs && >> > nrefs=$(wc -l refs) && >> > # Note that master is not anonymized, and so not included >> > # in the mapping. >> > nrefs=$((nrefs - 1)) >> > test_line_count = $nrefs refs.out && >> > >> That's exactly what I wrote originally, but it failed on macos due to >> extra spaces in the "wc" output. > > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs) > invocation? Quoting it would cause it to fail. > + git show-ref >refs && > + nrefs=$(wc -l <refs) && Yup, I've seen that workaround for macs too many times and it should work well. > # Note that master is not anonymized, and so not included > # in the mapping. > - test_line_count = 6 refs.out && > + nrefs=$((nrefs - 1)) && > + test_line_count = $nrefs refs.out && > grep "^refs/heads/other refs/heads/" refs.out > '
On Fri, Jun 19, 2020 at 02:00:47PM -0400, Eric Sunshine wrote: > > so the whitespace is eaten not when "wc" is run, but rather when the > > variable is expanded. > > Not something that should be done by this series (more a > left-over-bitty thing, perhaps), but this almost suggests that > test_line_count() deserves a tweak to make it more robust against that > sort of thing: > > test_line_count () { > if test $# != 3 > then > BUG "not 3 parameters to test_line_count" > elif ! test $(wc -l <"$3") "$1" "$2" > then > echo "test_line_count: line count for $3 !$1 $2" > cat "$3" > return 1 > fi > } > > If we drop the quotes around $2 from the 'test': > > elif ! test $(wc -l <"$3") "$1" $2 > > then your code would have worked as expected. > > My only worry about that is that a poorly written caller would get a > weird and unhelpful error message: > > test_line_count = 4 4 > --> sh: test: too many arguments I think your unhelpful-error-message case would happen only if the length argument contains two non-whitespace tokens separated by a whitespace (so the shell splits them into two arguments), _and_ the caller passed that argument in quotes (otherwise the shell would split it at the function call and we'd hit the BUG message). In which case, what are they trying to do with passing "4 4" to "test"? And since we're using "=" and not "-eq", I think "test" would be complaining about that. So it seems like it might be a reasonable change to make things more friendly. -Peff
On Fri, Jun 19, 2020 at 12:20:38PM -0700, Junio C Hamano wrote: > >> <shrug> The names were long enough that I thought it was more clear not > >> to repeat myself. [...] > > > > Indeed, it's a minor point and subjective. Certainly not worth a > > re-roll or even worrying about it. > > It probably is subjective but fwiw I too find yours easier to > follow. I ended up doing it this way, since I decided to quote the pathnames (and thus they needed their own fprintf, at which point there was really nothing left for the two to share). > > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs) > > invocation? Quoting it would cause it to fail. > > > + git show-ref >refs && > > + nrefs=$(wc -l <refs) && > > Yup, I've seen that workaround for macs too many times and it should > work well. I switched to this style in the re-roll (and ditto for the path output, which actually turned up a bug). -Peff
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt index e8950de3ba..e809bb3f18 100644 --- a/Documentation/git-fast-export.txt +++ b/Documentation/git-fast-export.txt @@ -119,6 +119,12 @@ by keeping the marks the same across runs. the shape of the history and stored tree. See the section on `ANONYMIZING` below. +--dump-anonymized-refnames=<file>:: + Output the mapping of real refnames to anonymized refnames to + <file>. The output will contain one line per ref that appears in + the output stream, with the original refname, a space, and its + anonymized counterpart. See the section on `ANONYMIZING` below. + --reference-excluded-parents:: By default, running a command such as `git fast-export master~5..master` will not include the commit master{tilde}5 @@ -238,6 +244,22 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much smaller output, and it is usually easy to quickly confirm that there is no private data in the stream. +Reproducing some bugs may require referencing particular commits, which +becomes challenging after the refnames have all been anonymized. You can +use `--dump-anonymized-refnames` to output the mapping, and then alter +your reproduction recipe to use the anonymized names. E.g., if you find +a bug with `git rev-list v1.0..v2.0` in the private repository, you can +run: + +--------------------------------------------------- +$ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out >stream +$ grep '^refs/tags/v[12].0' refs.out +refs/tags/v1.0 refs/tags/ref31 +refs/tags/v2.0 refs/tags/ref50 +--------------------------------------------------- + +which tells you that `git rev-list ref31..ref50` may produce the same +bug in the re-imported anonymous repository. LIMITATIONS ----------- diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 85868162ee..6caea6f290 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -24,6 +24,7 @@ #include "remote.h" #include "blob.h" #include "commit-slab.h" +#include "khash.h" static const char *fast_export_usage[] = { N_("git fast-export [rev-list-opts]"), @@ -45,6 +46,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP; static struct string_list tag_refs = STRING_LIST_INIT_NODUP; static struct refspec refspecs = REFSPEC_INIT_FETCH; static int anonymize; +static FILE *anonymized_refnames_handle; static struct revision_sources revision_sources; static int parse_opt_signed_tag_mode(const struct option *opt, @@ -118,6 +120,32 @@ static int has_unshown_parent(struct commit *commit) return 0; } +KHASH_INIT(strset, const char *, int, 0, kh_str_hash_func, kh_str_hash_equal); + +struct seen_set { + kh_strset_t *set; +}; + +static int check_and_mark_seen(struct seen_set *seen, const char *str) +{ + int hashret; + if (!seen->set) + seen->set = kh_init_strset(); + if (kh_get_strset(seen->set, str) < kh_end(seen->set)) + return 1; + kh_put_strset(seen->set, xstrdup(str), &hashret); + return 0; +} + +static void maybe_dump_anon(FILE *out, struct seen_set *seen, + const char *orig, const char *anon) +{ + if (!out) + return; + if (!check_and_mark_seen(seen, orig)) + fprintf(out, "%s %s\n", orig, anon); +} + struct anonymized_entry { struct hashmap_entry hash; const char *orig; @@ -515,6 +543,8 @@ static const char *anonymize_refname(const char *refname) }; static struct hashmap refs; static struct strbuf anon = STRBUF_INIT; + static struct seen_set seen; + const char *full_refname = refname; int i; /* @@ -533,6 +563,8 @@ static const char *anonymize_refname(const char *refname) } anonymize_path(&anon, refname, &refs, anonymize_ref_component); + maybe_dump_anon(anonymized_refnames_handle, &seen, + full_refname, anon.buf); return anon.buf; } @@ -1144,6 +1176,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) char *export_filename = NULL, *import_filename = NULL, *import_filename_if_exists = NULL; + const char *anonymized_refnames_file = NULL; uint32_t lastimportid; struct string_list refspecs_list = STRING_LIST_INIT_NODUP; struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP; @@ -1177,6 +1210,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"), N_("Apply refspec to exported refs")), OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")), + OPT_STRING(0, "dump-anonymized-refnames", + &anonymized_refnames_file, N_("file"), + N_("output anonymized refname mapping to <file>")), OPT_BOOL(0, "reference-excluded-parents", &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")), OPT_BOOL(0, "show-original-ids", &show_original_ids, @@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) string_list_clear(&refspecs_list, 1); } + if (anonymized_refnames_file) + anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w"); + if (use_done_feature) printf("feature done\n"); diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 897dc50907..0c5dd2a4fb 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' ' ! grep "annotated tag" stream ' +test_expect_success 'refname mapping can be dumped' ' + git fast-export --anonymize --all \ + --dump-anonymized-refnames=refs.out >/dev/null && + # we make no guarantees of the exact anonymized names, + # so just check that we have the right number and + # that a sample line looks sane. + # Note that master is not anonymized, and so not included + # in the mapping. + test_line_count = 6 refs.out && + grep "^refs/heads/other refs/heads/" refs.out +' + # NOTE: we chdir to the new, anonymized repository # after this. All further tests should assume this. test_expect_success 'import stream to new repository' '