diff mbox series

[v2,RFC] git-p4: improve encoding handling to support inconsistent encodings

Message ID pull.1206.v2.git.1649831069578.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v2,RFC] git-p4: improve encoding handling to support inconsistent encodings | expand

Commit Message

Tao Klerks April 13, 2022, 6:24 a.m. UTC
From: Tao Klerks <tao@klerks.biz>

git-p4 is designed to run correctly under python2.7 and python3, but
its functional behavior wrt importing user-entered text differs across
these environments:

Under python2, git-p4 "naively" writes the Perforce bytestream into git
metadata (and does not set an "encoding" header on the commits); this
means that any non-utf-8 byte sequences end up creating invalidly-encoded
commit metadata in git.

Under python3, git-p4 attempts to decode the Perforce bytestream as utf-8
data, and fails badly (with an unhelpful error) when non-utf-8 data is
encountered.

Perforce clients (especially p4v) encourage user entry of changelist
descriptions (and user full names) in OS-local encoding, and store the
resulting bytestream to the server unmodified - such that different
clients can end up creating mutually-unintelligible messages. The most
common inconsistency, in many Perforce environments, is likely to be utf-8
(typical in linux) vs cp-1252 (typical in windows).

Make the changelist-description- and user-fullname-handling code
python-runtime-agnostic, introducing three "strategies" selectable via
config:
- 'legacy', behaving as previously under python2,
- 'strict', behaving as previously under python3, and
- 'fallback', favoring utf-8 but supporting a secondary encoding when
utf-8 decoding fails, and finally replacing remaining unmappable bytes.

Keep the python2 default behavior as-is ('legacy' strategy), but switch
the python3 default strategy to 'fallback' with fallback encoding
'cp1252'.

Also include tests exercising these encoding strategies, documentation for
the new config, and improve the user-facing error messages when decoding
does fail.

Signed-off-by: Tao Klerks <tao@klerks.biz>
---
    RFC: Git p4 encoding strategy
    
    OPEN QUESTIONS:
    
     * Does it make sense to make "fallback" the default decoding strategy
       in python3? This is definitely a change in behavior, but I believe
       for the better; failing with "we defaulted to strict, but you can run
       again with this other option if you want it to work" seems unkind,
       only making sense if we thought fallback to cp1252 would be wrong in
       a substantial proportion of cases...
     * Is it OK to duplicate the bulk of the testing code across
       t9835-git-p4-metadata-encoding-python2.sh and
       t9836-git-p4-metadata-encoding-python3.sh?
     * Is it OK to explicitly call "git-p4.py" in tests, rather than the
       build output "git-p4", in order to be able to select the python
       runtime on a per-test basis? Is there a better approach?
     * Is the naming of the strategies appropriate? Should the default
       python2 strategy be called something less opinionated, like
       "passthrough"?
    
    Changes wrt v1:
    
     * Added "and replace any remaining unmappable bytes" behavior to the
       "fallback" strategy; common reasonable encodings like cp1252 still
       contain unmapped codepoints, and if those are found, there is really
       nothing that can be done about it other than ignoring the crazy
       bytes; this approach is consistent with the longstanding
       path-encoding-handling strategy.
     * Simplified error-handling accordingly
     * Cleaned up tests & commit messages slightly

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1206%2FTaoK%2Fgit-p4-encoding-strategy-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1206/TaoK/git-p4-encoding-strategy-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1206

Range-diff vs v1:

 1:  9d33aa125b0 ! 1:  6d227ad57ea [RFC] git-p4: improve encoding handling to support inconsistent encodings
     @@ Commit message
          Under python2, git-p4 "naively" writes the Perforce bytestream into git
          metadata (and does not set an "encoding" header on the commits); this
          means that any non-utf-8 byte sequences end up creating invalidly-encoded
     -    data in git.
     +    commit metadata in git.
      
          Under python3, git-p4 attempts to decode the Perforce bytestream as utf-8
          data, and fails badly (with an unhelpful error) when non-utf-8 data is
          encountered.
      
     -    Perforce clients (esp. p4v) encourage user entry of changelist
     +    Perforce clients (especially p4v) encourage user entry of changelist
          descriptions (and user full names) in OS-local encoding, and store the
          resulting bytestream to the server unmodified - such that different
          clients can end up creating mutually-unintelligible messages. The most
     @@ Commit message
          - 'legacy', behaving as previously under python2,
          - 'strict', behaving as previously under python3, and
          - 'fallback', favoring utf-8 but supporting a secondary encoding when
     -    utf-8 decoding fails.
     +    utf-8 decoding fails, and finally replacing remaining unmappable bytes.
      
          Keep the python2 default behavior as-is ('legacy' strategy), but switch
          the python3 default strategy to 'fallback' with fallback encoding
     @@ Documentation/git-p4.txt: git-p4.pathEncoding::
      +	encoded as utf-8, and fails to import when this is not true.
      +	'fallback' attempts to interpret the data as utf-8, and otherwise
      +	falls back to using a secondary encoding - by default the common
     -+	windows encoding 'cp-1252'.
     ++	windows encoding 'cp-1252' - with any remaining unparseable bytes
     ++	replaced with a placeholder character.
      +	Under python2 the default strategy is 'legacy' for historical
      +	reasons, and under python3 the default is 'fallback'.
      +	When 'strict' is selected and decoding fails, the error message will
     @@ git-p4.py: else:
           def encode_text_stream(s):
               return s.encode('utf_8') if isinstance(s, unicode) else s
       
     ++
      +class MetadataDecodingException(Exception):
     -+    def __init__(self, input_string, fallback_encoding, fallback_error):
     ++    def __init__(self, input_string):
      +        self.input_string = input_string
     -+        self.fallback_encoding = fallback_encoding
     -+        self.fallback_error = fallback_error
      +
      +    def __str__(self):
     -+        error_message = """Decoding returned data failed!
     ++        return """Decoding perforce metadata failed!
      +The failing string was:
      +---
      +{}
     -+---""".format(self.input_string)
     -+
     -+        if not self.fallback_error:
     -+            error_message += """
     ++---
      +Consider setting the git-p4.metadataDecodingStrategy config option to
      +'fallback', to allow metadata to be decoded using a fallback encoding,
     -+defaulting to cp1252."""
     -+        else:
     -+            error_message += """
     -+The conversion failed while using the fallback encoding '{}';
     -+consider using a more forgiving one. Conversion error text:
     -+{}
     -+""".format(self.fallback_encoding, self.fallback_error)
     ++defaulting to cp1252.""".format(self.input_string)
      +
     -+        return error_message
      +
      +def metadata_stream_to_writable_bytes(s):
      +    encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy
     @@ git-p4.py: else:
      +        s.decode('utf_8')
      +        return s
      +    except UnicodeDecodeError:
     -+        fallback_error = None
      +        if encodingStrategy == 'fallback' and fallbackEncoding:
     -+            try:
     -+                return s.decode(fallbackEncoding).encode('utf_8')
     -+            except Exception as e:
     -+                fallback_error = e
     -+        raise MetadataDecodingException(s, fallbackEncoding, fallback_error)
     ++            return s.decode(fallbackEncoding, errors='replace').encode('utf_8')
     ++        raise MetadataDecodingException(s)
      +
       def decode_path(path):
           """Decode a given string (bytes or otherwise) using configured path encoding options
     @@ t/t9835-git-p4-metadata-encoding-python2.sh (new)
      +## SECTION REPEATED IN t9836 ##
      +###############################
      +
     -+# HORRIBLE HACK TO ENSURE PYTHON VERSION!
     -+# Weirdnesses:
     -+#  - Looking for python2 and python3 in a very specific path (/usr/bin/)
     -+#  - Code is inelegant
     -+#  - Code is duplicated (like most of this test script)
     -+#  - Test calls "git-p4.py" rather than "git-p4", because the latter references a specific path
     ++# Please note: this test calls "git-p4.py" rather than "git-p4", because the
     ++# latter references a specific path so we can't easily force it to run under
     ++# the python version we need to.
      +
      +python_major_version=$(python -V 2>&1 | cut -c  8)
     -+python_target_exists=$(/usr/bin/python$python_target_version -V 2>&1)
     -+if ! test "$python_major_version" = "$python_target_version" && test "$python_target_exists"
     ++python_target_binary=$(which python$python_target_version)
     ++if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary"
      +then
      +	mkdir temp_python
      +	PATH="$(pwd)/temp_python:$PATH" && export PATH
     -+	ln -s /usr/bin/python$python_target_version temp_python/python
     ++	ln -s $python_target_binary temp_python/python
      +fi
      +
      +python_major_version=$(python -V 2>&1 | cut -c  8)
     @@ t/t9835-git-p4-metadata-encoding-python2.sh (new)
      +	test_when_finished cleanup_git &&
      +	test_when_finished remove_user_cache &&
      +	test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err &&
     -+	grep "Decoding returned data failed!" err
     ++	grep "Decoding perforce metadata failed!" err
      +'
      +
      +test_expect_success 'check utf-8 contents with legacy strategy' '
     @@ t/t9836-git-p4-metadata-encoding-python3.sh (new)
      +## SECTION REPEATED IN t9835 ##
      +###############################
      +
     -+# HORRIBLE HACK TO ENSURE PYTHON VERSION!
     -+# Weirdnesses:
     -+#  - Looking for python2 and python3 in a very specific path (/usr/bin/)
     -+#  - Code is inelegant
     -+#  - Code is duplicated (like most of this test script)
     -+#  - Test calls "git-p4.py" rather than "git-p4", because the latter references a specific path
     ++# Please note: this test calls "git-p4.py" rather than "git-p4", because the
     ++# latter references a specific path so we can't easily force it to run under
     ++# the python version we need to.
      +
      +python_major_version=$(python -V 2>&1 | cut -c  8)
     -+python_target_exists=$(/usr/bin/python$python_target_version -V 2>&1)
     -+if ! test "$python_major_version" = "$python_target_version" && test "$python_target_exists"
     ++python_target_binary=$(which python$python_target_version)
     ++if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary"
      +then
      +	mkdir temp_python
      +	PATH="$(pwd)/temp_python:$PATH" && export PATH
     -+	ln -s /usr/bin/python$python_target_version temp_python/python
     ++	ln -s $python_target_binary temp_python/python
      +fi
      +
      +python_major_version=$(python -V 2>&1 | cut -c  8)
     @@ t/t9836-git-p4-metadata-encoding-python3.sh (new)
      +	test_when_finished cleanup_git &&
      +	test_when_finished remove_user_cache &&
      +	test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err &&
     -+	grep "Decoding returned data failed!" err
     ++	grep "Decoding perforce metadata failed!" err
      +'
      +
      +test_expect_success 'check utf-8 contents with legacy strategy' '


 Documentation/git-p4.txt                    |  37 +++-
 git-p4.py                                   |  89 ++++++++--
 t/lib-git-p4.sh                             |   3 +-
 t/t9835-git-p4-metadata-encoding-python2.sh | 182 +++++++++++++++++++
 t/t9836-git-p4-metadata-encoding-python3.sh | 183 ++++++++++++++++++++
 5 files changed, 476 insertions(+), 18 deletions(-)
 create mode 100755 t/t9835-git-p4-metadata-encoding-python2.sh
 create mode 100755 t/t9836-git-p4-metadata-encoding-python3.sh


base-commit: 11cfe552610386954886543f5de87dcc49ad5735

Comments

Ævar Arnfjörð Bjarmason April 13, 2022, 1:59 p.m. UTC | #1
On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:

> Under python2, git-p4 "naively" writes the Perforce bytestream into git
> metadata (and does not set an "encoding" header on the commits); this
> means that any non-utf-8 byte sequences end up creating invalidly-encoded
> commit metadata in git.

If it doesn't have an "encoding" header isn't any sequence of bytes OK
with git, so how does it create invalid metadata in git?

Do you mean that something on the Python side gets confused and doesn't
correctly encode it in that case, or that it's e.g. valid UTF-8 but
we're lacking the metadata?
Tao Klerks April 13, 2022, 3:18 p.m. UTC | #2
On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>
> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
> > metadata (and does not set an "encoding" header on the commits); this
> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
> > commit metadata in git.
>
> If it doesn't have an "encoding" header isn't any sequence of bytes OK
> with git, so how does it create invalid metadata in git?

Just because git allows you to shove any sequence of bytes into a
commit header, doesn't mean the resulting text is "valid" metadata
text for all or most purposes. The correct way to encode text in git
commit metadata is utf-8 (OR tell any readers of this data that it's
something other than utf-8 via the encoding header) - it's just that
git itself, the official client, is tolerant of bad byte sequences.
Other clients are less tolerant. "Sublime Merge", for example, will
fail to display the commit text at all in some contexts if the bytes
are not valid utf-8 (or noted as being something else).

>
> Do you mean that something on the Python side gets confused and doesn't
> correctly encode it in that case, or that it's e.g. valid UTF-8 but
> we're lacking the metadata?

In git-p4 under python2, the bytes are simply copied from the perforce
commit metadata into the git commit metadata verbatim; if those bytes
happen to be valid utf-8, then they will be interpreted as such in git
and everything is great. If that is *not* the case, eg the bytes are
actually windows cp1252 (with bytes/characters in the x8a+ range),
then "git log" for example will output the raw bytes, and anything
looking at those bytes (a terminal, or a process that called git) will
get those unexpected bytes, and need to deal accordingly. A terminal
will probably display "unprintable character" glyphs, python3 will
blow up by default, python 2 will be perfectly happy by default, etc.

I summarize this "non-utf-8 bytes in a git commit message without a
qualifying 'encoding' header" situation as "invalidly-encoded commit
metadata in git", due to the impact on downstream consumers of git
metadata. Is there a better characterization?

Thanks,
Tao
Ævar Arnfjörð Bjarmason April 13, 2022, 6:52 p.m. UTC | #3
On Wed, Apr 13 2022, Tao Klerks wrote:

> On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>>
>> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
>> > metadata (and does not set an "encoding" header on the commits); this
>> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
>> > commit metadata in git.
>>
>> If it doesn't have an "encoding" header isn't any sequence of bytes OK
>> with git, so how does it create invalid metadata in git?
>
> Just because git allows you to shove any sequence of bytes into a
> commit header, doesn't mean the resulting text is "valid" metadata
> text for all or most purposes. The correct way to encode text in git
> commit metadata is utf-8 (OR tell any readers of this data that it's
> something other than utf-8 via the encoding header) - it's just that
> git itself, the official client, is tolerant of bad byte sequences.
> Other clients are less tolerant. "Sublime Merge", for example, will
> fail to display the commit text at all in some contexts if the bytes
> are not valid utf-8 (or noted as being something else).
>
>>
>> Do you mean that something on the Python side gets confused and doesn't
>> correctly encode it in that case, or that it's e.g. valid UTF-8 but
>> we're lacking the metadata?
>
> In git-p4 under python2, the bytes are simply copied from the perforce
> commit metadata into the git commit metadata verbatim; if those bytes
> happen to be valid utf-8, then they will be interpreted as such in git
> and everything is great. If that is *not* the case, eg the bytes are
> actually windows cp1252 (with bytes/characters in the x8a+ range),
> then "git log" for example will output the raw bytes, and anything
> looking at those bytes (a terminal, or a process that called git) will
> get those unexpected bytes, and need to deal accordingly. A terminal
> will probably display "unprintable character" glyphs, python3 will
> blow up by default, python 2 will be perfectly happy by default, etc.
>
> I summarize this "non-utf-8 bytes in a git commit message without a
> qualifying 'encoding' header" situation as "invalidly-encoded commit
> metadata in git", due to the impact on downstream consumers of git
> metadata. Is there a better characterization?

I must admit that all this time I'd been missing the "Lack of this
header implies that the commit log message is encoded in UTF-8." part of
the docs added in 5dc7bcc2453 (Documentation: i18n commit log message
notes., 2006-12-30). I.e. I thought that "encoding"-header-less just
meant/implied "whatever".

But then again there wouldn't be much point in the encoding header if
its value is "utf8", so surely we'd want to use the lack of a header to
disambiguate utf8 v.s. non-utf8.

AFAICT we only allow selecting between encodings, not "no idea what this
is, but here's some raw sequence of bytes", except by omitting the
header.

It seems to me that between legacy/strict/fallback there's a 4th setting
missing here. I.e. a "try-encoding". One where if your data is valid
utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
you get an encoding header, but having tried that we'll just write
whatever raw data we found, but *not* munge it.

I haven't worked with p4, but having done some legacy SCM imports it was
nice to be able to map data 1=1 to git in those "odd encoding" cases, or
even cases where there was raw binary in a commit message or whatever...

Anyway, all of this is fine with me as-is, I just had a drive-by
question after some admittedly not very careful reading, sorry.
Andrew Oakley April 13, 2022, 8:41 p.m. UTC | #4
Thanks for doing this.  I've been meaning to write some similar code
for years and never quite got around to it.  So maybe my opinion
shouldn't be worth much :/.

On Wed, 13 Apr 2022 06:24:29 +0000
"Tao Klerks via GitGitGadget" <gitgitgadget@gmail.com> wrote:
> Make the changelist-description- and user-fullname-handling code
> python-runtime-agnostic, introducing three "strategies" selectable via
> config:
> - 'legacy', behaving as previously under python2,
> - 'strict', behaving as previously under python3, and
> - 'fallback', favoring utf-8 but supporting a secondary encoding when
> utf-8 decoding fails, and finally replacing remaining unmappable
> bytes.

I was thinking about making the config option be a list of encodings to
try.  So the options you've given map something like this:
- "legacy" -> "raw"
- "strict" -> "utf8"
- "fallback" -> "utf8 cp1252" (or whatever is configured)

This doesn't handle the case of using a replacement character, but in
reality I suspect that fallback encoding will always be a legacy 8bit
codec anyway.

I think what you've proposed is fine too, I'm not sure what would end
up being easier to understand.

>      * Does it make sense to make "fallback" the default decoding
> strategy in python3? This is definitely a change in behavior, but I
> believe for the better; failing with "we defaulted to strict, but you
> can run again with this other option if you want it to work" seems
> unkind, only making sense if we thought fallback to cp1252 would be
> wrong in a substantial proportion of cases...

The only issue I can see with changing the default is that it might
lead to a surprising loss of data for someone migrating to git.  Maybe
print a warning the first time git-p4 encounters something that can't be
decoded as UTF-8, but then continue with the fallback to cp1252?

>      * Is it OK to duplicate the bulk of the testing code across
>        t9835-git-p4-metadata-encoding-python2.sh and
>        t9836-git-p4-metadata-encoding-python3.sh?
>      * Is it OK to explicitly call "git-p4.py" in tests, rather than
> the build output "git-p4", in order to be able to select the python
>        runtime on a per-test basis? Is there a better approach?

I tried to find a nicer way to do this and failed.

>      * Is the naming of the strategies appropriate? Should the default
>        python2 strategy be called something less opinionated, like
>        "passthrough"?

I think that "passthrough" or "raw" would be more descriptive names.

The changes to git-p4 itself look good to me.  I think that dealing
with bytes more and strings less will be good going forward.
Tao Klerks April 14, 2022, 9:38 a.m. UTC | #5
On Wed, Apr 13, 2022 at 9:04 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> AFAICT we only allow selecting between encodings, not "no idea what this
> is, but here's some raw sequence of bytes", except by omitting the
> header.
>

I don't understand what you mean here. My reading of the docs is that
omitting the header *does not* imply "no idea what this is" - it
implies "this is utf-8".

Git will do the best it can (convert to utf-8 at output time if the
bytes are parseable in the specified or implied encoding, and return
the raw bytes otherwise), *regardless* of whether an encoding is
explicitly specified or utf-8 is implied.

> It seems to me that between legacy/strict/fallback there's a 4th setting
> missing here. I.e. a "try-encoding". One where if your data is valid
> utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
> you get an encoding header, but having tried that we'll just write
> whatever raw data we found, but *not* munge it.

For utf-8 specifically, the "legacy" strategy achieves this effect
with less overhead: It copies the bytes over raw, and in git the
implied encoding is utf-8. So if the bytes were utf-8 in the first
place, then they're good (we didn't need to munge them in any way),
and if they weren't utf-8 we copied them over anyway, which is the
"tried and failed" behavior you propose also.

For cp1252 or another encoding, the behavior you propose is an
enhancement/modification of the current "fallback" behavior, I guess:
Currently if the fallback encoding doesn't wash, any remaining "bad
bytes" get replaced out of existence. This leads to data loss of those
individual bytes, and it also means that if the data really wasn't
cp1252 in the first place, you might just have garbled up the data
something awful. Of course, you might have done that without any
errors anyway - cp1252 only leaves 4 unmapped bytes, so most arbitrary
text data will be successfully "interpreted", no matter how
erroneously.

The advantage of the current "replace" behavior is that you *always*
end up with valid utf-8 data in your commit text. The disadvantage is
that you can suffer (minor) unrecoverable data loss.

I think your proposal is that (optionally?) the fallback-or-raw
behavior would simply spit out / leave the original bytes in this
situation, not making any attempt to interpret them or convert them to
utf-8, but simply bring them to git as-is. This would make that
particular commit message "invalidly encoded", in the sense that any
git client or git-caller would need to "do what it can" with that
sequence of bytes.

This tradeoff between avoidance of data loss, and type/encoding
coherence, is not one I'm comfortable deciding on. I would ideally
prefer a third route, where the data *is* interpreted and converted,
but in a fully reversible way.

What would you think of a scheme where, *if* the fallback encoding
fails to decode successfully, we simply take all x80+ bytes and escape
them to a form like "\x8c", so a commit message might end up like
"solve the problem with the japanese character: \8f\c4."? Would this
way of preserving the bytes, without breaking out of having a known
(utf-8) encoding in the git commits, make sense to you? We could even
add a suffix to the message like "[original data contained bytes that
could not be mapped in targeted encoding cp1252; bytes at or over x80
were escaped as \xNN, and backslashes were escaped as \\ to avoid
ambiguity]", or something less horrendously verbose :)

>
> I haven't worked with p4, but having done some legacy SCM imports it was
> nice to be able to map data 1=1 to git in those "odd encoding" cases, or
> even cases where there was raw binary in a commit message or whatever...
>

My problem here, again, is that these "badly encoded commit messages"
endure forever in your commit history: any tool that wants to parse
your history will have to deal with them, skirt around them, etc. That
just seems like bad discipline. If the intent is to ensure that you
*can* reconstruct those bytes if/when you need to, we should find a
way to store them safely, in a way that won't randomly trip others up.

> Anyway, all of this is fine with me as-is, I just had a drive-by
> question after some admittedly not very careful reading, sorry.

Thanks for looking into it!
Tao Klerks April 14, 2022, 9:57 a.m. UTC | #6
On Wed, Apr 13, 2022 at 10:41 PM Andrew Oakley <andrew@adoakley.name> wrote:
>
> On Wed, 13 Apr 2022 06:24:29 +0000
> "Tao Klerks via GitGitGadget" <gitgitgadget@gmail.com> wrote:
> > Make the changelist-description- and user-fullname-handling code
> > python-runtime-agnostic, introducing three "strategies" selectable via
> > config:
> > - 'legacy', behaving as previously under python2,
> > - 'strict', behaving as previously under python3, and
> > - 'fallback', favoring utf-8 but supporting a secondary encoding when
> > utf-8 decoding fails, and finally replacing remaining unmappable
> > bytes.
>
> I was thinking about making the config option be a list of encodings to
> try.  So the options you've given map something like this:
> - "legacy" -> "raw"
> - "strict" -> "utf8"
> - "fallback" -> "utf8 cp1252" (or whatever is configured)
>
> This doesn't handle the case of using a replacement character, but in
> reality I suspect that fallback encoding will always be a legacy 8bit
> codec anyway.
>
> I think what you've proposed is fine too, I'm not sure what would end
> up being easier to understand.

I'm not sure I understand the proposal... Specifically, I don't
understand how "raw" would work in that scheme.

In "my" current scheme, there is a big difference between "legacy" and
the other two strategies: the legacy strategy reads "raw", but also
*writes* "raw".

The other schemes read whatever encoding, and then write utf-8. In the
case of strict, that actually works out the same as "raw", as long as
the input bytes were valid utf-8 (and otherwise nothing happens
anyway). In the case of fallback, that's a completely different
behavior to legacy's read-raw write-raw.

Is your proposal to independently specify the read encodings *and* the
write encoding, as separate parameters? That was actually my original
approach, but it turned out to, in my opinion, be harder to understand
(and implement :) )

>
> >      * Does it make sense to make "fallback" the default decoding
> > strategy in python3? This is definitely a change in behavior, but I
> > believe for the better; failing with "we defaulted to strict, but you
> > can run again with this other option if you want it to work" seems
> > unkind, only making sense if we thought fallback to cp1252 would be
> > wrong in a substantial proportion of cases...
>
> The only issue I can see with changing the default is that it might
> lead to a surprising loss of data for someone migrating to git.  Maybe
> print a warning the first time git-p4 encounters something that can't be
> decoded as UTF-8, but then continue with the fallback to cp1252?

Honestly, I'm not sure how much a warning does. In my perforce repo,
for example, any new warnings during the import would get drowned out
by the mac metadata ignoring warnings.

I understand and share the data loss concern.

As I just answered Ævar, I *think* I'd like to address the data loss
concern by escaping all x80+ bytes if something cannot be interpreted
even using the fallback encoding. In a commit message there could also
be a suffix explaining what happened, although I suspect that's
pointless complexity. The advantage of this approach is that it makes
it *possible* to reconstruct the original bytestream precisely, but
without creating badly-encoded git commit messages that need to be
skirted around.

>
> >      * Is it OK to duplicate the bulk of the testing code across
> >        t9835-git-p4-metadata-encoding-python2.sh and
> >        t9836-git-p4-metadata-encoding-python3.sh?
> >      * Is it OK to explicitly call "git-p4.py" in tests, rather than
> > the build output "git-p4", in order to be able to select the python
> >        runtime on a per-test basis? Is there a better approach?
>
> I tried to find a nicer way to do this and failed.

OK thx.

>
> >      * Is the naming of the strategies appropriate? Should the default
> >        python2 strategy be called something less opinionated, like
> >        "passthrough"?
>
> I think that "passthrough" or "raw" would be more descriptive names.
>

OK thx, I'll take "passthrough" as it feels slightly less positive
than "raw", for some reason that I can't put my finger on :)

> The changes to git-p4 itself look good to me.  I think that dealing
> with bytes more and strings less will be good going forward.

Thx for your feedback!
Andrew Oakley April 17, 2022, 6:11 p.m. UTC | #7
On 14/04/2022 10:57, Tao Klerks wrote:
> On Wed, Apr 13, 2022 at 10:41 PM Andrew Oakley <andrew@adoakley.name> wrote:
>>
>> On Wed, 13 Apr 2022 06:24:29 +0000
>> "Tao Klerks via GitGitGadget" <gitgitgadget@gmail.com> wrote:
>>> Make the changelist-description- and user-fullname-handling code
>>> python-runtime-agnostic, introducing three "strategies" selectable via
>>> config:
>>> - 'legacy', behaving as previously under python2,
>>> - 'strict', behaving as previously under python3, and
>>> - 'fallback', favoring utf-8 but supporting a secondary encoding when
>>> utf-8 decoding fails, and finally replacing remaining unmappable
>>> bytes.
>>
>> I was thinking about making the config option be a list of encodings to
>> try.  So the options you've given map something like this:
>> - "legacy" -> "raw"
>> - "strict" -> "utf8"
>> - "fallback" -> "utf8 cp1252" (or whatever is configured)
>>
>> This doesn't handle the case of using a replacement character, but in
>> reality I suspect that fallback encoding will always be a legacy 8bit
>> codec anyway.
>>
>> I think what you've proposed is fine too, I'm not sure what would end
>> up being easier to understand.
> 
> I'm not sure I understand the proposal... Specifically, I don't
> understand how "raw" would work in that scheme.
> 
> In "my" current scheme, there is a big difference between "legacy" and
> the other two strategies: the legacy strategy reads "raw", but also
> *writes* "raw".
> 
> The other schemes read whatever encoding, and then write utf-8. In the
> case of strict, that actually works out the same as "raw", as long as
> the input bytes were valid utf-8 (and otherwise nothing happens
> anyway). In the case of fallback, that's a completely different
> behavior to legacy's read-raw write-raw.

The way I look at it is that you both read and write bytes, and you may 
attempt to decode and re-encode text on the way.  Both the decoding and 
the encoding are done in metadata_stream_to_writable_bytes, so nothing 
else needs to know about the raw option being different.

Perhaps it's easier to explain with a bit of (untested) code.  I was 
thinking of something like this:

def metadata_stream_to_writable_bytes(s):
     if not isinstance(s, bytes):
         return s.encode('utf-8')

     for encoding in gitConfigList('git-p4.metadataEncoding') or 
['utf-8', 'cp1252']:
         if encoding == 'passthrough':
             return s  # do not try to correct text encoding
         else:
             try:
                 return s.decode(encoding).encode('utf-8')
             except UnicodeDecodeError:
                 pass  # try the next configured encoding

     raise MetadataDecodingException(s)
> Is your proposal to independently specify the read encodings *and* the
> write encoding, as separate parameters? That was actually my original
> approach, but it turned out to, in my opinion, be harder to understand
> (and implement :) )

I don't think it's important to be able specify the encoding to be used 
in git.  I've not spotted anyone asking for that feature.  I think it 
could be added later if someone needs it.

>>>       * Does it make sense to make "fallback" the default decoding
>>> strategy in python3? This is definitely a change in behavior, but I
>>> believe for the better; failing with "we defaulted to strict, but you
>>> can run again with this other option if you want it to work" seems
>>> unkind, only making sense if we thought fallback to cp1252 would be
>>> wrong in a substantial proportion of cases...
>>
>> The only issue I can see with changing the default is that it might
>> lead to a surprising loss of data for someone migrating to git.  Maybe
>> print a warning the first time git-p4 encounters something that can't be
>> decoded as UTF-8, but then continue with the fallback to cp1252?
> 
> Honestly, I'm not sure how much a warning does. In my perforce repo,
> for example, any new warnings during the import would get drowned out
> by the mac metadata ignoring warnings.

FWIW in the perforce repository I work with this doesn't happen much and 
I would notice the additional warning about text encodings.  I suspect 
this will be another thing which varies a lot between repositories.

> I understand and share the data loss concern.
> 
> As I just answered Ævar, I *think* I'd like to address the data loss
> concern by escaping all x80+ bytes if something cannot be interpreted
> even using the fallback encoding. In a commit message there could also
> be a suffix explaining what happened, although I suspect that's
> pointless complexity. The advantage of this approach is that it makes
> it *possible* to reconstruct the original bytestream precisely, but
> without creating badly-encoded git commit messages that need to be
> skirted around.

I think this gets pretty messy though.  In my opinion it's not any nicer 
than putting the raw bytes in the commit message.

Git does not make any attempt enforce the commit metadata encoding, so I 
think that tools really should make an attempt to handle invalid data in 
a somewhat sensible fashion.

I don't think there is really a "right" answer, anything reasonable 
would be better than what we've got now.
Tao Klerks April 19, 2022, 8:30 p.m. UTC | #8
On Sun, Apr 17, 2022 at 8:17 PM Andrew Oakley <andrew@adoakley.name> wrote:
>
>
> The way I look at it is that you both read and write bytes, and you may
> attempt to decode and re-encode text on the way.  Both the decoding and
> the encoding are done in metadata_stream_to_writable_bytes, so nothing
> else needs to know about the raw option being different.
>

Right - personally I just believe making the distinction explicit as
"strategies" makes for a less magical explanation than a special
encoding value that's not just a different encoding but also a
different behavior.

In other aspects, the behavior you're proposing (except for the final
fallback-decoding-failure) seems to be equivalent to what I've
implemented in the latest version.

>
> > I understand and share the data loss concern.
> >
> > As I just answered Ævar, I *think* I'd like to address the data loss
> > concern by escaping all x80+ bytes if something cannot be interpreted
> > even using the fallback encoding. In a commit message there could also
> > be a suffix explaining what happened, although I suspect that's
> > pointless complexity. The advantage of this approach is that it makes
> > it *possible* to reconstruct the original bytestream precisely, but
> > without creating badly-encoded git commit messages that need to be
> > skirted around.
>
> I think this gets pretty messy though.  In my opinion it's not any nicer
> than putting the raw bytes in the commit message.
>
> Git does not make any attempt enforce the commit metadata encoding, so I
> think that tools really should make an attempt to handle invalid data in
> a somewhat sensible fashion.
>
> I don't think there is really a "right" answer, anything reasonable
> would be better than what we've got now.

Alright - I went ahead with the "escape if you can't do it right"
behavior anyway, because it makes me feel better about being able to
say "no information loss" :)
diff mbox series

Patch

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index e21fcd8f712..e21bc6a5e37 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -636,7 +636,42 @@  git-p4.pathEncoding::
 	Git expects paths encoded as UTF-8. Use this config to tell git-p4
 	what encoding Perforce had used for the paths. This encoding is used
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
-	often uses "cp1252" to encode path names.
+	often uses "cp1252" to encode path names. If this option is passed
+	into a p4 clone request, it is persisted in the resulting new git
+	repo.
+
+git-p4.metadataDecodingStrategy::
+	Perforce keeps the encoding of a changelist descriptions and user
+	full names as stored by the client on a given OS. The p4v client
+	uses the OS-local encosing, and so different users can end up storing
+	different changelist descriptions or user full names in different
+	encodings, in the same depot.
+	Git tolerates inconsistent/incorrect encodings in commit messages
+	and author names, but expects them to be specified in utf-8.
+	git-p4 can use three different decoding strategies in handling the
+	encoding uncertainty in Perforce: 'legacy' simply passes the original
+	bytes through from Perforce to git, creating usable but
+	incorrectly-encoded data when the Perforce data is encoded as
+	anything other than utf-8. 'strict' expects the Perforce data to be
+	encoded as utf-8, and fails to import when this is not true.
+	'fallback' attempts to interpret the data as utf-8, and otherwise
+	falls back to using a secondary encoding - by default the common
+	windows encoding 'cp-1252' - with any remaining unparseable bytes
+	replaced with a placeholder character.
+	Under python2 the default strategy is 'legacy' for historical
+	reasons, and under python3 the default is 'fallback'.
+	When 'strict' is selected and decoding fails, the error message will
+	propose changing this config parameter as a workaround. If this
+	option is passed into a p4 clone request, it is persisted into the
+	resulting new git repo.
+
+git-p4.metadataFallbackEncoding::
+	Specify the fallback encoding to use when decoding Perforce author
+	names and changelists descriptions using the 'fallback' strategy
+	(see git-p4.metadataDecodingStrategy). The fallback encoding will
+	only be used when decoding as utf-8 fails. This option defaults to
+	cp1252, a common windows encoding. If this option is passed into a
+	p4 clone request, it is persisted into the resulting new git repo.
 
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
diff --git a/git-p4.py b/git-p4.py
index a9b1f904410..c5e74aaa515 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -54,6 +54,9 @@  defaultLabelRegexp = r'[a-zA-Z0-9_\-.]+$'
 # The block size is reduced automatically if required
 defaultBlockSize = 1<<20
 
+defaultMetadataDecodingStrategy = 'legacy' if sys.version_info.major == 2 else 'fallback'
+defaultFallbackMetadataEncoding = 'cp1252'
+
 p4_access_checked = False
 
 re_ko_keywords = re.compile(br'\$(Id|Header)(:[^$\n]+)?\$')
@@ -203,6 +206,37 @@  else:
     def encode_text_stream(s):
         return s.encode('utf_8') if isinstance(s, unicode) else s
 
+
+class MetadataDecodingException(Exception):
+    def __init__(self, input_string):
+        self.input_string = input_string
+
+    def __str__(self):
+        return """Decoding perforce metadata failed!
+The failing string was:
+---
+{}
+---
+Consider setting the git-p4.metadataDecodingStrategy config option to
+'fallback', to allow metadata to be decoded using a fallback encoding,
+defaulting to cp1252.""".format(self.input_string)
+
+
+def metadata_stream_to_writable_bytes(s):
+    encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy
+    fallbackEncoding = gitConfig('git-p4.metadataFallbackEncoding') or defaultFallbackMetadataEncoding
+    if not isinstance(s, bytes):
+        return s.encode('utf_8')
+    if encodingStrategy == 'legacy':
+        return s
+    try:
+        s.decode('utf_8')
+        return s
+    except UnicodeDecodeError:
+        if encodingStrategy == 'fallback' and fallbackEncoding:
+            return s.decode(fallbackEncoding, errors='replace').encode('utf_8')
+        raise MetadataDecodingException(s)
+
 def decode_path(path):
     """Decode a given string (bytes or otherwise) using configured path encoding options
     """
@@ -702,11 +736,12 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             if bytes is not str:
                 # Decode unmarshalled dict to use str keys and values, except for:
                 #   - `data` which may contain arbitrary binary data
-                #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
+                #   - `desc` or `FullName` which may contain non-UTF8 encoded text handled below, eagerly converted to bytes
+                #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text, handled by decode_path()
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
+                    if isinstance(value, bytes) and not (key in ('data', 'desc', 'FullName', 'path', 'clientFile') or key.startswith('depotFile')):
                         value = value.decode()
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
@@ -716,6 +751,10 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
                     continue
+            if 'desc' in entry:
+                entry['desc'] = metadata_stream_to_writable_bytes(entry['desc'])
+            if 'FullName' in entry:
+                entry['FullName'] = metadata_stream_to_writable_bytes(entry['FullName'])
             if cb is not None:
                 cb(entry)
             else:
@@ -1435,7 +1474,13 @@  class P4UserMap:
         for output in p4CmdList(["users"]):
             if "User" not in output:
                 continue
-            self.users[output["User"]] = output["FullName"] + " <" + output["Email"] + ">"
+            # "FullName" is bytes. "Email" on the other hand might be bytes
+            # or unicode string depending on whether we are running under
+            # python2 or python3. To support
+            # git-p4.metadataDecodingStrategy=legacy, self.users dict values
+            # are always bytes, ready to be written to git.
+            emailbytes = metadata_stream_to_writable_bytes(output["Email"])
+            self.users[output["User"]] = output["FullName"] + b" <" + emailbytes + b">"
             self.emails[output["Email"]] = output["User"]
 
         mapUserConfigRegex = re.compile(r"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE)
@@ -1445,26 +1490,28 @@  class P4UserMap:
                 user = mapUser[0][0]
                 fullname = mapUser[0][1]
                 email = mapUser[0][2]
-                self.users[user] = fullname + " <" + email + ">"
+                fulluser = fullname + " <" + email + ">"
+                self.users[user] = metadata_stream_to_writable_bytes(fulluser)
                 self.emails[email] = user
 
-        s = ''
+        s = b''
         for (key, val) in self.users.items():
-            s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1))
+            keybytes = metadata_stream_to_writable_bytes(key)
+            s += b"%s\t%s\n" % (keybytes.expandtabs(1), val.expandtabs(1))
 
-        open(self.getUserCacheFilename(), 'w').write(s)
+        open(self.getUserCacheFilename(), 'wb').write(s)
         self.userMapFromPerforceServer = True
 
     def loadUserMapFromCache(self):
         self.users = {}
         self.userMapFromPerforceServer = False
         try:
-            cache = open(self.getUserCacheFilename(), 'r')
+            cache = open(self.getUserCacheFilename(), 'rb')
             lines = cache.readlines()
             cache.close()
             for line in lines:
-                entry = line.strip().split("\t")
-                self.users[entry[0]] = entry[1]
+                entry = line.strip().split(b"\t")
+                self.users[entry[0].decode('utf_8')] = entry[1]
         except IOError:
             self.getUserMapFromPerforceServer()
 
@@ -3020,7 +3067,8 @@  class P4Sync(Command, P4UserMap):
         if userid in self.users:
             return self.users[userid]
         else:
-            return "%s <a@b>" % userid
+            userid_bytes = metadata_stream_to_writable_bytes(userid)
+            return b"%s <a@b>" % userid_bytes
 
     def streamTag(self, gitStream, labelName, labelDetails, commit, epoch):
         """ Stream a p4 tag.
@@ -3043,9 +3091,10 @@  class P4Sync(Command, P4UserMap):
             email = self.make_email(owner)
         else:
             email = self.make_email(self.p4UserId())
-        tagger = "%s %s %s" % (email, epoch, self.tz)
 
-        gitStream.write("tagger %s\n" % tagger)
+        gitStream.write("tagger ")
+        gitStream.write(email)
+        gitStream.write(" %s %s\n" % (epoch, self.tz))
 
         print("labelDetails=",labelDetails)
         if 'Description' in labelDetails:
@@ -3138,12 +3187,12 @@  class P4Sync(Command, P4UserMap):
         self.gitStream.write("commit %s\n" % branch)
         self.gitStream.write("mark :%s\n" % details["change"])
         self.committedChanges.add(int(details["change"]))
-        committer = ""
         if author not in self.users:
             self.getUserMapFromPerforceServer()
-        committer = "%s %s %s" % (self.make_email(author), epoch, self.tz)
 
-        self.gitStream.write("committer %s\n" % committer)
+        self.gitStream.write("committer ")
+        self.gitStream.write(self.make_email(author))
+        self.gitStream.write(" %s %s\n" % (epoch, self.tz))
 
         self.gitStream.write("data <<EOT\n")
         self.gitStream.write(details["desc"])
@@ -4055,6 +4104,14 @@  class P4Clone(P4Sync):
         if self.useClientSpec_from_options:
             system(["git", "config", "--bool", "git-p4.useclientspec", "true"])
 
+        # persist any git-p4 encoding-handling config options passed in for clone:
+        if gitConfig('git-p4.metadataDecodingStrategy'):
+            system(["git", "config", "git-p4.metadataDecodingStrategy", gitConfig('git-p4.metadataDecodingStrategy')])
+        if gitConfig('git-p4.metadataFallbackEncoding'):
+            system(["git", "config", "git-p4.metadataFallbackEncoding", gitConfig('git-p4.metadataFallbackEncoding')])
+        if gitConfig('git-p4.pathEncoding'):
+            system(["git", "config", "git-p4.pathEncoding", gitConfig('git-p4.pathEncoding')])
+
         return True
 
 class P4Unshelve(Command):
diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh
index 5aff2abe8b5..2a5b8738ea3 100644
--- a/t/lib-git-p4.sh
+++ b/t/lib-git-p4.sh
@@ -142,10 +142,11 @@  start_p4d () {
 
 p4_add_user () {
 	name=$1 &&
+	fullname="${2:-Dr. $1}"
 	p4 user -f -i <<-EOF
 	User: $name
 	Email: $name@example.com
-	FullName: Dr. $name
+	FullName: $fullname
 	EOF
 }
 
diff --git a/t/t9835-git-p4-metadata-encoding-python2.sh b/t/t9835-git-p4-metadata-encoding-python2.sh
new file mode 100755
index 00000000000..724eaee9cf4
--- /dev/null
+++ b/t/t9835-git-p4-metadata-encoding-python2.sh
@@ -0,0 +1,182 @@ 
+#!/bin/sh
+
+test_description='git p4 metadata encoding
+
+This test checks that the import process handles inconsistent text
+encoding in p4 metadata (author names, commit messages, etc) without
+failing, and produces maximally sane output in git.'
+
+. ./lib-git-p4.sh
+
+python_target_version='2'
+
+###############################
+## SECTION REPEATED IN t9836 ##
+###############################
+
+# Please note: this test calls "git-p4.py" rather than "git-p4", because the
+# latter references a specific path so we can't easily force it to run under
+# the python version we need to.
+
+python_major_version=$(python -V 2>&1 | cut -c  8)
+python_target_binary=$(which python$python_target_version)
+if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary"
+then
+	mkdir temp_python
+	PATH="$(pwd)/temp_python:$PATH" && export PATH
+	ln -s $python_target_binary temp_python/python
+fi
+
+python_major_version=$(python -V 2>&1 | cut -c  8)
+if ! test "$python_major_version" = "$python_target_version"
+then
+	skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available"
+	test_done
+fi
+
+remove_user_cache () {
+	rm "$HOME/.gitp4-usercache.txt" || true
+}
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'init depot' '
+	(
+		cd "$cli" &&
+
+		p4_add_user "utf8_author" "ǣuthor" &&
+		P4USER=utf8_author &&
+		touch file1 &&
+		p4 add file1 &&
+		p4 submit -d "first CL has some utf-8 tǣxt" &&
+
+		p4_add_user "latin1_author" "$(echo æuthor |
+			iconv -f utf8 -t latin1)" &&
+		P4USER=latin1_author &&
+		touch file2 &&
+		p4 add file2 &&
+		p4 submit -d "$(echo second CL has some latin-1 tæxt |
+			iconv -f utf8 -t latin1)" &&
+
+		p4_add_user "cp1252_author" "$(echo æuthœr |
+			iconv -f utf8 -t cp1252)" &&
+		P4USER=cp1252_author &&
+		touch file3 &&
+		p4 add file3 &&
+		p4 submit -d "$(echo third CL has sœme cp-1252 tæxt |
+		  iconv -f utf8 -t cp1252)"
+	)
+'
+
+test_expect_success 'clone non-utf8 repo with strict encoding' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err &&
+	grep "Decoding perforce metadata failed!" err
+'
+
+test_expect_success 'check utf-8 contents with legacy strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some utf-8 tǣxt" actual &&
+		grep "ǣuthor" actual
+	)
+'
+
+test_expect_success 'check latin-1 contents corrupted in git with legacy strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) &&
+		grep "$badly_encoded_in_git" actual &&
+		bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" &&
+		grep "$bad_author_in_git" actual
+	)
+'
+
+test_expect_success 'check utf-8 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some utf-8 tǣxt" actual &&
+		grep "ǣuthor" actual
+	)
+'
+
+test_expect_success 'check latin-1 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some latin-1 tæxt" actual &&
+		grep "æuthor" actual
+	)
+'
+
+test_expect_success 'check cp-1252 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "sœme cp-1252 tæxt" actual &&
+		grep "æuthœr" actual
+	)
+'
+
+test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$cli" &&
+		P4USER=cp1252_author &&
+		touch file4 &&
+		p4 add file4 &&
+		p4 submit -d "$(echo fourth CL has sœme more cp-1252 tæxt |
+			iconv -f utf8 -t cp1252)"
+	) &&
+	(
+		cd "$git" &&
+
+		git p4.py sync --branch=master &&
+
+		git log p4/master >actual &&
+		cat actual &&
+		grep "sœme more cp-1252 tæxt" actual &&
+		grep "æuthœr" actual
+	)
+'
+
+############################
+## / END REPEATED SECTION ##
+############################
+
+test_expect_success 'legacy (latin-1 contents corrupted in git) is the default with python2' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) &&
+		grep "$badly_encoded_in_git" actual
+	)
+'
+
+test_done
diff --git a/t/t9836-git-p4-metadata-encoding-python3.sh b/t/t9836-git-p4-metadata-encoding-python3.sh
new file mode 100755
index 00000000000..6c8e8cce5f1
--- /dev/null
+++ b/t/t9836-git-p4-metadata-encoding-python3.sh
@@ -0,0 +1,183 @@ 
+#!/bin/sh
+
+test_description='git p4 metadata encoding
+
+This test checks that the import process handles inconsistent text
+encoding in p4 metadata (author names, commit messages, etc) without
+failing, and produces maximally sane output in git.'
+
+. ./lib-git-p4.sh
+
+python_target_version='3'
+
+###############################
+## SECTION REPEATED IN t9835 ##
+###############################
+
+# Please note: this test calls "git-p4.py" rather than "git-p4", because the
+# latter references a specific path so we can't easily force it to run under
+# the python version we need to.
+
+python_major_version=$(python -V 2>&1 | cut -c  8)
+python_target_binary=$(which python$python_target_version)
+if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary"
+then
+	mkdir temp_python
+	PATH="$(pwd)/temp_python:$PATH" && export PATH
+	ln -s $python_target_binary temp_python/python
+fi
+
+python_major_version=$(python -V 2>&1 | cut -c  8)
+if ! test "$python_major_version" = "$python_target_version"
+then
+	skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available"
+	test_done
+fi
+
+remove_user_cache () {
+	rm "$HOME/.gitp4-usercache.txt" || true
+}
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'init depot' '
+	(
+		cd "$cli" &&
+
+		p4_add_user "utf8_author" "ǣuthor" &&
+		P4USER=utf8_author &&
+		touch file1 &&
+		p4 add file1 &&
+		p4 submit -d "first CL has some utf-8 tǣxt" &&
+
+		p4_add_user "latin1_author" "$(echo æuthor |
+			iconv -f utf8 -t latin1)" &&
+		P4USER=latin1_author &&
+		touch file2 &&
+		p4 add file2 &&
+		p4 submit -d "$(echo second CL has some latin-1 tæxt |
+			iconv -f utf8 -t latin1)" &&
+
+		p4_add_user "cp1252_author" "$(echo æuthœr |
+			iconv -f utf8 -t cp1252)" &&
+		P4USER=cp1252_author &&
+		touch file3 &&
+		p4 add file3 &&
+		p4 submit -d "$(echo third CL has sœme cp-1252 tæxt |
+		  iconv -f utf8 -t cp1252)"
+	)
+'
+
+test_expect_success 'clone non-utf8 repo with strict encoding' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err &&
+	grep "Decoding perforce metadata failed!" err
+'
+
+test_expect_success 'check utf-8 contents with legacy strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some utf-8 tǣxt" actual &&
+		grep "ǣuthor" actual
+	)
+'
+
+test_expect_success 'check latin-1 contents corrupted in git with legacy strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) &&
+		grep "$badly_encoded_in_git" actual &&
+		bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" &&
+		grep "$bad_author_in_git" actual
+	)
+'
+
+test_expect_success 'check utf-8 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some utf-8 tǣxt" actual &&
+		grep "ǣuthor" actual
+	)
+'
+
+test_expect_success 'check latin-1 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "some latin-1 tæxt" actual &&
+		grep "æuthor" actual
+	)
+'
+
+test_expect_success 'check cp-1252 contents with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "sœme cp-1252 tæxt" actual &&
+		grep "æuthœr" actual
+	)
+'
+
+test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$cli" &&
+		P4USER=cp1252_author &&
+		touch file4 &&
+		p4 add file4 &&
+		p4 submit -d "$(echo fourth CL has sœme more cp-1252 tæxt |
+			iconv -f utf8 -t cp1252)"
+	) &&
+	(
+		cd "$git" &&
+
+		git p4.py sync --branch=master &&
+
+		git log p4/master >actual &&
+		cat actual &&
+		grep "sœme more cp-1252 tæxt" actual &&
+		grep "æuthœr" actual
+	)
+'
+
+############################
+## / END REPEATED SECTION ##
+############################
+
+
+test_expect_success 'fallback (both utf-8 and cp-1252 contents handled) is the default with python3' '
+	test_when_finished cleanup_git &&
+	test_when_finished remove_user_cache &&
+	git p4.py clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git log >actual &&
+		grep "sœme cp-1252 tæxt" actual &&
+		grep "æuthœr" actual
+	)
+'
+
+test_done