Message ID | pull.1206.v3.git.1650399590844.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v3] git-p4: improve encoding handling to support inconsistent encodings | expand |
My apologies, I messed up the cover letter, *and* the subject line, and now GitGitGadget won't let me submit the same commit again... The subject line *should* have read: Git p4 encoding strategy The cover letter (after "---") should have read: --- This is no longer RFC, it's now request-for-Review! Changes wrt v2: * Renamed "legacy" strategy to "passthrough", reflecting the possible value of maintaining it long-term * Changed "fallback decoding failure" behavior to escape over-127 bytes, instead of omitting them. There should now be no information loss under any scenario, although recovering the original bytes might be non-trivial On Tue, Apr 19, 2022 at 10:19 PM Tao Klerks via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Tao Klerks <tao@klerks.biz> > > git-p4 is designed to run correctly under python2.7 and python3, but > its functional behavior wrt importing user-entered text differs across > these environments: > > Under python2, git-p4 "naively" writes the Perforce bytestream into git > metadata (and does not set an "encoding" header on the commits); this > means that any non-utf-8 byte sequences end up creating invalidly-encoded > commit metadata in git. > > Under python3, git-p4 attempts to decode the Perforce bytestream as utf-8 > data, and fails badly (with an unhelpful error) when non-utf-8 data is > encountered. > > Perforce clients (especially p4v) encourage user entry of changelist > descriptions (and user full names) in OS-local encoding, and store the > resulting bytestream to the server unmodified - such that different > clients can end up creating mutually-unintelligible messages. The most > common inconsistency, in many Perforce environments, is likely to be utf-8 > (typical in linux) vs cp-1252 (typical in windows). > > Make the changelist-description- and user-fullname-handling code > python-runtime-agnostic, introducing three "strategies" selectable via > config: > - 'passthrough', behaving as previously under python2, > - 'strict', behaving as previously under python3, and > - 'fallback', favoring utf-8 but supporting a secondary encoding when > utf-8 decoding fails, and finally escaping high-range bytes if the > decoding with the secondary encoding also fails. > > Keep the python2 default behavior as-is ('legacy' strategy), but switch > the python3 default strategy to 'fallback' with default fallback encoding > 'cp1252'. > > Also include tests exercising these encoding strategies, documentation for > the new config, and improve the user-facing error messages when decoding > does fail. > > Signed-off-by: Tao Klerks <tao@klerks.biz> > --- > RFC: Git p4 encoding strategy > > OPEN QUESTIONS: > > * Does it make sense to make "fallback" the default decoding strategy > in python3? This is definitely a change in behavior, but I believe > for the better; failing with "we defaulted to strict, but you can run > again with this other option if you want it to work" seems unkind, > only making sense if we thought fallback to cp1252 would be wrong in > a substantial proportion of cases... > * Is it OK to duplicate the bulk of the testing code across > t9835-git-p4-metadata-encoding-python2.sh and > t9836-git-p4-metadata-encoding-python3.sh? > * Is it OK to explicitly call "git-p4.py" in tests, rather than the > build output "git-p4", in order to be able to select the python > runtime on a per-test basis? Is there a better approach? > * Is the naming of the strategies appropriate? Should the default > python2 strategy be called something less opinionated, like > "passthrough"? > > Changes wrt v1: > > * Added "and replace any remaining unmappable bytes" behavior to the > "fallback" strategy; common reasonable encodings like cp1252 still > contain unmapped codepoints, and if those are found, there is really > nothing that can be done about it other than ignoring the crazy > bytes; this approach is consistent with the longstanding > path-encoding-handling strategy. > * Simplified error-handling accordingly > * Cleaned up tests & commit messages slightly > > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1206%2FTaoK%2Fgit-p4-encoding-strategy-v3 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1206/TaoK/git-p4-encoding-strategy-v3 > Pull-Request: https://github.com/gitgitgadget/git/pull/1206 > > Range-diff vs v2: > > 1: 6d227ad57ea ! 1: 1d83b6d7865 [RFC] git-p4: improve encoding handling to support inconsistent encodings > @@ Metadata > Author: Tao Klerks <tao@klerks.biz> > > ## Commit message ## > - [RFC] git-p4: improve encoding handling to support inconsistent encodings > + git-p4: improve encoding handling to support inconsistent encodings > > git-p4 is designed to run correctly under python2.7 and python3, but > its functional behavior wrt importing user-entered text differs across > @@ Commit message > Make the changelist-description- and user-fullname-handling code > python-runtime-agnostic, introducing three "strategies" selectable via > config: > - - 'legacy', behaving as previously under python2, > + - 'passthrough', behaving as previously under python2, > - 'strict', behaving as previously under python3, and > - 'fallback', favoring utf-8 but supporting a secondary encoding when > - utf-8 decoding fails, and finally replacing remaining unmappable bytes. > + utf-8 decoding fails, and finally escaping high-range bytes if the > + decoding with the secondary encoding also fails. > > Keep the python2 default behavior as-is ('legacy' strategy), but switch > - the python3 default strategy to 'fallback' with fallback encoding > + the python3 default strategy to 'fallback' with default fallback encoding > 'cp1252'. > > Also include tests exercising these encoding strategies, documentation for > @@ Documentation/git-p4.txt: git-p4.pathEncoding:: > +git-p4.metadataDecodingStrategy:: > + Perforce keeps the encoding of a changelist descriptions and user > + full names as stored by the client on a given OS. The p4v client > -+ uses the OS-local encosing, and so different users can end up storing > ++ uses the OS-local encoding, and so different users can end up storing > + different changelist descriptions or user full names in different > + encodings, in the same depot. > + Git tolerates inconsistent/incorrect encodings in commit messages > + and author names, but expects them to be specified in utf-8. > + git-p4 can use three different decoding strategies in handling the > -+ encoding uncertainty in Perforce: 'legacy' simply passes the original > -+ bytes through from Perforce to git, creating usable but > ++ encoding uncertainty in Perforce: 'passthrough' simply passes the > ++ original bytes through from Perforce to git, creating usable but > + incorrectly-encoded data when the Perforce data is encoded as > + anything other than utf-8. 'strict' expects the Perforce data to be > + encoded as utf-8, and fails to import when this is not true. > + 'fallback' attempts to interpret the data as utf-8, and otherwise > + falls back to using a secondary encoding - by default the common > -+ windows encoding 'cp-1252' - with any remaining unparseable bytes > -+ replaced with a placeholder character. > -+ Under python2 the default strategy is 'legacy' for historical > ++ windows encoding 'cp-1252' - with upper-range bytes escaped if > ++ decoding with the fallback encoding also fails. > ++ Under python2 the default strategy is 'passthrough' for historical > + reasons, and under python3 the default is 'fallback'. > + When 'strict' is selected and decoding fails, the error message will > + propose changing this config parameter as a workaround. If this > @@ Documentation/git-p4.txt: git-p4.pathEncoding:: > Specify the system that is used for large (binary) files. Please note > > ## git-p4.py ## > +@@ > + # pylint: disable=too-many-statements,too-many-instance-attributes > + # pylint: disable=too-many-branches,too-many-nested-blocks > + # > ++import struct > + import sys > + if sys.version_info.major < 3 and sys.version_info.minor < 7: > + sys.stderr.write("git-p4: requires Python 2.7 or later.\n") > @@ git-p4.py: defaultLabelRegexp = r'[a-zA-Z0-9_\-.]+$' > # The block size is reduced automatically if required > defaultBlockSize = 1<<20 > > -+defaultMetadataDecodingStrategy = 'legacy' if sys.version_info.major == 2 else 'fallback' > ++defaultMetadataDecodingStrategy = 'passthrough' if sys.version_info.major == 2 else 'fallback' > +defaultFallbackMetadataEncoding = 'cp1252' > + > p4_access_checked = False > @@ git-p4.py: else: > +defaulting to cp1252.""".format(self.input_string) > + > + > ++encoding_fallback_warning_issued = False > ++encoding_escape_warning_issued = False > +def metadata_stream_to_writable_bytes(s): > + encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy > + fallbackEncoding = gitConfig('git-p4.metadataFallbackEncoding') or defaultFallbackMetadataEncoding > + if not isinstance(s, bytes): > + return s.encode('utf_8') > -+ if encodingStrategy == 'legacy': > ++ if encodingStrategy == 'passthrough': > + return s > + try: > + s.decode('utf_8') > + return s > + except UnicodeDecodeError: > + if encodingStrategy == 'fallback' and fallbackEncoding: > -+ return s.decode(fallbackEncoding, errors='replace').encode('utf_8') > ++ global encoding_fallback_warning_issued > ++ global encoding_escape_warning_issued > ++ try: > ++ if not encoding_fallback_warning_issued: > ++ print("\nCould not decode value as utf-8; using configured fallback encoding %s: %s" % (fallbackEncoding, s)) > ++ print("\n(this warning is only displayed once during an import)") > ++ encoding_fallback_warning_issued = True > ++ return s.decode(fallbackEncoding).encode('utf_8') > ++ except Exception as exc: > ++ if not encoding_escape_warning_issued: > ++ print("\nCould not decode value with configured fallback encoding %s; escaping bytes over 127: %s" % (fallbackEncoding, s)) > ++ print("\n(this warning is only displayed once during an import)") > ++ encoding_escape_warning_issued = True > ++ escaped_bytes = b'' > ++ # bytes and strings work very differently in python2 vs python3... > ++ if str is bytes: > ++ for byte in s: > ++ byte_number = struct.unpack('>B', byte)[0] > ++ if byte_number > 127: > ++ escaped_bytes += b'%' > ++ escaped_bytes += hex(byte_number)[2:].upper() > ++ else: > ++ escaped_bytes += byte > ++ else: > ++ for byte_number in s: > ++ if byte_number > 127: > ++ escaped_bytes += b'%' > ++ escaped_bytes += hex(byte_number).upper().encode()[2:] > ++ else: > ++ escaped_bytes += bytes([byte_number]) > ++ return escaped_bytes > ++ > + raise MetadataDecodingException(s) > + > def decode_path(path): > @@ git-p4.py: class P4UserMap: > + # "FullName" is bytes. "Email" on the other hand might be bytes > + # or unicode string depending on whether we are running under > + # python2 or python3. To support > -+ # git-p4.metadataDecodingStrategy=legacy, self.users dict values > ++ # git-p4.metadataDecodingStrategy=fallback, self.users dict values > + # are always bytes, ready to be written to git. > + emailbytes = metadata_stream_to_writable_bytes(output["Email"]) > + self.users[output["User"]] = output["FullName"] + b" <" + emailbytes + b">" > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + touch file3 && > + p4 add file3 && > + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | > -+ iconv -f utf8 -t cp1252)" > ++ iconv -f utf8 -t cp1252)" && > ++ > ++ p4_add_user "cp850_author" "$(echo Åuthor | > ++ iconv -f utf8 -t cp850)" && > ++ P4USER=cp850_author && > ++ touch file4 && > ++ p4 add file4 && > ++ p4 submit -d "$(echo fourth CL hÅs some cp850 text | > ++ iconv -f utf8 -t cp850)" > + ) > +' > + > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + grep "Decoding perforce metadata failed!" err > +' > + > -+test_expect_success 'check utf-8 contents with legacy strategy' ' > ++test_expect_success 'check utf-8 contents with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > -+ git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all && > ++ git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + ) > +' > + > -+test_expect_success 'check latin-1 contents corrupted in git with legacy strategy' ' > ++test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > -+ git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all && > ++ git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + ) > +' > + > ++test_expect_success 'check cp850 contents parsed with correct fallback' ' > ++ test_when_finished cleanup_git && > ++ test_when_finished remove_user_cache && > ++ git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && > ++ ( > ++ cd "$git" && > ++ git log >actual && > ++ grep "hÅs some cp850 text" actual && > ++ grep "Åuthor" actual > ++ ) > ++' > ++ > ++test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' > ++ test_when_finished cleanup_git && > ++ test_when_finished remove_user_cache && > ++ git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > ++ ( > ++ cd "$git" && > ++ git log >actual && > ++ grep "h%8Fs some cp850 text" actual && > ++ grep "%8Futhor" actual > ++ ) > ++' > ++ > +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + ( > + cd "$cli" && > + P4USER=cp1252_author && > -+ touch file4 && > -+ p4 add file4 && > -+ p4 submit -d "$(echo fourth CL has sœme more cp-1252 tæxt | > ++ touch file10 && > ++ p4 add file10 && > ++ p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" > + ) && > + ( > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > + git p4.py sync --branch=master && > + > + git log p4/master >actual && > -+ cat actual && > + grep "sœme more cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > @@ t/t9835-git-p4-metadata-encoding-python2.sh (new) > +## / END REPEATED SECTION ## > +############################ > + > -+test_expect_success 'legacy (latin-1 contents corrupted in git) is the default with python2' ' > ++test_expect_success 'passthrough (latin-1 contents corrupted in git) is the default with python2' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > -+ git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all && > ++ git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + touch file3 && > + p4 add file3 && > + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | > -+ iconv -f utf8 -t cp1252)" > ++ iconv -f utf8 -t cp1252)" && > ++ > ++ p4_add_user "cp850_author" "$(echo Åuthor | > ++ iconv -f utf8 -t cp850)" && > ++ P4USER=cp850_author && > ++ touch file4 && > ++ p4 add file4 && > ++ p4 submit -d "$(echo fourth CL hÅs some cp850 text | > ++ iconv -f utf8 -t cp850)" > + ) > +' > + > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + grep "Decoding perforce metadata failed!" err > +' > + > -+test_expect_success 'check utf-8 contents with legacy strategy' ' > ++test_expect_success 'check utf-8 contents with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > -+ git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all && > ++ git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + ) > +' > + > -+test_expect_success 'check latin-1 contents corrupted in git with legacy strategy' ' > ++test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > -+ git -c git-p4.metadataDecodingStrategy=legacy p4.py clone --dest="$git" //depot@all && > ++ git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + ) > +' > + > ++test_expect_success 'check cp850 contents parsed with correct fallback' ' > ++ test_when_finished cleanup_git && > ++ test_when_finished remove_user_cache && > ++ git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && > ++ ( > ++ cd "$git" && > ++ git log >actual && > ++ grep "hÅs some cp850 text" actual && > ++ grep "Åuthor" actual > ++ ) > ++' > ++ > ++test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' > ++ test_when_finished cleanup_git && > ++ test_when_finished remove_user_cache && > ++ git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > ++ ( > ++ cd "$git" && > ++ git log >actual && > ++ grep "h%8Fs some cp850 text" actual && > ++ grep "%8Futhor" actual > ++ ) > ++' > ++ > +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + ( > + cd "$cli" && > + P4USER=cp1252_author && > -+ touch file4 && > -+ p4 add file4 && > -+ p4 submit -d "$(echo fourth CL has sœme more cp-1252 tæxt | > ++ touch file10 && > ++ p4 add file10 && > ++ p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" > + ) && > + ( > @@ t/t9836-git-p4-metadata-encoding-python3.sh (new) > + git p4.py sync --branch=master && > + > + git log p4/master >actual && > -+ cat actual && > + grep "sœme more cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > > > Documentation/git-p4.txt | 37 +++- > git-p4.py | 123 +++++++++-- > t/lib-git-p4.sh | 3 +- > t/t9835-git-p4-metadata-encoding-python2.sh | 213 +++++++++++++++++++ > t/t9836-git-p4-metadata-encoding-python3.sh | 214 ++++++++++++++++++++ > 5 files changed, 572 insertions(+), 18 deletions(-) > create mode 100755 t/t9835-git-p4-metadata-encoding-python2.sh > create mode 100755 t/t9836-git-p4-metadata-encoding-python3.sh > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index e21fcd8f712..de5ee6748e3 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -636,7 +636,42 @@ git-p4.pathEncoding:: > Git expects paths encoded as UTF-8. Use this config to tell git-p4 > what encoding Perforce had used for the paths. This encoding is used > to transcode the paths to UTF-8. As an example, Perforce on Windows > - often uses "cp1252" to encode path names. > + often uses "cp1252" to encode path names. If this option is passed > + into a p4 clone request, it is persisted in the resulting new git > + repo. > + > +git-p4.metadataDecodingStrategy:: > + Perforce keeps the encoding of a changelist descriptions and user > + full names as stored by the client on a given OS. The p4v client > + uses the OS-local encoding, and so different users can end up storing > + different changelist descriptions or user full names in different > + encodings, in the same depot. > + Git tolerates inconsistent/incorrect encodings in commit messages > + and author names, but expects them to be specified in utf-8. > + git-p4 can use three different decoding strategies in handling the > + encoding uncertainty in Perforce: 'passthrough' simply passes the > + original bytes through from Perforce to git, creating usable but > + incorrectly-encoded data when the Perforce data is encoded as > + anything other than utf-8. 'strict' expects the Perforce data to be > + encoded as utf-8, and fails to import when this is not true. > + 'fallback' attempts to interpret the data as utf-8, and otherwise > + falls back to using a secondary encoding - by default the common > + windows encoding 'cp-1252' - with upper-range bytes escaped if > + decoding with the fallback encoding also fails. > + Under python2 the default strategy is 'passthrough' for historical > + reasons, and under python3 the default is 'fallback'. > + When 'strict' is selected and decoding fails, the error message will > + propose changing this config parameter as a workaround. If this > + option is passed into a p4 clone request, it is persisted into the > + resulting new git repo. > + > +git-p4.metadataFallbackEncoding:: > + Specify the fallback encoding to use when decoding Perforce author > + names and changelists descriptions using the 'fallback' strategy > + (see git-p4.metadataDecodingStrategy). The fallback encoding will > + only be used when decoding as utf-8 fails. This option defaults to > + cp1252, a common windows encoding. If this option is passed into a > + p4 clone request, it is persisted into the resulting new git repo. > > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > diff --git a/git-p4.py b/git-p4.py > index a9b1f904410..d24c3535f8a 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -15,6 +15,7 @@ > # pylint: disable=too-many-statements,too-many-instance-attributes > # pylint: disable=too-many-branches,too-many-nested-blocks > # > +import struct > import sys > if sys.version_info.major < 3 and sys.version_info.minor < 7: > sys.stderr.write("git-p4: requires Python 2.7 or later.\n") > @@ -54,6 +55,9 @@ defaultLabelRegexp = r'[a-zA-Z0-9_\-.]+$' > # The block size is reduced automatically if required > defaultBlockSize = 1<<20 > > +defaultMetadataDecodingStrategy = 'passthrough' if sys.version_info.major == 2 else 'fallback' > +defaultFallbackMetadataEncoding = 'cp1252' > + > p4_access_checked = False > > re_ko_keywords = re.compile(br'\$(Id|Header)(:[^$\n]+)?\$') > @@ -203,6 +207,70 @@ else: > def encode_text_stream(s): > return s.encode('utf_8') if isinstance(s, unicode) else s > > + > +class MetadataDecodingException(Exception): > + def __init__(self, input_string): > + self.input_string = input_string > + > + def __str__(self): > + return """Decoding perforce metadata failed! > +The failing string was: > +--- > +{} > +--- > +Consider setting the git-p4.metadataDecodingStrategy config option to > +'fallback', to allow metadata to be decoded using a fallback encoding, > +defaulting to cp1252.""".format(self.input_string) > + > + > +encoding_fallback_warning_issued = False > +encoding_escape_warning_issued = False > +def metadata_stream_to_writable_bytes(s): > + encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy > + fallbackEncoding = gitConfig('git-p4.metadataFallbackEncoding') or defaultFallbackMetadataEncoding > + if not isinstance(s, bytes): > + return s.encode('utf_8') > + if encodingStrategy == 'passthrough': > + return s > + try: > + s.decode('utf_8') > + return s > + except UnicodeDecodeError: > + if encodingStrategy == 'fallback' and fallbackEncoding: > + global encoding_fallback_warning_issued > + global encoding_escape_warning_issued > + try: > + if not encoding_fallback_warning_issued: > + print("\nCould not decode value as utf-8; using configured fallback encoding %s: %s" % (fallbackEncoding, s)) > + print("\n(this warning is only displayed once during an import)") > + encoding_fallback_warning_issued = True > + return s.decode(fallbackEncoding).encode('utf_8') > + except Exception as exc: > + if not encoding_escape_warning_issued: > + print("\nCould not decode value with configured fallback encoding %s; escaping bytes over 127: %s" % (fallbackEncoding, s)) > + print("\n(this warning is only displayed once during an import)") > + encoding_escape_warning_issued = True > + escaped_bytes = b'' > + # bytes and strings work very differently in python2 vs python3... > + if str is bytes: > + for byte in s: > + byte_number = struct.unpack('>B', byte)[0] > + if byte_number > 127: > + escaped_bytes += b'%' > + escaped_bytes += hex(byte_number)[2:].upper() > + else: > + escaped_bytes += byte > + else: > + for byte_number in s: > + if byte_number > 127: > + escaped_bytes += b'%' > + escaped_bytes += hex(byte_number).upper().encode()[2:] > + else: > + escaped_bytes += bytes([byte_number]) > + return escaped_bytes > + > + raise MetadataDecodingException(s) > + > def decode_path(path): > """Decode a given string (bytes or otherwise) using configured path encoding options > """ > @@ -702,11 +770,12 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > if bytes is not str: > # Decode unmarshalled dict to use str keys and values, except for: > # - `data` which may contain arbitrary binary data > - # - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text > + # - `desc` or `FullName` which may contain non-UTF8 encoded text handled below, eagerly converted to bytes > + # - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text, handled by decode_path() > decoded_entry = {} > for key, value in entry.items(): > key = key.decode() > - if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > + if isinstance(value, bytes) and not (key in ('data', 'desc', 'FullName', 'path', 'clientFile') or key.startswith('depotFile')): > value = value.decode() > decoded_entry[key] = value > # Parse out data if it's an error response > @@ -716,6 +785,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > if skip_info: > if 'code' in entry and entry['code'] == 'info': > continue > + if 'desc' in entry: > + entry['desc'] = metadata_stream_to_writable_bytes(entry['desc']) > + if 'FullName' in entry: > + entry['FullName'] = metadata_stream_to_writable_bytes(entry['FullName']) > if cb is not None: > cb(entry) > else: > @@ -1435,7 +1508,13 @@ class P4UserMap: > for output in p4CmdList(["users"]): > if "User" not in output: > continue > - self.users[output["User"]] = output["FullName"] + " <" + output["Email"] + ">" > + # "FullName" is bytes. "Email" on the other hand might be bytes > + # or unicode string depending on whether we are running under > + # python2 or python3. To support > + # git-p4.metadataDecodingStrategy=fallback, self.users dict values > + # are always bytes, ready to be written to git. > + emailbytes = metadata_stream_to_writable_bytes(output["Email"]) > + self.users[output["User"]] = output["FullName"] + b" <" + emailbytes + b">" > self.emails[output["Email"]] = output["User"] > > mapUserConfigRegex = re.compile(r"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE) > @@ -1445,26 +1524,28 @@ class P4UserMap: > user = mapUser[0][0] > fullname = mapUser[0][1] > email = mapUser[0][2] > - self.users[user] = fullname + " <" + email + ">" > + fulluser = fullname + " <" + email + ">" > + self.users[user] = metadata_stream_to_writable_bytes(fulluser) > self.emails[email] = user > > - s = '' > + s = b'' > for (key, val) in self.users.items(): > - s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1)) > + keybytes = metadata_stream_to_writable_bytes(key) > + s += b"%s\t%s\n" % (keybytes.expandtabs(1), val.expandtabs(1)) > > - open(self.getUserCacheFilename(), 'w').write(s) > + open(self.getUserCacheFilename(), 'wb').write(s) > self.userMapFromPerforceServer = True > > def loadUserMapFromCache(self): > self.users = {} > self.userMapFromPerforceServer = False > try: > - cache = open(self.getUserCacheFilename(), 'r') > + cache = open(self.getUserCacheFilename(), 'rb') > lines = cache.readlines() > cache.close() > for line in lines: > - entry = line.strip().split("\t") > - self.users[entry[0]] = entry[1] > + entry = line.strip().split(b"\t") > + self.users[entry[0].decode('utf_8')] = entry[1] > except IOError: > self.getUserMapFromPerforceServer() > > @@ -3020,7 +3101,8 @@ class P4Sync(Command, P4UserMap): > if userid in self.users: > return self.users[userid] > else: > - return "%s <a@b>" % userid > + userid_bytes = metadata_stream_to_writable_bytes(userid) > + return b"%s <a@b>" % userid_bytes > > def streamTag(self, gitStream, labelName, labelDetails, commit, epoch): > """ Stream a p4 tag. > @@ -3043,9 +3125,10 @@ class P4Sync(Command, P4UserMap): > email = self.make_email(owner) > else: > email = self.make_email(self.p4UserId()) > - tagger = "%s %s %s" % (email, epoch, self.tz) > > - gitStream.write("tagger %s\n" % tagger) > + gitStream.write("tagger ") > + gitStream.write(email) > + gitStream.write(" %s %s\n" % (epoch, self.tz)) > > print("labelDetails=",labelDetails) > if 'Description' in labelDetails: > @@ -3138,12 +3221,12 @@ class P4Sync(Command, P4UserMap): > self.gitStream.write("commit %s\n" % branch) > self.gitStream.write("mark :%s\n" % details["change"]) > self.committedChanges.add(int(details["change"])) > - committer = "" > if author not in self.users: > self.getUserMapFromPerforceServer() > - committer = "%s %s %s" % (self.make_email(author), epoch, self.tz) > > - self.gitStream.write("committer %s\n" % committer) > + self.gitStream.write("committer ") > + self.gitStream.write(self.make_email(author)) > + self.gitStream.write(" %s %s\n" % (epoch, self.tz)) > > self.gitStream.write("data <<EOT\n") > self.gitStream.write(details["desc"]) > @@ -4055,6 +4138,14 @@ class P4Clone(P4Sync): > if self.useClientSpec_from_options: > system(["git", "config", "--bool", "git-p4.useclientspec", "true"]) > > + # persist any git-p4 encoding-handling config options passed in for clone: > + if gitConfig('git-p4.metadataDecodingStrategy'): > + system(["git", "config", "git-p4.metadataDecodingStrategy", gitConfig('git-p4.metadataDecodingStrategy')]) > + if gitConfig('git-p4.metadataFallbackEncoding'): > + system(["git", "config", "git-p4.metadataFallbackEncoding", gitConfig('git-p4.metadataFallbackEncoding')]) > + if gitConfig('git-p4.pathEncoding'): > + system(["git", "config", "git-p4.pathEncoding", gitConfig('git-p4.pathEncoding')]) > + > return True > > class P4Unshelve(Command): > diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh > index 5aff2abe8b5..2a5b8738ea3 100644 > --- a/t/lib-git-p4.sh > +++ b/t/lib-git-p4.sh > @@ -142,10 +142,11 @@ start_p4d () { > > p4_add_user () { > name=$1 && > + fullname="${2:-Dr. $1}" > p4 user -f -i <<-EOF > User: $name > Email: $name@example.com > - FullName: Dr. $name > + FullName: $fullname > EOF > } > > diff --git a/t/t9835-git-p4-metadata-encoding-python2.sh b/t/t9835-git-p4-metadata-encoding-python2.sh > new file mode 100755 > index 00000000000..036bf79c667 > --- /dev/null > +++ b/t/t9835-git-p4-metadata-encoding-python2.sh > @@ -0,0 +1,213 @@ > +#!/bin/sh > + > +test_description='git p4 metadata encoding > + > +This test checks that the import process handles inconsistent text > +encoding in p4 metadata (author names, commit messages, etc) without > +failing, and produces maximally sane output in git.' > + > +. ./lib-git-p4.sh > + > +python_target_version='2' > + > +############################### > +## SECTION REPEATED IN t9836 ## > +############################### > + > +# Please note: this test calls "git-p4.py" rather than "git-p4", because the > +# latter references a specific path so we can't easily force it to run under > +# the python version we need to. > + > +python_major_version=$(python -V 2>&1 | cut -c 8) > +python_target_binary=$(which python$python_target_version) > +if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary" > +then > + mkdir temp_python > + PATH="$(pwd)/temp_python:$PATH" && export PATH > + ln -s $python_target_binary temp_python/python > +fi > + > +python_major_version=$(python -V 2>&1 | cut -c 8) > +if ! test "$python_major_version" = "$python_target_version" > +then > + skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available" > + test_done > +fi > + > +remove_user_cache () { > + rm "$HOME/.gitp4-usercache.txt" || true > +} > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'init depot' ' > + ( > + cd "$cli" && > + > + p4_add_user "utf8_author" "ǣuthor" && > + P4USER=utf8_author && > + touch file1 && > + p4 add file1 && > + p4 submit -d "first CL has some utf-8 tǣxt" && > + > + p4_add_user "latin1_author" "$(echo æuthor | > + iconv -f utf8 -t latin1)" && > + P4USER=latin1_author && > + touch file2 && > + p4 add file2 && > + p4 submit -d "$(echo second CL has some latin-1 tæxt | > + iconv -f utf8 -t latin1)" && > + > + p4_add_user "cp1252_author" "$(echo æuthœr | > + iconv -f utf8 -t cp1252)" && > + P4USER=cp1252_author && > + touch file3 && > + p4 add file3 && > + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" && > + > + p4_add_user "cp850_author" "$(echo Åuthor | > + iconv -f utf8 -t cp850)" && > + P4USER=cp850_author && > + touch file4 && > + p4 add file4 && > + p4 submit -d "$(echo fourth CL hÅs some cp850 text | > + iconv -f utf8 -t cp850)" > + ) > +' > + > +test_expect_success 'clone non-utf8 repo with strict encoding' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err && > + grep "Decoding perforce metadata failed!" err > +' > + > +test_expect_success 'check utf-8 contents with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some utf-8 tǣxt" actual && > + grep "ǣuthor" actual > + ) > +' > + > +test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && > + grep "$badly_encoded_in_git" actual && > + bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" && > + grep "$bad_author_in_git" actual > + ) > +' > + > +test_expect_success 'check utf-8 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some utf-8 tǣxt" actual && > + grep "ǣuthor" actual > + ) > +' > + > +test_expect_success 'check latin-1 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some latin-1 tæxt" actual && > + grep "æuthor" actual > + ) > +' > + > +test_expect_success 'check cp-1252 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "sœme cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > +' > + > +test_expect_success 'check cp850 contents parsed with correct fallback' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "hÅs some cp850 text" actual && > + grep "Åuthor" actual > + ) > +' > + > +test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "h%8Fs some cp850 text" actual && > + grep "%8Futhor" actual > + ) > +' > + > +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$cli" && > + P4USER=cp1252_author && > + touch file10 && > + p4 add file10 && > + p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" > + ) && > + ( > + cd "$git" && > + > + git p4.py sync --branch=master && > + > + git log p4/master >actual && > + grep "sœme more cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > +' > + > +############################ > +## / END REPEATED SECTION ## > +############################ > + > +test_expect_success 'passthrough (latin-1 contents corrupted in git) is the default with python2' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && > + grep "$badly_encoded_in_git" actual > + ) > +' > + > +test_done > diff --git a/t/t9836-git-p4-metadata-encoding-python3.sh b/t/t9836-git-p4-metadata-encoding-python3.sh > new file mode 100755 > index 00000000000..63350dc4b5c > --- /dev/null > +++ b/t/t9836-git-p4-metadata-encoding-python3.sh > @@ -0,0 +1,214 @@ > +#!/bin/sh > + > +test_description='git p4 metadata encoding > + > +This test checks that the import process handles inconsistent text > +encoding in p4 metadata (author names, commit messages, etc) without > +failing, and produces maximally sane output in git.' > + > +. ./lib-git-p4.sh > + > +python_target_version='3' > + > +############################### > +## SECTION REPEATED IN t9835 ## > +############################### > + > +# Please note: this test calls "git-p4.py" rather than "git-p4", because the > +# latter references a specific path so we can't easily force it to run under > +# the python version we need to. > + > +python_major_version=$(python -V 2>&1 | cut -c 8) > +python_target_binary=$(which python$python_target_version) > +if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary" > +then > + mkdir temp_python > + PATH="$(pwd)/temp_python:$PATH" && export PATH > + ln -s $python_target_binary temp_python/python > +fi > + > +python_major_version=$(python -V 2>&1 | cut -c 8) > +if ! test "$python_major_version" = "$python_target_version" > +then > + skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available" > + test_done > +fi > + > +remove_user_cache () { > + rm "$HOME/.gitp4-usercache.txt" || true > +} > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'init depot' ' > + ( > + cd "$cli" && > + > + p4_add_user "utf8_author" "ǣuthor" && > + P4USER=utf8_author && > + touch file1 && > + p4 add file1 && > + p4 submit -d "first CL has some utf-8 tǣxt" && > + > + p4_add_user "latin1_author" "$(echo æuthor | > + iconv -f utf8 -t latin1)" && > + P4USER=latin1_author && > + touch file2 && > + p4 add file2 && > + p4 submit -d "$(echo second CL has some latin-1 tæxt | > + iconv -f utf8 -t latin1)" && > + > + p4_add_user "cp1252_author" "$(echo æuthœr | > + iconv -f utf8 -t cp1252)" && > + P4USER=cp1252_author && > + touch file3 && > + p4 add file3 && > + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" && > + > + p4_add_user "cp850_author" "$(echo Åuthor | > + iconv -f utf8 -t cp850)" && > + P4USER=cp850_author && > + touch file4 && > + p4 add file4 && > + p4 submit -d "$(echo fourth CL hÅs some cp850 text | > + iconv -f utf8 -t cp850)" > + ) > +' > + > +test_expect_success 'clone non-utf8 repo with strict encoding' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err && > + grep "Decoding perforce metadata failed!" err > +' > + > +test_expect_success 'check utf-8 contents with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some utf-8 tǣxt" actual && > + grep "ǣuthor" actual > + ) > +' > + > +test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && > + grep "$badly_encoded_in_git" actual && > + bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" && > + grep "$bad_author_in_git" actual > + ) > +' > + > +test_expect_success 'check utf-8 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some utf-8 tǣxt" actual && > + grep "ǣuthor" actual > + ) > +' > + > +test_expect_success 'check latin-1 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "some latin-1 tæxt" actual && > + grep "æuthor" actual > + ) > +' > + > +test_expect_success 'check cp-1252 contents with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "sœme cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > +' > + > +test_expect_success 'check cp850 contents parsed with correct fallback' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "hÅs some cp850 text" actual && > + grep "Åuthor" actual > + ) > +' > + > +test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "h%8Fs some cp850 text" actual && > + grep "%8Futhor" actual > + ) > +' > + > +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && > + ( > + cd "$cli" && > + P4USER=cp1252_author && > + touch file10 && > + p4 add file10 && > + p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | > + iconv -f utf8 -t cp1252)" > + ) && > + ( > + cd "$git" && > + > + git p4.py sync --branch=master && > + > + git log p4/master >actual && > + grep "sœme more cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > +' > + > +############################ > +## / END REPEATED SECTION ## > +############################ > + > + > +test_expect_success 'fallback (both utf-8 and cp-1252 contents handled) is the default with python3' ' > + test_when_finished cleanup_git && > + test_when_finished remove_user_cache && > + git p4.py clone --dest="$git" //depot@all && > + ( > + cd "$git" && > + git log >actual && > + grep "sœme cp-1252 tæxt" actual && > + grep "æuthœr" actual > + ) > +' > + > +test_done > > base-commit: 11cfe552610386954886543f5de87dcc49ad5735 > -- > gitgitgadget
diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index e21fcd8f712..de5ee6748e3 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -636,7 +636,42 @@ git-p4.pathEncoding:: Git expects paths encoded as UTF-8. Use this config to tell git-p4 what encoding Perforce had used for the paths. This encoding is used to transcode the paths to UTF-8. As an example, Perforce on Windows - often uses "cp1252" to encode path names. + often uses "cp1252" to encode path names. If this option is passed + into a p4 clone request, it is persisted in the resulting new git + repo. + +git-p4.metadataDecodingStrategy:: + Perforce keeps the encoding of a changelist descriptions and user + full names as stored by the client on a given OS. The p4v client + uses the OS-local encoding, and so different users can end up storing + different changelist descriptions or user full names in different + encodings, in the same depot. + Git tolerates inconsistent/incorrect encodings in commit messages + and author names, but expects them to be specified in utf-8. + git-p4 can use three different decoding strategies in handling the + encoding uncertainty in Perforce: 'passthrough' simply passes the + original bytes through from Perforce to git, creating usable but + incorrectly-encoded data when the Perforce data is encoded as + anything other than utf-8. 'strict' expects the Perforce data to be + encoded as utf-8, and fails to import when this is not true. + 'fallback' attempts to interpret the data as utf-8, and otherwise + falls back to using a secondary encoding - by default the common + windows encoding 'cp-1252' - with upper-range bytes escaped if + decoding with the fallback encoding also fails. + Under python2 the default strategy is 'passthrough' for historical + reasons, and under python3 the default is 'fallback'. + When 'strict' is selected and decoding fails, the error message will + propose changing this config parameter as a workaround. If this + option is passed into a p4 clone request, it is persisted into the + resulting new git repo. + +git-p4.metadataFallbackEncoding:: + Specify the fallback encoding to use when decoding Perforce author + names and changelists descriptions using the 'fallback' strategy + (see git-p4.metadataDecodingStrategy). The fallback encoding will + only be used when decoding as utf-8 fails. This option defaults to + cp1252, a common windows encoding. If this option is passed into a + p4 clone request, it is persisted into the resulting new git repo. git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note diff --git a/git-p4.py b/git-p4.py index a9b1f904410..d24c3535f8a 100755 --- a/git-p4.py +++ b/git-p4.py @@ -15,6 +15,7 @@ # pylint: disable=too-many-statements,too-many-instance-attributes # pylint: disable=too-many-branches,too-many-nested-blocks # +import struct import sys if sys.version_info.major < 3 and sys.version_info.minor < 7: sys.stderr.write("git-p4: requires Python 2.7 or later.\n") @@ -54,6 +55,9 @@ defaultLabelRegexp = r'[a-zA-Z0-9_\-.]+$' # The block size is reduced automatically if required defaultBlockSize = 1<<20 +defaultMetadataDecodingStrategy = 'passthrough' if sys.version_info.major == 2 else 'fallback' +defaultFallbackMetadataEncoding = 'cp1252' + p4_access_checked = False re_ko_keywords = re.compile(br'\$(Id|Header)(:[^$\n]+)?\$') @@ -203,6 +207,70 @@ else: def encode_text_stream(s): return s.encode('utf_8') if isinstance(s, unicode) else s + +class MetadataDecodingException(Exception): + def __init__(self, input_string): + self.input_string = input_string + + def __str__(self): + return """Decoding perforce metadata failed! +The failing string was: +--- +{} +--- +Consider setting the git-p4.metadataDecodingStrategy config option to +'fallback', to allow metadata to be decoded using a fallback encoding, +defaulting to cp1252.""".format(self.input_string) + + +encoding_fallback_warning_issued = False +encoding_escape_warning_issued = False +def metadata_stream_to_writable_bytes(s): + encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy + fallbackEncoding = gitConfig('git-p4.metadataFallbackEncoding') or defaultFallbackMetadataEncoding + if not isinstance(s, bytes): + return s.encode('utf_8') + if encodingStrategy == 'passthrough': + return s + try: + s.decode('utf_8') + return s + except UnicodeDecodeError: + if encodingStrategy == 'fallback' and fallbackEncoding: + global encoding_fallback_warning_issued + global encoding_escape_warning_issued + try: + if not encoding_fallback_warning_issued: + print("\nCould not decode value as utf-8; using configured fallback encoding %s: %s" % (fallbackEncoding, s)) + print("\n(this warning is only displayed once during an import)") + encoding_fallback_warning_issued = True + return s.decode(fallbackEncoding).encode('utf_8') + except Exception as exc: + if not encoding_escape_warning_issued: + print("\nCould not decode value with configured fallback encoding %s; escaping bytes over 127: %s" % (fallbackEncoding, s)) + print("\n(this warning is only displayed once during an import)") + encoding_escape_warning_issued = True + escaped_bytes = b'' + # bytes and strings work very differently in python2 vs python3... + if str is bytes: + for byte in s: + byte_number = struct.unpack('>B', byte)[0] + if byte_number > 127: + escaped_bytes += b'%' + escaped_bytes += hex(byte_number)[2:].upper() + else: + escaped_bytes += byte + else: + for byte_number in s: + if byte_number > 127: + escaped_bytes += b'%' + escaped_bytes += hex(byte_number).upper().encode()[2:] + else: + escaped_bytes += bytes([byte_number]) + return escaped_bytes + + raise MetadataDecodingException(s) + def decode_path(path): """Decode a given string (bytes or otherwise) using configured path encoding options """ @@ -702,11 +770,12 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, if bytes is not str: # Decode unmarshalled dict to use str keys and values, except for: # - `data` which may contain arbitrary binary data - # - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text + # - `desc` or `FullName` which may contain non-UTF8 encoded text handled below, eagerly converted to bytes + # - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text, handled by decode_path() decoded_entry = {} for key, value in entry.items(): key = key.decode() - if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): + if isinstance(value, bytes) and not (key in ('data', 'desc', 'FullName', 'path', 'clientFile') or key.startswith('depotFile')): value = value.decode() decoded_entry[key] = value # Parse out data if it's an error response @@ -716,6 +785,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, if skip_info: if 'code' in entry and entry['code'] == 'info': continue + if 'desc' in entry: + entry['desc'] = metadata_stream_to_writable_bytes(entry['desc']) + if 'FullName' in entry: + entry['FullName'] = metadata_stream_to_writable_bytes(entry['FullName']) if cb is not None: cb(entry) else: @@ -1435,7 +1508,13 @@ class P4UserMap: for output in p4CmdList(["users"]): if "User" not in output: continue - self.users[output["User"]] = output["FullName"] + " <" + output["Email"] + ">" + # "FullName" is bytes. "Email" on the other hand might be bytes + # or unicode string depending on whether we are running under + # python2 or python3. To support + # git-p4.metadataDecodingStrategy=fallback, self.users dict values + # are always bytes, ready to be written to git. + emailbytes = metadata_stream_to_writable_bytes(output["Email"]) + self.users[output["User"]] = output["FullName"] + b" <" + emailbytes + b">" self.emails[output["Email"]] = output["User"] mapUserConfigRegex = re.compile(r"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE) @@ -1445,26 +1524,28 @@ class P4UserMap: user = mapUser[0][0] fullname = mapUser[0][1] email = mapUser[0][2] - self.users[user] = fullname + " <" + email + ">" + fulluser = fullname + " <" + email + ">" + self.users[user] = metadata_stream_to_writable_bytes(fulluser) self.emails[email] = user - s = '' + s = b'' for (key, val) in self.users.items(): - s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1)) + keybytes = metadata_stream_to_writable_bytes(key) + s += b"%s\t%s\n" % (keybytes.expandtabs(1), val.expandtabs(1)) - open(self.getUserCacheFilename(), 'w').write(s) + open(self.getUserCacheFilename(), 'wb').write(s) self.userMapFromPerforceServer = True def loadUserMapFromCache(self): self.users = {} self.userMapFromPerforceServer = False try: - cache = open(self.getUserCacheFilename(), 'r') + cache = open(self.getUserCacheFilename(), 'rb') lines = cache.readlines() cache.close() for line in lines: - entry = line.strip().split("\t") - self.users[entry[0]] = entry[1] + entry = line.strip().split(b"\t") + self.users[entry[0].decode('utf_8')] = entry[1] except IOError: self.getUserMapFromPerforceServer() @@ -3020,7 +3101,8 @@ class P4Sync(Command, P4UserMap): if userid in self.users: return self.users[userid] else: - return "%s <a@b>" % userid + userid_bytes = metadata_stream_to_writable_bytes(userid) + return b"%s <a@b>" % userid_bytes def streamTag(self, gitStream, labelName, labelDetails, commit, epoch): """ Stream a p4 tag. @@ -3043,9 +3125,10 @@ class P4Sync(Command, P4UserMap): email = self.make_email(owner) else: email = self.make_email(self.p4UserId()) - tagger = "%s %s %s" % (email, epoch, self.tz) - gitStream.write("tagger %s\n" % tagger) + gitStream.write("tagger ") + gitStream.write(email) + gitStream.write(" %s %s\n" % (epoch, self.tz)) print("labelDetails=",labelDetails) if 'Description' in labelDetails: @@ -3138,12 +3221,12 @@ class P4Sync(Command, P4UserMap): self.gitStream.write("commit %s\n" % branch) self.gitStream.write("mark :%s\n" % details["change"]) self.committedChanges.add(int(details["change"])) - committer = "" if author not in self.users: self.getUserMapFromPerforceServer() - committer = "%s %s %s" % (self.make_email(author), epoch, self.tz) - self.gitStream.write("committer %s\n" % committer) + self.gitStream.write("committer ") + self.gitStream.write(self.make_email(author)) + self.gitStream.write(" %s %s\n" % (epoch, self.tz)) self.gitStream.write("data <<EOT\n") self.gitStream.write(details["desc"]) @@ -4055,6 +4138,14 @@ class P4Clone(P4Sync): if self.useClientSpec_from_options: system(["git", "config", "--bool", "git-p4.useclientspec", "true"]) + # persist any git-p4 encoding-handling config options passed in for clone: + if gitConfig('git-p4.metadataDecodingStrategy'): + system(["git", "config", "git-p4.metadataDecodingStrategy", gitConfig('git-p4.metadataDecodingStrategy')]) + if gitConfig('git-p4.metadataFallbackEncoding'): + system(["git", "config", "git-p4.metadataFallbackEncoding", gitConfig('git-p4.metadataFallbackEncoding')]) + if gitConfig('git-p4.pathEncoding'): + system(["git", "config", "git-p4.pathEncoding", gitConfig('git-p4.pathEncoding')]) + return True class P4Unshelve(Command): diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh index 5aff2abe8b5..2a5b8738ea3 100644 --- a/t/lib-git-p4.sh +++ b/t/lib-git-p4.sh @@ -142,10 +142,11 @@ start_p4d () { p4_add_user () { name=$1 && + fullname="${2:-Dr. $1}" p4 user -f -i <<-EOF User: $name Email: $name@example.com - FullName: Dr. $name + FullName: $fullname EOF } diff --git a/t/t9835-git-p4-metadata-encoding-python2.sh b/t/t9835-git-p4-metadata-encoding-python2.sh new file mode 100755 index 00000000000..036bf79c667 --- /dev/null +++ b/t/t9835-git-p4-metadata-encoding-python2.sh @@ -0,0 +1,213 @@ +#!/bin/sh + +test_description='git p4 metadata encoding + +This test checks that the import process handles inconsistent text +encoding in p4 metadata (author names, commit messages, etc) without +failing, and produces maximally sane output in git.' + +. ./lib-git-p4.sh + +python_target_version='2' + +############################### +## SECTION REPEATED IN t9836 ## +############################### + +# Please note: this test calls "git-p4.py" rather than "git-p4", because the +# latter references a specific path so we can't easily force it to run under +# the python version we need to. + +python_major_version=$(python -V 2>&1 | cut -c 8) +python_target_binary=$(which python$python_target_version) +if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary" +then + mkdir temp_python + PATH="$(pwd)/temp_python:$PATH" && export PATH + ln -s $python_target_binary temp_python/python +fi + +python_major_version=$(python -V 2>&1 | cut -c 8) +if ! test "$python_major_version" = "$python_target_version" +then + skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available" + test_done +fi + +remove_user_cache () { + rm "$HOME/.gitp4-usercache.txt" || true +} + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'init depot' ' + ( + cd "$cli" && + + p4_add_user "utf8_author" "ǣuthor" && + P4USER=utf8_author && + touch file1 && + p4 add file1 && + p4 submit -d "first CL has some utf-8 tǣxt" && + + p4_add_user "latin1_author" "$(echo æuthor | + iconv -f utf8 -t latin1)" && + P4USER=latin1_author && + touch file2 && + p4 add file2 && + p4 submit -d "$(echo second CL has some latin-1 tæxt | + iconv -f utf8 -t latin1)" && + + p4_add_user "cp1252_author" "$(echo æuthœr | + iconv -f utf8 -t cp1252)" && + P4USER=cp1252_author && + touch file3 && + p4 add file3 && + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | + iconv -f utf8 -t cp1252)" && + + p4_add_user "cp850_author" "$(echo Åuthor | + iconv -f utf8 -t cp850)" && + P4USER=cp850_author && + touch file4 && + p4 add file4 && + p4 submit -d "$(echo fourth CL hÅs some cp850 text | + iconv -f utf8 -t cp850)" + ) +' + +test_expect_success 'clone non-utf8 repo with strict encoding' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err && + grep "Decoding perforce metadata failed!" err +' + +test_expect_success 'check utf-8 contents with passthrough strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some utf-8 tǣxt" actual && + grep "ǣuthor" actual + ) +' + +test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && + grep "$badly_encoded_in_git" actual && + bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" && + grep "$bad_author_in_git" actual + ) +' + +test_expect_success 'check utf-8 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some utf-8 tǣxt" actual && + grep "ǣuthor" actual + ) +' + +test_expect_success 'check latin-1 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some latin-1 tæxt" actual && + grep "æuthor" actual + ) +' + +test_expect_success 'check cp-1252 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "sœme cp-1252 tæxt" actual && + grep "æuthœr" actual + ) +' + +test_expect_success 'check cp850 contents parsed with correct fallback' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "hÅs some cp850 text" actual && + grep "Åuthor" actual + ) +' + +test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "h%8Fs some cp850 text" actual && + grep "%8Futhor" actual + ) +' + +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$cli" && + P4USER=cp1252_author && + touch file10 && + p4 add file10 && + p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | + iconv -f utf8 -t cp1252)" + ) && + ( + cd "$git" && + + git p4.py sync --branch=master && + + git log p4/master >actual && + grep "sœme more cp-1252 tæxt" actual && + grep "æuthœr" actual + ) +' + +############################ +## / END REPEATED SECTION ## +############################ + +test_expect_success 'passthrough (latin-1 contents corrupted in git) is the default with python2' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && + grep "$badly_encoded_in_git" actual + ) +' + +test_done diff --git a/t/t9836-git-p4-metadata-encoding-python3.sh b/t/t9836-git-p4-metadata-encoding-python3.sh new file mode 100755 index 00000000000..63350dc4b5c --- /dev/null +++ b/t/t9836-git-p4-metadata-encoding-python3.sh @@ -0,0 +1,214 @@ +#!/bin/sh + +test_description='git p4 metadata encoding + +This test checks that the import process handles inconsistent text +encoding in p4 metadata (author names, commit messages, etc) without +failing, and produces maximally sane output in git.' + +. ./lib-git-p4.sh + +python_target_version='3' + +############################### +## SECTION REPEATED IN t9835 ## +############################### + +# Please note: this test calls "git-p4.py" rather than "git-p4", because the +# latter references a specific path so we can't easily force it to run under +# the python version we need to. + +python_major_version=$(python -V 2>&1 | cut -c 8) +python_target_binary=$(which python$python_target_version) +if ! test "$python_major_version" = "$python_target_version" && test "$python_target_binary" +then + mkdir temp_python + PATH="$(pwd)/temp_python:$PATH" && export PATH + ln -s $python_target_binary temp_python/python +fi + +python_major_version=$(python -V 2>&1 | cut -c 8) +if ! test "$python_major_version" = "$python_target_version" +then + skip_all="skipping python$python_target_version-specific git p4 tests; python$python_target_version not available" + test_done +fi + +remove_user_cache () { + rm "$HOME/.gitp4-usercache.txt" || true +} + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'init depot' ' + ( + cd "$cli" && + + p4_add_user "utf8_author" "ǣuthor" && + P4USER=utf8_author && + touch file1 && + p4 add file1 && + p4 submit -d "first CL has some utf-8 tǣxt" && + + p4_add_user "latin1_author" "$(echo æuthor | + iconv -f utf8 -t latin1)" && + P4USER=latin1_author && + touch file2 && + p4 add file2 && + p4 submit -d "$(echo second CL has some latin-1 tæxt | + iconv -f utf8 -t latin1)" && + + p4_add_user "cp1252_author" "$(echo æuthœr | + iconv -f utf8 -t cp1252)" && + P4USER=cp1252_author && + touch file3 && + p4 add file3 && + p4 submit -d "$(echo third CL has sœme cp-1252 tæxt | + iconv -f utf8 -t cp1252)" && + + p4_add_user "cp850_author" "$(echo Åuthor | + iconv -f utf8 -t cp850)" && + P4USER=cp850_author && + touch file4 && + p4 add file4 && + p4 submit -d "$(echo fourth CL hÅs some cp850 text | + iconv -f utf8 -t cp850)" + ) +' + +test_expect_success 'clone non-utf8 repo with strict encoding' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + test_must_fail git -c git-p4.metadataDecodingStrategy=strict p4.py clone --dest="$git" //depot@all 2>err && + grep "Decoding perforce metadata failed!" err +' + +test_expect_success 'check utf-8 contents with passthrough strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some utf-8 tǣxt" actual && + grep "ǣuthor" actual + ) +' + +test_expect_success 'check latin-1 contents corrupted in git with passthrough strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=passthrough p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + badly_encoded_in_git=$(echo "some latin-1 tæxt" | iconv -f utf8 -t latin1) && + grep "$badly_encoded_in_git" actual && + bad_author_in_git="$(echo æuthor | iconv -f utf8 -t latin1)" && + grep "$bad_author_in_git" actual + ) +' + +test_expect_success 'check utf-8 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some utf-8 tǣxt" actual && + grep "ǣuthor" actual + ) +' + +test_expect_success 'check latin-1 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "some latin-1 tæxt" actual && + grep "æuthor" actual + ) +' + +test_expect_success 'check cp-1252 contents with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "sœme cp-1252 tæxt" actual && + grep "æuthœr" actual + ) +' + +test_expect_success 'check cp850 contents parsed with correct fallback' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback -c git-p4.metadataFallbackEncoding=cp850 p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "hÅs some cp850 text" actual && + grep "Åuthor" actual + ) +' + +test_expect_success 'check cp850-only contents escaped when cp1252 is fallback' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "h%8Fs some cp850 text" actual && + grep "%8Futhor" actual + ) +' + +test_expect_success 'check cp-1252 contents on later sync after clone with fallback strategy' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git -c git-p4.metadataDecodingStrategy=fallback p4.py clone --dest="$git" //depot@all && + ( + cd "$cli" && + P4USER=cp1252_author && + touch file10 && + p4 add file10 && + p4 submit -d "$(echo later CL has sœme more cp-1252 tæxt | + iconv -f utf8 -t cp1252)" + ) && + ( + cd "$git" && + + git p4.py sync --branch=master && + + git log p4/master >actual && + grep "sœme more cp-1252 tæxt" actual && + grep "æuthœr" actual + ) +' + +############################ +## / END REPEATED SECTION ## +############################ + + +test_expect_success 'fallback (both utf-8 and cp-1252 contents handled) is the default with python3' ' + test_when_finished cleanup_git && + test_when_finished remove_user_cache && + git p4.py clone --dest="$git" //depot@all && + ( + cd "$git" && + git log >actual && + grep "sœme cp-1252 tæxt" actual && + grep "æuthœr" actual + ) +' + +test_done