Message ID: 20210412085251.51475-3-andrew@adoakley.name (mailing list archive)
State: New, archived
Series: git-p4: encoding of data from perforce
I checked out "seen" and ran the test script from this patch (t9835-git-p4-message-encoding.sh) on my Windows machine, and it fails. I don't think the solution in this patch will solve the issue of non-UTF-8 descriptions on Windows. The interaction between git-p4.py and p4 around non-ASCII descriptions is different on Linux and Windows (at least with the default code page settings). Unfortunately, the CI on GitLab does not include any Windows test environments that have p4 installed.

As far as I can tell, non-ASCII strings passed to "p4 submit -d" pass unchanged to the Perforce database on Linux. Such data also passes unchanged in the other direction, when "p4" output is consumed by git-p4.py. Since this patch avoids decoding descriptions, and the test script uses binary data for descriptions, the tests pass on Linux.

However, on Windows, UTF-8 strings passed to "p4 submit -d" are somehow converted to the default Windows code page by the time they are stored in the Perforce database, probably as part of the process of passing the command line arguments to the Windows p4 executable. However, the "code page" data is *not* converted to UTF-8 on the way back from p4 to git-p4.py. The only way to get it into UTF-8 is to call string.decode(). As a result, this patch, which takes out the call to string.decode(), will not work on Windows.
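The failure mode described above can be reproduced in plain Python (illustrative bytes only; cp1252 is assumed as the Windows default code page): a description typed as UTF-8 comes back from p4 re-encoded, so it no longer decodes as UTF-8 and needs an explicit decode with the actual code page.

```python
# A description typed as UTF-8 "café" is re-encoded to the console code
# page (assumed cp1252 here) on its way through the Windows command line.
desc_from_p4 = 'café'.encode('cp1252')      # b'caf\xe9'

try:
    text = desc_from_p4.decode('utf-8')     # what git-p4 gets without a decode
except UnicodeDecodeError:
    text = desc_from_p4.decode('cp1252')    # recovers the original text

print(text)  # café
```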
On Thu, 29 Apr 2021 03:00:06 -0700 Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
> somehow converted to the default Windows code page by the time they
> are stored in the Perforce database, probably as part of the process
> of passing the command line arguments to the Windows p4 executable.
> However, the "code page" data is *not* converted to UTF-8 on the way
> back from p4 to git-p4.py. The only way to get it into UTF-8 is to
> call string.decode(). As a result, this patch, which takes out the
> call to string.decode() will not work on Windows.

Thanks for that explanation, the reencoding of the data on Windows is not something I was expecting. Given the behaviour you've described, I suspect that there might be two different problems that we are trying to solve.

The perforce depot I'm working with has a mixture of encodings, and commits are created from a variety of different environments. The majority of commits are ASCII or UTF-8; there are a small number that are in some other encoding. Any attempt to reencode the data is likely to make the problem worse in at least some cases.

I suspect that other perforce depots are used primarily from Windows machines, and have data that is encoded in a mostly consistent way, but the encoding is not UTF-8. Re-encoding the data for git makes sense in that case. Is this the kind of repository you have?

If there are these two different cases then we probably need to come up with a patch that solves both issues.

For my case, where we've got a repository containing all sorts of junk, it sounds like it might be awkward to create a test case that works on Windows.
On 30/04/2021 08:53, Andrew Oakley wrote:
> On Thu, 29 Apr 2021 03:00:06 -0700
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
>> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
>> somehow converted to the default Windows code page by the time they
>> are stored in the Perforce database, probably as part of the process
>> of passing the command line arguments to the Windows p4 executable.
>> However, the "code page" data is *not* converted to UTF-8 on the way
>> back from p4 to git-p4.py. The only way to get it into UTF-8 is to
>> call string.decode(). As a result, this patch, which takes out the
>> call to string.decode() will not work on Windows.
>
> Thanks for that explanation, the reencoding of the data on Windows is
> not something I was expecting. Given the behaviour you've described, I
> suspect that there might be two different problems that we are trying
> to solve.
>
> The perforce depot I'm working with has a mixture of encodings, and
> commits are created from a variety of different environments. The
> majority of commits are ASCII or UTF-8, there are a small number that
> are in some other encoding. Any attempt to reencode the data is likely
> to make the problem worse in at least some cases.
>
> I suspect that other perforce depots are used primarily from Windows
> machines, and have data that is encoded in a mostly consistent way but
> the encoding is not UTF-8. Re-encoding the data for git makes sense in
> that case. Is this the kind of repository you have?
>
> If there are these two different cases then we probably need to come up
> with a patch that solves both issues.
>
> For my cases where we've got a repository containing all sorts of junk,
> it sounds like it might be awkward to create a test case that works on
> Windows.

https://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

Tzadik - is your server unicode enabled or not? That would be interesting to know:

    p4 counters | grep -i unicode

I suspect it is not. It's only if unicode is enabled that the server will convert to/from utf8 (at least that's my understanding). Without this setting, p4d and p4 are (probably) not doing any conversions.

I think it might be useful to clarify exactly what conversions are actually happening. I wonder what encoding Perforce thinks you've got in place.
On Fri, Apr 30, 2021 at 8:33 AM Luke Diamand <luke@diamand.org> wrote:
>
> Tzadik - is your server unicode enabled or not? That would be
> interesting to know:
>
>     p4 counters | grep -i unicode
>
> I suspect it is not. It's only if unicode is enabled that the server
> will convert to/from utf8 (at least that's my understanding). Without
> this setting, p4d and p4 are (probably) not doing any conversions.

My server is not unicode.

These conversions are happening even with a non-Unicode perforce db. I don't think it's the p4d code per se that is doing the conversion, but rather an interaction between the OS and the code, which is different under Linux vs Windows.

If you create a trivial C program that dumps the hex values of the bytes it receives in argv, you can see this different behavior:

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int i, j;
        char *s;
        for (i = 1; i < argc; ++i) {
            s = argv[i];
            for (j = 0; s[j] != '\0'; ++j)
                printf(" %X", (unsigned char)s[j]);
            printf("\n");
            printf("[%s]\n\n", s);
        }
        return 0;
    }

When built with Visual Studio and called from Cygwin, if you pass in args with UTF-8 encoded characters, the program will spit them out in cp1252. If you compile it on a Linux system using gcc, it will spit them out in UTF-8 (unchanged). I suspect that's what's happening with p4d on Windows vs Linux.

In any event, if you look at my patch (v6 is the latest... https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/ ), you will see I have written tests that pass under both Linux and Windows. (If you want to run them yourself, you need to base my patch off of "master", not "seen".) The tests make clear what the different behavior is and also show that p4d is not set to Unicode (since the tests do not change the default setting).
On Fri, 30 Apr 2021 11:08:57 -0700 Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> My server is not unicode.
>
> These conversions are happening even with a non-Unicode perforce db.
> I don't think it's the p4d code per se that is doing the conversion,
> but rather an interaction between the OS and the code, which is
> different under Linux vs Windows.

It's not particularly obvious exactly what is happening here. The perforce command line client is written in a rather odd way - it uses the unicode (UTF-16) wWinMainCRTStartup entry point but then calls an undocumented API to get the "narrow" version of the command line. The code is here:
https://swarm.workshop.perforce.com/projects/perforce_software-p4/files/2016-1/client/clientmain.cc

I think that perforce will end up with the data in a code page that depends on the configuration of the machine. I don't think the exact details matter here - just that it's some semi-arbitrary encoding that isn't recorded in the commit.

The key thing that I'm trying to point out here is that the encoding is not necessarily consistent between different commits. The changes that you have proposed force you to pick one encoding that will be used for every commit. If it's wrong then data will be corrupted, and there is no option provided to avoid that. The only way I can see to avoid this issue is to not attempt to re-encode the data - just pass it directly to git.

I think another way to solve the issue you have is the encoding header on git commits. We can pass the bytes through git-p4 unmodified, but mark the commit message as being encoded using something that isn't UTF-8. That avoids any potentially lossy conversions when cloning the repository, but should allow the data to be displayed correctly in git.

> In any event, if you look at my patch (v6 is the latest...
> https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> ), you will see I have written tests that pass under both Linux and
> Windows. (If you want to run them yourself, you need to base my patch
> off of "master", not "seen"). The tests make clear what the
> different behavior is and also show that p4d is not set to Unicode
> (since the tests do not change the default setting).

I don't think the tests are doing anything interesting on Linux - you stick valid UTF-8 in, and valid UTF-8 data comes out. I suspect the tests will fail on Windows if the relevant code page is set to a value that you're not expecting.

For the purposes of writing tests that work the same everywhere we can use `p4 submit -i`. The data written on stdin isn't reencoded, even on Windows.

I can rework my test to use `p4 submit -i` on windows. It should be fairly simple to write another change which allows the encoding to be set on commits created by git-p4. Does that seem like a reasonable way forward? I think it gets us:
- sensible handling for repositories with mixed encodings
- sensible handling for repositories with known encodings
- tests that work the same on Linux and Windows
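Allowing the encoding to be set on commits created by git-p4 is feasible because git fast-import (which git-p4 drives) accepts an `encoding` directive in its commit command (available in recent versions of git). A minimal sketch of building such a commit fragment - the ref, identity, and timestamp are made up, and this is not what git-p4 currently emits:

```python
def commit_fragment(raw_desc: bytes, encoding: str = 'cp1252') -> bytes:
    """Build a fast-import 'commit' fragment that passes the description
    bytes through unmodified and labels their encoding for git."""
    return b''.join([
        b'commit refs/heads/p4/example\n',
        b'committer P4 User <user@example.com> 1620000000 +0000\n',
        b'encoding ' + encoding.encode('ascii') + b'\n',
        b'data ' + str(len(raw_desc)).encode('ascii') + b'\n',
        raw_desc, b'\n',
    ])

# 'desc' bytes are cp1252, not valid UTF-8; they go through untouched.
frag = commit_fragment(b'caf\xe9')
```

The description bytes never pass through a decode/encode round trip, so nothing can be lost; git tools that understand the encoding header can still display the message correctly.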
On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> The key thing that I'm trying to point out here is that the encoding is
> not necessarily consistent between different commits. The changes that
> you have proposed force you to pick one encoding that will be used for
> every commit. If it's wrong then data will be corrupted, and there is
> no option provided to avoid that. The only way I can see to avoid this
> issue is to not attempt to re-encode the data - just pass it directly
> to git.

No, my "fallbackEncoding" setting is just that... a fallback. My proposal *always* tries to decode in UTF-8 first! Only if that throws an exception will my "fallbackEncoding" come into play, and it only comes into play for the single changeset description that was invalid UTF-8. After that, subsequent descriptions will again be tried in UTF-8 first.

The design of the UTF-8 format makes it very unlikely that non-UTF-8 text will pass undetected through a UTF-8 decoder, so by attempting to decode in UTF-8 first, there is very little risk of a lossy conversion.

As for passing data through "raw", that will *guarantee* bad encoding on any descriptions that are not UTF-8, because git will interpret the data as UTF-8 once it has been put into the commit (unless the encoding header is used, as you mentioned). If that header is not used, and the description was not in UTF-8 in Perforce, it has zero chance of being correct in git unless it is decoded. At least "fallbackEncoding" gives it SOME chance of being decoded correctly.

> I think another way to solve the issue you have is the encoding header
> on git commits. We can pass the bytes through git-p4 unmodified, but
> mark the commit message as being encoded using something that isn't
> UTF-8. That avoids any potentially lossy conversions when cloning the
> repository, but should allow the data to be displayed correctly in git.

Yes, that could be a solution. I will try that out.

> > In any event, if you look at my patch (v6 is the latest...
> > https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> > ), you will see I have written tests that pass under both Linux and
> > Windows. (If you want to run them yourself, you need to base my patch
> > off of "master", not "seen"). The tests make clear what the
> > different behavior is and also show that p4d is not set to Unicode
> > (since the tests do not change the default setting).
>
> I don't think the tests are doing anything interesting on Linux - you
> stick valid UTF-8 in, and valid UTF-8 data comes out.

Totally agree.... I only did that to get them to pass the GitLab CI. I submitted an earlier patch that simply skipped the test file on Linux, but I got pushback on that, so I made them pass on Linux, even though they aren't useful.

> I suspect the tests will fail on Windows if the relevant code page is
> set to a value that you're not expecting.

It depends. If the code page is set to UTF-8 (65001), I think the tests would still work, because as I said above, my code always tries to decode with UTF-8 first, no matter what the "fallbackEncoding" setting is. If the code page is set to something other than UTF-8 or the default, then one of my tests would fail, because it uses a hard-coded "fallbackEncoding" of "cp1252". But the code would work for the user! All they would need to do is set "fallbackEncoding" to the code page they're actually using, instead of "cp1252". (More sophisticated tests could be developed that explicitly set the code page and use the corresponding "fallbackEncoding" setting.)

> For the purposes of writing tests that work the same everywhere we can
> use `p4 submit -i`. The data written on stdin isn't reencoded, even on
> Windows.

I have already gone down the `p4 submit -i` road. It behaves exactly the same as passing the description on the command line. (One of several dead ends I went down that I haven't mentioned in my emails.)
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
> On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> The key thing that I'm trying to point out here is that the encoding is
>> not necessarily consistent between different commits. The changes that
>> you have proposed force you to pick one encoding that will be used for
>> every commit. If it's wrong then data will be corrupted, and there is
>> no option provided to avoid that. The only way I can see to avoid this
>> issue is to not attempt to re-encode the data - just pass it directly
>> to git.
>
> No, my "fallbackEncoding" setting is just that... a fallback. My proposal
> *always* tries to decode in UTF-8 first! Only if that throws an exception
> will my "fallbackEncoding" come into play, and it only comes into play
> for the single changeset description that was invalid UTF-8. After that,
> subsequent descriptions will again be tried in UTF-8 first.

Hmph, I do not quite see the need for "No" at the beginning of what you said. The fallbackEncoding can specify only one non-UTF-8 encoding, so even if the majority of commits were in UTF-8, when you need to import two commits in different non-UTF-8 encodings, there is no suitable value to give to the fallbackEncoding setting. One of these two commits will fail to decode first in UTF-8 and then fail to decode again with the fallback, and after that a corrupted message remains.

>> I think another way to solve the issue you have is the encoding header
>> on git commits. We can pass the bytes through git-p4 unmodified, but
>> mark the commit message as being encoded using something that isn't
>> UTF-8. That avoids any potentially lossy conversions when cloning the
>> repository, but should allow the data to be displayed correctly in git.
>
> Yes, that could be a solution. I will try that out.

If we can determine in what encoding the thing that came out of Perforce is written, we can put it on the encoding header of the resulting commit. But if that is possible to begin with, perhaps we do not even need to do so---if you can determine what the original encoding is, you can reencode with that encoding into UTF-8 inside git-p4 while creating the commit, no?

And if the raw data that came from Perforce cannot be reencoded to UTF-8 (i.e. iconv fails to process it for some reason), then whether the translation is done at import time (i.e. where you would have used the fallbackEncoding to reencode into UTF-8) or at display time (i.e. "git show" would notice the encoding header and try to reencode the raw data from that encoding into UTF-8), it would fail in the same way, so I do not see much advantage in writing the encoding header into the resulting object (other than shifting the blame downstream and keeping the original data intact, which is a good design principle).
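The corruption in the mixed-encoding scenario can in fact be worse than a failed decode: with single-byte encodings, bytes from one code page frequently decode *successfully* under another, so a wrong fallback yields mojibake with no error at all. An illustration with made-up data (cp1251 text run through a cp1252 fallback):

```python
raw = 'Привет'.encode('cp1251')   # a Russian description stored in cp1251

try:
    text = raw.decode('utf-8')    # step 1: fails, as expected
except UnicodeDecodeError:
    text = raw.decode('cp1252')   # step 2: "succeeds" -- but silently wrong

print(text)  # 'Ïðèâåò' -- mojibake, with no exception to signal the damage
```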
On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> >> The key thing that I'm trying to point out here is that the encoding is
> >> not necessarily consistent between different commits. The changes that
> >> you have proposed force you to pick one encoding that will be used for
> >> every commit. If it's wrong then data will be corrupted, and there is
> >> no option provided to avoid that. The only way I can see to avoid this
> >> issue is to not attempt to re-encode the data - just pass it directly
> >> to git.
> >
> > No, my "fallbackEncoding" setting is just that... a fallback. My proposal
> > *always* tries to decode in UTF-8 first! Only if that throws an exception
> > will my "fallbackEncoding" come into play, and it only comes into play
> > for the single changeset description that was invalid UTF-8. After that,
> > subsequent descriptions will again be tried in UTF-8 first.
>
> The fallbackEncoding can specify only one non UTF-8
> encoding, so even if majority of commits were in UTF-8 but when you
> need to import two commits with non UTF-8 encoding, there is no
> suitable value to give to the fallbackEncoding setting. One of
> these two commits will fail to decode first in UTF-8 and then fail
> to decode again with the fallback, and after that a corrupted
> message remains.

I'm not sure I understand your scenario. If the majority of commits are in UTF-8, but there are 2 with the same UTF-8 encoding (say "cp1252"), then just set "fallbackEncoding" to "cp1252" and all the commits will display fine.

Are you talking about a scenario where most of the commits are UTF-8, one is "cp1252" and another one is "cp1251", so a total of 3 encodings are used in the Perforce depot? I don't think that is a common scenario. But you have a point that my patch does not address that scenario.

> If we can determine in what encoding the thing that came out of
> Perforce is written in, we can put it on the encoding header of the
> resulting commit. But if that is possible to begin with, perhaps we
> do not even need to do so---if you can determine what the original
> encoding is, you can reencode with that encoding into UTF-8 inside
> git-p4 while creating the commit, no?
>
> And if the raw data that came from Perforce cannot be reencoded to
> UTF-8 (i.e. iconv fails to process for some reason), then whether
> the translation is done at the import time (i.e. where you would
> have used the fallbackEncoding to reencode into UTF-8) or at the
> display time (i.e. "git show" would notice the encoding header and
> try to reencode the raw data from that encoding into UTF-8), it
> would fail in the same way, so I do not see much advantage in
> writing the encoding header into the resulting object (other than
> shifting the blame to downstream and keeping the original data
> intact, which is a good design principle).

I agree with the idea that if you know what the encoding is, then why not just use that knowledge to convert to UTF-8, rather than use the encoding header. I'm not sure how complete the support in "git" is for encoding headers. Does *everything* that reads commit messages respect the encoding header and handle it properly? The documentation seems to discourage their use altogether, and in any event the main use case seems to be a project where everyone has settled on the same legacy encoding to use everywhere, which is not really the case for our situation.

If we want to abandon the "fallbackEncoding" direction, then the only other option I see is:

Step 1) Try to decode in UTF-8. If that succeeds, then it is almost certain that the text really was in UTF-8 (due to the design of UTF-8). Make the commit in UTF-8.

Step 2) If it failed to decode in UTF-8, either use heuristics to detect the encoding, or query the OS for the current code page and assume it's that. Then use the selected encoding to convert to UTF-8.

Only the heuristics option will satisfy the case of more than 1 encoding (in addition to UTF-8) being present in the Perforce depot, and in any event the current code page may not help at all. I'm not really familiar with what encoding-detection heuristics are available for Python, or how reliable, stable or performant they are. It would also involve taking on another library dependency.

I will take a look at the encoding-detection options out there.
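The two steps above could be sketched as follows (a hypothetical policy, not an existing git-p4 option; `locale.getpreferredencoding` stands in for "query the OS for the current code page"):

```python
import locale

def decode_two_step(raw: bytes) -> str:
    # Step 1: UTF-8 first. A successful decode almost certainly means
    # the text really was UTF-8, by design of the encoding.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass
    # Step 2: assume the OS-preferred code page. errors='replace' keeps
    # the import running even if that guess is also wrong.
    return raw.decode(locale.getpreferredencoding(False), errors='replace')
```

As noted above, this only helps when the depot's single non-UTF-8 encoding happens to match the current code page; a depot with several legacy encodings would still need per-changelist detection.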
Oops, noticed a typo.... the paragraph should read:

I'm not sure I understand your scenario. If the majority of commits are in UTF-8, but there are 2 with the same *non* UTF-8 encoding (say "cp1252"), then just set "fallbackEncoding" to "cp1252" and all the commits will display fine.
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
> On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>>
>> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> >> The key thing that I'm trying to point out here is that the encoding is
>> >> not necessarily consistent between different commits. The changes that
>> >> you have proposed force you to pick one encoding that will be used for
>> >> every commit. If it's wrong then data will be corrupted, and there is
>> >> no option provided to avoid that. The only way I can see to avoid this
>> >> issue is to not attempt to re-encode the data - just pass it directly
>> >> to git.
>> > ...
> Are you talking about a scenario where most of the commits are UTF-8,
> one is "cp1252" and another one is "cp1251", so a total of 3 encodings
> are used in the Perforce depot? I don't think that is a common scenario.

Yes. I think that is where "not necessarily consistent between different commits" leads us to---not limited only to two encodings.

> I agree with the idea that if you know what the encoding is, then
> why not just use that knowledge to convert that to UTF-8, rather
> than use the encoding header.
diff --git a/git-p4.py b/git-p4.py
index 8407ec5c7a..8a97ff3dd2 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -764,15 +764,19 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
         while True:
             entry = marshal.load(p4.stdout)
             if bytes is not str:
-                # Decode unmarshalled dict to use str keys and values, except
-                # for cases where the values may not be valid UTF-8.
-                binary_keys = ('data', 'path', 'clientFile', 'Description',
-                               'desc', 'Email', 'FullName', 'Owner', 'time',
-                               'user', 'User')
+                # Decode unmarshalled dict to use str keys and values where it
+                # is expected that the data is always valid UTF-8.
+                text_keys = ('action', 'change', 'Change', 'Client', 'code',
+                             'fileSize', 'headAction', 'headRev', 'headType',
+                             'Jobs', 'label', 'options', 'perm', 'rev', 'Root',
+                             'Status', 'type', 'Update')
+                text_key_prefixes = ('action', 'File', 'job', 'rev', 'type',
+                                     'View')
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    if isinstance(value, bytes) and not (key in binary_keys or key.startswith('depotFile')):
+                    if isinstance(value, bytes) and (key in text_keys or
+                            any(filter(key.startswith, text_key_prefixes))):
                         value = value.decode()
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
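The effect of the patch above on an unmarshalled dict can be sketched like this (abbreviated key lists; the sample data is made up): metadata keys known to be ASCII/UTF-8 are decoded to str, while free-form fields such as 'desc' are left as bytes for later handling.

```python
# Abbreviated versions of the patch's key lists, for illustration only.
text_keys = ('change', 'code', 'Status')
text_key_prefixes = ('rev', 'File')

def decode_entry(entry):
    decoded = {}
    for key, value in entry.items():
        key = key.decode()
        # Decode only keys whose values are known-safe metadata; anything
        # else (e.g. 'desc') stays as raw bytes.
        if isinstance(value, bytes) and (key in text_keys or
                any(key.startswith(p) for p in text_key_prefixes)):
            value = value.decode()
        decoded[key] = value
    return decoded

# 'desc' holds cp1252 bytes that are not valid UTF-8; it survives untouched.
entry = {b'change': b'12345', b'desc': b'caf\xe9'}
print(decode_entry(entry))   # {'change': '12345', 'desc': b'caf\xe9'}
```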