Message ID | cover.1714215826.git.herbert@gondor.apana.org.au (mailing list archive) |
---|---|
Headers | show |
Series | Add multi-byte support | expand |
Hey. On Sat, 2024-04-27 at 19:03 +0800, Herbert Xu wrote: > This patch series adds multi-byte support to dash. For now only > fnmatch is supported as the native pmatch function has not been > modified to support multi-byte characters. Nothing against the functionality per se, but I think for all scripts that assumed dash's (and thus on may systems /bin/sh's) current behaviour of being C locale only even without explicitly setting LC_ALL=C, this may have quite some subtle issues. AFAIU, in the C locale, all bytes is a character, and thus in particular pattern matching notation is defined for every defined outcome of command substitution respectively every content of variables (that is: in every(!) locale every byte other than NUL). For example: ************ A while ago I've asked on the Austin Group mailing list for a portable way to get command substitution without stripping of trailing newlines. Long story short: The recommended way was to add a sentinel character '.' at the end of the output within the command substitution and strip that off later with parameter expansion. But despite of the very special properties[0] of '.', it's apparently still required to set LC_ALL=C when stripping the sentinel, because the pattern matching notation in ${foo%.} is defined only on strings of characters, not on strings of bytes. Back then, Harald van Dijk had some ideas how that might be resolved for good, but IIRC none of the shell implementors seemed to really have interest. My goal was to make a portable function like command_subst_with_newlines "eval-ed-command-string" "target-variable-name" which, with the requirement of setting LC_ALL proved more or less impossible when the function should have no side effects (like keeping the LC_ALL overridden, over possibly overriding some existing var like OLD_LC_ALL). Anyway... I could image, that if dash becomes multi-byte aware, there might be more or less subtle surprises. Cheers, Chris. [0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html "The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation."
On Sat, Apr 27, 2024 at 11:31:43PM +0200, Christoph Anton Mitterer wrote: > > Long story short: > The recommended way was to add a sentinel character '.' at the end of > the output within the command substitution and strip that off later > with parameter expansion. > But despite of the very special properties[0] of '.', it's apparently > still required to set LC_ALL=C when stripping the sentinel, because the > pattern matching notation in ${foo%.} is defined only on strings of > characters, not on strings of bytes. Are you talking about a theoretical undefined condition, or an actual one? Which shell doesn't deal with ${foo%.} correctly? Cheers,
On Sun, 2024-04-28 at 08:49 +0800, Herbert Xu wrote: > > Are you talking about a theoretical undefined condition, or an > actual one? Which shell doesn't deal with ${foo%.} correctly? Well my main point for this mail was how dash does it (and not just about '.'). I guess it simply resorts to fnmatch(3) so it will probably do whatever the system's libc does? But it's really not just about '.' (which is more or less a rather safe case). If someone assumes dash would always be LC_ALL=C, that any such operation where e.g. a byte is used that is part of a multi-byte character might, AFAIU, lead to unexpected results. E.g. an fnmatch implementation may just decide to stop and give an error at the first byte that's not a valid characters, right? Also there were some locales like Big5, which had the weird property of having multibyte chars that contain byte sequences that form other valid chars (see [0]). Not sure if I remember that correctly, but it might have been undefined when stripping of the "shorter" character from that. Which again, couldn't have happened so far in dash, as it simply was C local only. Harald van Dijk made some extensive tests back then, how different shells behave. I think the austrin-group-l mailing list archive is not publicly available, but if you have an account it was in that mail: https://collaboration.opengroup.org/operational/mailarch.php?soph=N&action=show&archive=austin-group-l&num=34339&limit=100&offset=0 which showed that shells do indeed behave differently (the tests weren't for '.'). Cheers, Chris. [0] https://unix.stackexchange.com/questions/383217/shell-keep-trailing-newlines-n-in-command-substitution/383411#383411
On Sat, Apr 27, 2024, at 9:19 PM, Christoph Anton Mitterer wrote: > If someone assumes dash would always be LC_ALL=C, that any such > operation where e.g. a byte is used that is part of a multi-byte > character might, AFAIU, lead to unexpected results. What's your point? Do you think dash should not acquire multibyte support, to avoid breaking scripts that make such assumptions?
On Sat, 2024-04-27 at 21:35 -0400, Lawrence Velázquez wrote: > What's your point? Do you think dash should not acquire multibyte > support, to avoid breaking scripts that make such assumptions? No, as I've said I have nothing against the functionality per se, but just wanted to give a heads up that this might have some not immediately obvious consequences. Maybe it makes sense to add some big fat warning in a NEWS or changelog file, telling people that their scripts may now run under another locale and perhaps also giving some explanation on possible subtle consequences? Cheers, Chris.
Just to give a simple example: x.sh: char=ä printf '%s' "$char" | hd byte="$( printf '\244' )" printf '%s' "$byte" | hd printf '%s' "${char%$byte}" | hd $ LC_ALL=en_US.UTF-8 bash x.sh 00000000 c3 a4 |..| 00000002 00000000 a4 |.| 00000001 00000000 c3 |.| 00000001 $ LC_ALL=C bash x.sh 00000000 c3 a4 |..| 00000002 00000000 a4 |.| 00000001 00000000 c3 |.| 00000001 $ dash x.sh 00000000 c3 a4 |..| 00000002 00000000 a4 |.| 00000001 00000000 c3 |.| 00000001 but: $ LC_ALL=en_US.UTF-8 zsh x.sh 00000000 c3 a4 |..| 00000002 00000000 a4 |.| 00000001 00000000 c3 a4 |..| 00000002 $ LC_ALL=C zsh x.sh 00000000 c3 a4 |..| 00000002 00000000 a4 |.| 00000001 00000000 c3 |.| 00000001 Again, not saying that this means dash shouldn't support other locales... but people may (of course wrongly) rely on the behaviour that dash is always in the C locale. Even if that doesn't change right now with the fnmatch() implementation, it could in principle do so with any change there. Ceers, Chris.
On 28/04/2024 01:49, Herbert Xu wrote: > On Sat, Apr 27, 2024 at 11:31:43PM +0200, Christoph Anton Mitterer wrote: >> >> Long story short: >> The recommended way was to add a sentinel character '.' at the end of >> the output within the command substitution and strip that off later >> with parameter expansion. >> But despite of the very special properties[0] of '.', it's apparently >> still required to set LC_ALL=C when stripping the sentinel, because the >> pattern matching notation in ${foo%.} is defined only on strings of >> characters, not on strings of bytes. > > Are you talking about a theoretical undefined condition, or an > actual one? Which shell doesn't deal with ${foo%.} correctly? The way you are implementing it, once you get to pmatch(), arguably you will not handle ${foo%.} correctly. Consider an UTF-8 locale, where '\303' is not a valid multibyte character. In this locale, consider foo=$(printf '\303.') foo=${foo%.} This is something I expect to set foo to '\303', and it does in all shells I know of, despite POSIX not saying this needs to work. The way you are implementing multibyte character support, if I am reading it right, as long as a full multibyte character has not been read, the next byte will be taken as part of that multibyte character, meaning you will take '\303.' as a single invalid multibyte character. At the same time, '\303\251' is a valid multibyte character, and '\251' is not. So also consider foo=$(printf '\303\251') foo=${foo%$(printf '\251')} Here, it is not clear what the correct result is, and indeed, shells disagree. bosh, ksh, zsh, and my shell do not break up characters, which I believe to be the most sensible behaviour. bash and mksh do. The corner cases need to be carefully considered in order to figure out how to write the multibyte character support core functionality. Cheers, Harald van Dijk
On Sun, Apr 28, 2024 at 03:50:57PM +0100, Harald van Dijk wrote: > > The way you are implementing it, once you get to pmatch(), arguably you will > not handle ${foo%.} correctly. I took the easy way out so for now only the fnmatch path works properly. But eventually I will get to the pmatch path too, especially because fnmatch(3) seems to slow down quite a bit once locale is set. > This is something I expect to set foo to '\303', and it does in all shells I > know of, despite POSIX not saying this needs to work. The way you are > implementing multibyte character support, if I am reading it right, as long > as a full multibyte character has not been read, the next byte will be taken > as part of that multibyte character, meaning you will take '\303.' as a > single invalid multibyte character. A multi-byte character has to be valid according to mbrtowc before I mark it with MBCHAR. Otherwise it'll be treated as just a single-byte character. Cheers,