diff mbox

dash bug: double-quoted "\" breaks glob protection for next char

Message ID 20180224003344.GA3354@gondor.apana.org.au (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Headers show

Commit Message

Herbert Xu Feb. 24, 2018, 12:33 a.m. UTC
On Wed, Feb 21, 2018 at 11:47:58PM +0100, Harald van Dijk wrote:
> On 2/21/18 2:50 PM, Denys Vlasenko wrote:
> >I propose replacing this:
> 
> Agreed, that looks better. Thanks!

Good work guys.  However, could you check to see if this patch
works too? It passes all my tests so far.

---8<---
The commit "[EXPAND] Move parse-time quote flag detection to
run-time" broke the following case:

	case \\zzzz in "\*") echo bad;; esac

The reason is that the naked backslash gets attached to the newly
added backslash added for the special character "*" in preglob,
IOW you end up with "\\*" instead of just "\*" as it should be.

The reason the naked backslash exists in the first place is because
otherwise we cannot tell the difference between "\[" and "\\[" when
it occurs in a quoted pattern context such as "${var#\[}"

This patch restores the original behaviour for the buggy case
while keeping the naked backslash for the quoted pattern case.

Fixes: 7cfd8be0dc83 ("[EXPAND] Move parse-time quote flag...")
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Suggested-by: Harald van Dijk <harald@gigawatt.nl>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Harald van Dijk Feb. 24, 2018, 9:47 a.m. UTC | #1
On 2/24/18 1:33 AM, Herbert Xu wrote:
> On Wed, Feb 21, 2018 at 11:47:58PM +0100, Harald van Dijk wrote:
>> On 2/21/18 2:50 PM, Denys Vlasenko wrote:
>>> I propose replacing this:
>>
>> Agreed, that looks better. Thanks!
> 
> Good work guys.  However, could you check to see if this patch
> works too? It passes all my tests so far.

It seems like the new control character doesn't get escaped in one place 
where it should be, and gets lost because of that:

Unpatched:

$ dash -c 'x=`printf \\\211X`; echo $x | cat -v'
M-^IX

Patched:

$ src/dash -c 'x=`printf \\\211X`; echo $x | cat -v'
X

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Feb. 24, 2018, 4:52 p.m. UTC | #2
On Sat, Feb 24, 2018 at 10:47:07AM +0100, Harald van Dijk wrote:
>
> It seems like the new control character doesn't get escaped in one place
> where it should be, and gets lost because of that:
> 
> Unpatched:
> 
> $ dash -c 'x=`printf \\\211X`; echo $x | cat -v'
> M-^IX
> 
> Patched:
> 
> $ src/dash -c 'x=`printf \\\211X`; echo $x | cat -v'
> X

Hmm, it works here.  Can you check that your Makefile is up-to-date
and the generated syntax.c looks like this:

const char basesyntax[] = {
      CEOF,    CSPCL,   CWORD,   CCTL,
      CCTL,    CCTL,    CCTL,    CCTL,
      CCTL,    CCTL,    CCTL,    CCTL,

Cheers,
Denys Vlasenko Feb. 28, 2018, 10:14 p.m. UTC | #3
On Sat, Feb 24, 2018 at 5:52 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sat, Feb 24, 2018 at 10:47:07AM +0100, Harald van Dijk wrote:
>>
>> It seems like the new control character doesn't get escaped in one place
>> where it should be, and gets lost because of that:
>>
>> Unpatched:
>>
>> $ dash -c 'x=`printf \\\211X`; echo $x | cat -v'
>> M-^IX
>>
>> Patched:
>>
>> $ src/dash -c 'x=`printf \\\211X`; echo $x | cat -v'
>> X
>
> Hmm, it works here.  Can you check that your Makefile is up-to-date
> and the generated syntax.c looks like this:
>
> const char basesyntax[] = {
>       CEOF,    CSPCL,   CWORD,   CCTL,
>       CCTL,    CCTL,    CCTL,    CCTL,
>       CCTL,    CCTL,    CCTL,    CCTL,
>
> Cheers,

Guys, while you work on fixing this, consider taking a look at
some more parsing cases in this bbox bug:

https://bugs.busybox.net/show_bug.cgi?id=10821

I suppose some of them may fail in dash too (I did not test, sorry).
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Harald van Dijk Feb. 28, 2018, 11:04 p.m. UTC | #4
On 2/28/18 11:14 PM, Denys Vlasenko wrote:
> Guys, while you work on fixing this, consider taking a look at
> some more parsing cases in this bbox bug:
> 
> https://bugs.busybox.net/show_bug.cgi?id=10821
> 
> I suppose some of them may fail in dash too (I did not test, sorry).

$'...' isn't supported in dash, but the underlying problem does appear 
to exist in dash too:

$ bash -c 'x=yz; echo "${x#'"'y'"'}"'
z

$ dash -c 'x=yz; echo "${x#'"'y'"'}"'
yz

(That is, they are executing x=yz; echo "${x#'y'}".)

POSIX says that in "${var#pattern}" (and the same for ##, % and %%), the 
pattern is considered unquoted regardless of the outer quotation marks. 
Because of that, the single quote characters should not be taken 
literally, but should be taken as quoting the y. ksh, posh and zsh agree 
with bash.

But: POSIX does not say the same for +/-/..., so dash is correct there:

$ bash -c 'x=yz; echo "${x+'"'y'"'}"'
'y'

$ dash -c 'x=yz; echo "${x+'"'y'"'}"'
'y'

Because of that, for that report's follow-up comment about

   (unset x; echo "${x-$'\x41'}")

I would not have expected this to be taken as a $'...' string. ksh 
(which does support $'...' strings too) prints the literal text $'\x41', 
and so does bash if invoked as sh.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Harald van Dijk March 1, 2018, 7:24 p.m. UTC | #5
On 01/03/2018 00:04, Harald van Dijk wrote:
> $ bash -c 'x=yz; echo "${x#'"'y'"'}"'
> z
> 
> $ dash -c 'x=yz; echo "${x#'"'y'"'}"'
> yz
> 
> (That is, they are executing x=yz; echo "${x#'y'}".)
> 
> POSIX says that in "${var#pattern}" (and the same for ##, % and %%), the 
> pattern is considered unquoted regardless of the outer quotation marks. 
> Because of that, the single quote characters should not be taken 
> literally, but should be taken as quoting the y. ksh, posh and zsh agree 
> with bash.

Unfortunately, this causes another problem with all of the backslash 
approaches so far:

   x='\\\\'; printf "%s\n" "${x#'\\\\'}"

This should print a blank line. (bash, ksh, posh and zsh agree.)

Here, dash's parser stores '$\$\', where $ is a control character. 
preglob would need to turn this into \\\\\\\\. The problem is again that 
preglob cannot increase the string length. Perhaps the parser needs to 
store this as '$\$\$\$\', $ being either CTLESC or that new CTLBACK? 
Either way, it requires some more invasive changes.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 2, 2018, 7:49 a.m. UTC | #6
On Thu, Mar 01, 2018 at 08:24:22PM +0100, Harald van Dijk wrote:
> On 01/03/2018 00:04, Harald van Dijk wrote:
> >$ bash -c 'x=yz; echo "${x#'"'y'"'}"'
> >z
> >
> >$ dash -c 'x=yz; echo "${x#'"'y'"'}"'
> >yz
> >
> >(That is, they are executing x=yz; echo "${x#'y'}".)
> >
> >POSIX says that in "${var#pattern}" (and the same for ##, % and %%), the
> >pattern is considered unquoted regardless of the outer quotation marks.
> >Because of that, the single quote characters should not be taken
> >literally, but should be taken as quoting the y. ksh, posh and zsh agree
> >with bash.
> 
> Unfortunately, this causes another problem with all of the backslash
> approaches so far:
> 
>   x='\\\\'; printf "%s\n" "${x#'\\\\'}"
> 
> This should print a blank line. (bash, ksh, posh and zsh agree.)
> 
> Here, dash's parser stores '$\$\', where $ is a control character. preglob
> would need to turn this into \\\\\\\\. The problem is again that preglob
> cannot increase the string length. Perhaps the parser needs to store this as
> '$\$\$\$\', $ being either CTLESC or that new CTLBACK? Either way, it
> requires some more invasive changes.

These are different issues.  dash's parser currently does not
understand nested quoting in patterns at all.  That is, if your
parameter expansion are within double quotes, then dash at the
parser level will consider the pattern to be double-quoted.  Thus
any nested single-quotes will be literals instead of actual quotes.

If we fix this in the parser then everything should just work.

Cheers,
Harald van Dijk March 2, 2018, 10:58 a.m. UTC | #7
On 02/03/2018 08:49, Herbert Xu wrote:
> On Thu, Mar 01, 2018 at 08:24:22PM +0100, Harald van Dijk wrote:
>> On 01/03/2018 00:04, Harald van Dijk wrote:
>>> $ bash -c 'x=yz; echo "${x#'"'y'"'}"'
>>> z
>>>
>>> $ dash -c 'x=yz; echo "${x#'"'y'"'}"'
>>> yz
>>>
>>> (That is, they are executing x=yz; echo "${x#'y'}".)
>>>
>>> POSIX says that in "${var#pattern}" (and the same for ##, % and %%), the
>>> pattern is considered unquoted regardless of the outer quotation marks.
>>> Because of that, the single quote characters should not be taken
>>> literally, but should be taken as quoting the y. ksh, posh and zsh agree
>>> with bash.
>>
>> Unfortunately, this causes another problem with all of the backslash
>> approaches so far:
>>
>>    x='\\\\'; printf "%s\n" "${x#'\\\\'}"
>>
>> This should print a blank line. (bash, ksh, posh and zsh agree.)
>>
>> Here, dash's parser stores '$\$\', where $ is a control character. preglob
>> would need to turn this into \\\\\\\\. The problem is again that preglob
>> cannot increase the string length. Perhaps the parser needs to store this as
>> '$\$\$\$\', $ being either CTLESC or that new CTLBACK? Either way, it
>> requires some more invasive changes.
> 
> These are different issues.  dash's parser currently does not
> understand nested quoting in patterns at all.  That is, if your
> parameter expansion are within double quotes, then dash at the
> parser level will consider the pattern to be double-quoted.  Thus
> any nested single-quotes will be literals instead of actual quotes.

That's the same thing though. The problem with the backslashes is also 
that dash sees them as double-quoted when they should be seen as 
unquoted, and the approach taken in commit 
7cfd8be0dc83342b4a71f3a8e5b7efab4670e50c that lasts to this day was 
specifically to *not* fix this in the parser, but to simply have the 
parser record enough information so that quote status can be determined 
and patched up during expansion. It's just that in the case of single 
quotes, expansion was never modified to recognise them. Thinking some 
more, I don't think the parser actually records enough information to 
let that work.

> If we fix this in the parser then everything should just work.

Right, that's the approach FreeBSD sh has taken that I referred to in my 
message from Feb 18, that I'd personally prefer as well. It basically 
involves reverting 7cfd8be0dc83342b4a71f3a8e5b7efab4670e50c, setting 
syntax to BASESYNTAX/DQSYNTAX (whichever is appropriate) when the parse 
of a variable expansion starts, and finding a sensible way to change the 
syntax back to BASESYNTAX/DQSYNTAX/ARISYNTAX when it ends. In FreeBSD 
sh, an explicit stack of syntaxes is created for this, but that might be 
avoidable: with slight modifications to what gets stored in the byte 
after CTLVAR/CTLARI, it might be possible to go back through the parser 
output to determine the syntax to revert to. I'll see if I can get that 
working.

> Cheers,
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 2, 2018, 3:28 p.m. UTC | #8
On Fri, Mar 02, 2018 at 11:58:41AM +0100, Harald van Dijk wrote:
>
> >If we fix this in the parser then everything should just work.
> 
> Right, that's the approach FreeBSD sh has taken that I referred to in my
> message from Feb 18, that I'd personally prefer as well. It basically
> involves reverting 7cfd8be0dc83342b4a71f3a8e5b7efab4670e50c, setting syntax
> to BASESYNTAX/DQSYNTAX (whichever is appropriate) when the parse of a
> variable expansion starts, and finding a sensible way to change the syntax
> back to BASESYNTAX/DQSYNTAX/ARISYNTAX when it ends. In FreeBSD sh, an
> explicit stack of syntaxes is created for this, but that might be avoidable:
> with slight modifications to what gets stored in the byte after
> CTLVAR/CTLARI, it might be possible to go back through the parser output to
> determine the syntax to revert to. I'll see if I can get that working.

Yes but that's overkill just to fix single quote within patterns.
We already support nested double-quotes in patterns correctly.  As
single quotes cannot nest, it should be an easy fix.

Cheers,
Harald van Dijk March 2, 2018, 4:05 p.m. UTC | #9
On 02/03/2018 16:28, Herbert Xu wrote:
> On Fri, Mar 02, 2018 at 11:58:41AM +0100, Harald van Dijk wrote:
>>
>>> If we fix this in the parser then everything should just work.
>>
>> Right, that's the approach FreeBSD sh has taken that I referred to in my
>> message from Feb 18, that I'd personally prefer as well. It basically
>> involves reverting 7cfd8be0dc83342b4a71f3a8e5b7efab4670e50c, setting syntax
>> to BASESYNTAX/DQSYNTAX (whichever is appropriate) when the parse of a
>> variable expansion starts, and finding a sensible way to change the syntax
>> back to BASESYNTAX/DQSYNTAX/ARISYNTAX when it ends. In FreeBSD sh, an
>> explicit stack of syntaxes is created for this, but that might be avoidable:
>> with slight modifications to what gets stored in the byte after
>> CTLVAR/CTLARI, it might be possible to go back through the parser output to
>> determine the syntax to revert to. I'll see if I can get that working.
> 
> Yes but that's overkill just to fix single quote within patterns.
> We already support nested double-quotes in patterns correctly.  As
> single quotes cannot nest, it should be an easy fix.

Single quotes indeed cannot nest, but you do need to reliably determine 
when a single quote is a special character, which gets very tricky very 
quickly with nested substitutions.

In "${x+''}", the ' is the literal ' character. In "${x#''}", the ' is a 
quote character. This part is easy, this part is just a matter of 
setting another variable when the parse of the substitution starts.

But in "${x+${y-}''}", the ' is the literal ' character. In 
"${x#${y-}''}", the ' is a quote character. This part is hard. If the 
above is done simply using another local variable, then the parse of the 
nested ${y-} would clobber that variable.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 2, 2018, 4:33 p.m. UTC | #10
On Fri, Mar 02, 2018 at 05:05:46PM +0100, Harald van Dijk wrote:
>
> But in "${x+${y-}''}", the ' is the literal ' character. In "${x#${y-}''}",
> the ' is a quote character. This part is hard. If the above is done simply
> using another local variable, then the parse of the nested ${y-} would
> clobber that variable.

I don't see why that's hard.  You just need to remember whether
you're in a pattern context (i.e., after a %/%% or #/##).  If you
are, then you need to go back to basesyntax instead of dqsyntax
until the next right brace.

Cheers,
Harald van Dijk March 2, 2018, 6:10 p.m. UTC | #11
On 02/03/2018 17:33, Herbert Xu wrote:
> On Fri, Mar 02, 2018 at 05:05:46PM +0100, Harald van Dijk wrote:
>>
>> But in "${x+${y-}''}", the ' is the literal ' character. In "${x#${y-}''}",
>> the ' is a quote character. This part is hard. If the above is done simply
>> using another local variable, then the parse of the nested ${y-} would
>> clobber that variable.
> 
> I don't see why that's hard.  You just need to remember whether
> you're in a pattern context (i.e., after a %/%% or #/##).  If you
> are, then you need to go back to basesyntax instead of dqsyntax
> until the next right brace.

Let me slightly modify my example so that the effect becomes different:

   "${x#"${y-*}"''}"

When the parser has processed

   "${x#"${y-

would the state be "in a pattern context", or "not in a pattern context"?

If the state is "in a pattern context", how do you prevent the next * 
from being taken as unquoted? It should be taken as quoted.

If it stores "not in a pattern context", and the parser is processing 
the last character in

   "${x#"${y-*}

how can it reset the state to "in a pattern context"? Where could this 
state be stored?

With arbitrarily deep nesting of variable expansions, I do not see how 
you can avoid requiring arbitrarily large state, which gets difficult 
with dash's non-recursive parser. Mind you, I do hope I'm missing 
something obvious here!

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/src/Makefile.am b/src/Makefile.am
index 139355e..b5bff06 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -60,7 +60,7 @@  init.c: mkinit $(dash_CFILES)
 nodes.c nodes.h: mknodes nodetypes nodes.c.pat
 	./$^
 
-syntax.c syntax.h: mksyntax
+syntax.c syntax.h: mksyntax parser.h
 	./$^
 
 signames.c: mksignames
diff --git a/src/expand.c b/src/expand.c
index 2a50830..aeffe82 100644
--- a/src/expand.c
+++ b/src/expand.c
@@ -244,6 +244,7 @@  argstr(char *p, int flag)
 		CTLVAR,
 		CTLBACKQ,
 		CTLENDARI,
+		CTLBACK,
 		0
 	};
 	const char *reject = spclchars;
@@ -330,6 +331,7 @@  addquote:
 				startloc++;
 			}
 			break;
+		case CTLBACK:
 		case CTLESC:
 			startloc++;
 			length++;
@@ -340,7 +342,7 @@  addquote:
 			 * backslash.
 			 */
 			if (((flag | inquotes) & (EXP_QPAT | EXP_QUOTED)) ==
-			    EXP_QPAT && *p != '\\')
+			    EXP_QPAT && (*p != '\\' || c == CTLBACK))
 				break;
 
 			goto addquote;
@@ -373,6 +375,7 @@  exptilde(char *startp, char *p, int flag)
 
 	while ((c = *++p) != '\0') {
 		switch(c) {
+		case CTLBACK:
 		case CTLESC:
 			return (startp);
 		case CTLQUOTEMARK:
@@ -594,7 +597,7 @@  scanleft(
 		*loc2 = c;
 		if (match)
 			return loc;
-		if (quotes && *loc == (char)CTLESC)
+		if (quotes && (*loc == (char)CTLESC || *loc == (char)CTLBACK))
 			loc++;
 		loc++;
 		loc2++;
@@ -821,7 +824,8 @@  end:
 	if (subtype != VSNORMAL) {	/* skip to end of alternative */
 		int nesting = 1;
 		for (;;) {
-			if ((c = (signed char)*p++) == CTLESC)
+			if ((c = (signed char)*p++) == CTLESC ||
+			    c == CTLBACK)
 				p++;
 			else if (c == CTLBACKQ) {
 				if (varlen >= 0)
@@ -1052,7 +1056,7 @@  ifsbreakup(char *string, int maxargs, struct arglist *arglist)
 
 				q = p;
 				c = *p++;
-				if (c == (char)CTLESC)
+				if (c == (char)CTLESC || c == (char)CTLBACK)
 					c = *p++;
 
 				isifs = strchr(ifs, c);
@@ -1684,7 +1688,7 @@  _rmescapes(char *str, int flag)
 			notescaped = globbing;
 			continue;
 		}
-		if (*p == (char)CTLESC) {
+		if (*p == (char)CTLESC || *p == (char)CTLBACK) {
 			p++;
 			if (notescaped)
 				*q++ = '\\';
diff --git a/src/jobs.c b/src/jobs.c
index 4f02e38..a3aef20 100644
--- a/src/jobs.c
+++ b/src/jobs.c
@@ -1386,6 +1386,7 @@  cmdputs(const char *s)
 	while ((c = *p++) != 0) {
 		str = 0;
 		switch (c) {
+		case CTLBACK:
 		case CTLESC:
 			c = *p++;
 			break;
diff --git a/src/mystring.c b/src/mystring.c
index 0106bd2..2573d1f 100644
--- a/src/mystring.c
+++ b/src/mystring.c
@@ -62,7 +62,7 @@  const char spcstr[] = " ";
 const char snlfmt[] = "%s\n";
 const char dolatstr[] = { CTLQUOTEMARK, CTLVAR, VSNORMAL, '@', '=',
 			  CTLQUOTEMARK, '\0' };
-const char qchars[] = { CTLESC, CTLQUOTEMARK, 0 };
+const char qchars[] = { CTLESC, CTLQUOTEMARK, CTLBACK, 0 };
 const char illnum[] = "Illegal number: %s";
 const char homestr[] = "HOME";
 
diff --git a/src/parser.c b/src/parser.c
index 382658e..22fc84a 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -930,7 +930,7 @@  readtoken1(int firstc, char const *syntax, char *eofmark, int striptabs)
 			case CBACK:
 				c = pgetc2();
 				if (c == PEOF) {
-					USTPUTC(CTLESC, out);
+					USTPUTC(CTLBACK, out);
 					USTPUTC('\\', out);
 					pungetc();
 				} else if (c == '\n') {
@@ -944,6 +944,7 @@  readtoken1(int firstc, char const *syntax, char *eofmark, int striptabs)
 							eofmark != NULL
 						)
 					) {
+						USTPUTC(CTLBACK, out);
 						USTPUTC('\\', out);
 					}
 					USTPUTC(CTLESC, out);
diff --git a/src/parser.h b/src/parser.h
index 2875cce..54c02eb 100644
--- a/src/parser.h
+++ b/src/parser.h
@@ -45,7 +45,8 @@ 
 #define	CTLARI -122		/* arithmetic expression */
 #define	CTLENDARI -121
 #define	CTLQUOTEMARK -120
-#define	CTL_LAST -120		/* last 'special' character */
+#define	CTLBACK -119
+#define	CTL_LAST -119		/* last 'special' character */
 
 /* variable substitution byte (follows CTLVAR) */
 #define VSTYPE	0x0f		/* type of variable substitution */
diff --git a/src/show.c b/src/show.c
index 4a049e9..ed840aa 100644
--- a/src/show.c
+++ b/src/show.c
@@ -166,6 +166,7 @@  sharg(union node *arg, FILE *fp)
 	bqlist = arg->narg.backquote;
 	for (p = arg->narg.text ; *p ; p++) {
 		switch ((signed char)*p) {
+		case CTLBACK:
 		case CTLESC:
 			putc(*++p, fp);
 			break;
@@ -311,6 +312,7 @@  trstring(char *s)
 		case '\r':  c = 'r';  goto backslash;
 		case '"':  c = '"';  goto backslash;
 		case '\\':  c = '\\';  goto backslash;
+		case CTLBACK:
 		case CTLESC:  c = 'e';  goto backslash;
 		case CTLVAR:  c = 'v';  goto backslash;
 		case CTLBACKQ:  c = 'q';  goto backslash;