diff mbox

expand: Do not quote backslashes in unquoted parameter expansion

Message ID 20180327085530.GA12680@gondor.apana.org.au (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Headers show

Commit Message

Herbert Xu March 27, 2018, 8:55 a.m. UTC
On Mon, Mar 26, 2018 at 07:25:20PM +0200, Martijn Dekker wrote:
> Op 26-03-18 om 17:38 schreef Harald van Dijk:
> > And not by dash 0.5.4. Like I wrote, dash 0.5.5 had some bugs that were
> > fixed in 0.5.6, which mostly restored the behaviour to match <0.5.5.
> 
> Ah, sorry. dash 0.5.4 and earlier don't compile on my system, so they
> are not included in my conveniently accessible arsenal of test shells.
> 
> > As for my patches, that was by accident and doesn't work reliably. When
> > the shell sees no metacharacters, pathname expansion is bypassed, and
> > backslash isn't considered a metacharacter. Which got me to my original
> > example of /de\v: there are no metacharacters in there, so the shell
> > doesn't look to see if it matches anything. Which seems highly
> > desirable: the shell shouldn't need to hit the file system for words not
> > containing metacharacters. The only way then to get consistent behaviour
> > is if the backslash is taken as quoted, so I'm not tempted to argue for
> > the behaviour you're hoping for, sorry. :)

Here is a better example:

	a="/*/\nullx" b="/*/\null"; printf "%s\n" $a $b

dash currently prints

	/*/\nullx
	/*/\null

bash prints

	/*/\nullx
	/dev/null

You may argue the bash behaviour is inconsistent but it actually
makes sense.  What happens is that quote removal only applies to
the original token as seen by the shell.  It is never applied to
the result of parameter expansion.

Now you may ask why on earth does the second line say "/dev/null"
instead of "/dev/\null".  Well that's because it is not the quote
removal step that removed the backslash, but the pathname expansion.

The fact that the /de\v does not become /dev even though it exists
is just the result of the optimisation to avoid unnecessarily
calling stat(2).  I have checked POSIX and I don't see anything
that forbids this behaviour.

So going back to dash yes I think we should adopt the bash behaviour
for pathname expansion and keep the existing case semantics.

This patch does exactly that.  Note that this patch does not work
unless you have already applied

	https://patchwork.kernel.org/patch/10306507/

because otherwise the optimisation mentioned above does not get
detected correctly and we will end up doing quote removal twice.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Harald van Dijk March 27, 2018, 9:16 a.m. UTC | #1
On 27/03/2018 10:55, Herbert Xu wrote:
> Here is a better example:
> 
> 	a="/*/\nullx" b="/*/\null"; printf "%s\n" $a $b
> 
> dash currently prints
> 
> 	/*/\nullx
> 	/*/\null
> 
> bash prints
> 
> 	/*/\nullx
> 	/dev/null
> 
> You may argue the bash behaviour is inconsistent but it actually
> makes sense.  What happens is that quote removal only applies to
> the original token as seen by the shell.  It is never applied to
> the result of parameter expansion.

Huh? Here's what POSIX says:

 > The order of word expansion shall be as follows:

 > 1. ... parameter expansion (see Parameter Expansion) ... shall be 
performed, ...

 > ...

 > 3. Pathname expansion (see Pathname Expansion) shall be performed, 
unless set -f is in effect.

 > 4. Quote removal (see Quote Removal) shall always be performed last.

If you say that quote removal takes place on the original token (meaning 
before parameter expansion), and if parameter expansion takes place 
before pathname expansion, then there's nothing left to allow \* to 
behave differently from *.

> Now you may ask why on earth does the second line say "/dev/null"
> instead of "/dev/\null".  Well that's because it is not the quote
> removal step that removed the backslash, but the pathname expansion.
> 
> The fact that the /de\v does not become /dev even though it exists
> is just the result of the optimisation to avoid unnecessarily
> calling stat(2).  I have checked POSIX and I don't see anything
> that forbids this behaviour.

POSIX never actually says this optimisation is allowed. The only thing 
that allows it is the fact that it produces the same results anyway. If 
that stops, then checking the file system for matches becomes required.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 9:44 a.m. UTC | #2
On Tue, Mar 27, 2018 at 11:16:29AM +0200, Harald van Dijk wrote:
>
> If you say that quote removal takes place on the original token (meaning
> before parameter expansion), and if parameter expansion takes place before
> pathname expansion, then there's nothing left to allow \* to behave
> differently from *.

Either you misunderstood me or you misread POSIX.  Quote removal
never applies to the backslashes which occur as a result of parameter
expansion:

	2.6.7 Quote Removal
	The quote characters ( <backslash>, single-quote, and
	double-quote) that were present in the original word shall be
	removed unless they have themselves been quoted.

It's clear that only quote characters in the *original* word will
be removed.

> POSIX never actually says this optimisation is allowed. The only thing that
> allows it is the fact that it produces the same results anyway. If that
> stops, then checking the file system for matches becomes required.

It doesn't disallow it either.  Can you show me a shell that doesn't
apply this optimisation?

Cheers,
Harald van Dijk March 27, 2018, 11:07 a.m. UTC | #3
On 27/03/2018 11:44, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 11:16:29AM +0200, Harald van Dijk wrote:
>>
>> If you say that quote removal takes place on the original token (meaning
>> before parameter expansion), and if parameter expansion takes place before
>> pathname expansion, then there's nothing left to allow \* to behave
>> differently from *.
> 
> Either you misunderstood me or you misread POSIX.

I misunderstood you. Given v='\' and the word \\$v, I took the result of 
parameter expansion as \\\ where the first two backslashes come from the 
original word, you take the result of parameter expansion as only \. 
Because of that, when you wrote quote removal is never applied to the 
results of parameter expansion, I didn't see what you meant. I think 
you're right in your terminology, the result of parameter expansion is 
just \, to get \\\ it would need to be phrased as something like "the 
result of applying parameter expansion". Thanks for the clarification.

>> POSIX never actually says this optimisation is allowed. The only thing that
>> allows it is the fact that it produces the same results anyway. If that
>> stops, then checking the file system for matches becomes required.
> 
> It doesn't disallow it either.

If POSIX specifies a result, and a shell applies an optimisation that 
causes a different result to be produced, doesn't that inherently 
disallow that optimisation? There may be some confusion and/or 
disagreement over what exactly POSIX specifies and/or intends to 
specify, but I don't see how you can argue that POSIX specifies result 
A, but it's okay to apply an optimisation that produces result B.

> Can you show me a shell that doesn't apply this optimisation?

Can you show me any shell other than bash that lets this optimisation 
affect the results?

Like I wrote in my initial message, all other shells I tested take a 
backslash that comes from a variable after parameter expansion as 
effectively quoted. Given your b="/*/\null", bash is the only shell I 
know that expands $b to /dev/null. All others do perform pathname 
expansion, but check to see if a file \null exists in any top-level 
directory.

Given a="/de[v]/\null" and b="/dev/\null", all shells other than bash, 
even dash, agree that $a and $b expand to '/dev/\null' if that file 
exists, or the original string if that file doesn't exist. *That* is 
what allows the optimisation in the case of $b: expanding to either 
'/dev/\null' or '/dev/\null', depending on some condition, can be done 
without evaluating that condition.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 12:19 p.m. UTC | #4
On Tue, Mar 27, 2018 at 01:07:21PM +0200, Harald van Dijk wrote:
>
> If POSIX specifies a result, and a shell applies an optimisation that causes
> a different result to be produced, doesn't that inherently disallow that
> optimisation? There may be some confusion and/or disagreement over what
> exactly POSIX specifies and/or intends to specify, but I don't see how you
> can argue that POSIX specifies result A, but it's okay to apply an
> optimisation that produces result B.

Wait, you're talking about something completely different.  The
crux of the issue is whether POSIX allows the backslash to escape
the character after it or does it have to be a literal backslash.

In fact POSIX is perfectly clear on this:

	2.13.1 Patterns Matching a Single Character

	The following patterns matching a single character shall match a
	single character: ordinary characters, special pattern characters,
	and pattern bracket expressions. The pattern bracket expression
	also shall match a single collating element. A <backslash>
	character shall escape the following character. The escaping
	<backslash> shall be discarded. If a pattern ends with an
	unescaped <backslash>, it is unspecified whether the pattern
	does not match anything or the pattern is treated as invalid.

Nowhere does it say that backslashes that come from the result of
parameter expansion is always literal.

So it's clear that any shell that treats the backslash as a literal
backslash is already broken.

> Can you show me any shell other than bash that lets this optimisation affect
> the results?

The fact is that the other shells are not doing what they do because
they're not doing this optimisation.  They do what they do because
they have broken POSIX because they always treat backslashes that
arise from parameter expansion as literal backslashes.

FWIW it wouldn't be hard to iron out the anomalous case of /d\ev.
You just have treat the backslash as a metacharacter.

Cheers,
Harald van Dijk March 27, 2018, 12:57 p.m. UTC | #5
On 27/03/2018 14:19, Herbert Xu wrote:
> Nowhere does it say that backslashes that come from the result of
> parameter expansion is always literal.
> 
> So it's clear that any shell that treats the backslash as a literal
> backslash is already broken.

By the literal POSIX wording, yes, but the fact remains that that's what 
all shells aside from bash were already doing, so that may mean it's a 
POSIX defect. That's why I qualified with "POSIX specifies and/or 
intends to specify".

But I'm tempted to agree with your and Martijn's point of view now that 
it's better to just implement it the way POSIX specifies, that it's more 
useful that way.

> On Tue, Mar 27, 2018 at 01:07:21PM +0200, Harald van Dijk wrote:
>> Can you show me any shell other than bash that lets this optimisation affect
>> the results?
> 
> The fact is that the other shells are not doing what they do because
> they're not doing this optimisation.  They do what they do because
> they have broken POSIX because they always treat backslashes that
> arise from parameter expansion as literal backslashes.

Right, but they're doing this optimisation as well.

> FWIW it wouldn't be hard to iron out the anomalous case of /d\ev.
> You just have treat the backslash as a metacharacter.

That's a good point and wouldn't have too much of an impact on 
performance of existing scripts. BTW, that means both expandmeta()'s 
metachars variable, and expmeta()'s

   if (metaflag == 0 || lstat64(expdir, &statb) >= 0)

optimisation. You already got rid of the latter in your patch to prevent 
buffer overflows.

In regular /d\ev that doesn't come from a variable, the backslash will 
have been replaced by CTLESC, but it's not necessary to treat CTLESC as 
a metacharacter: in that case, either there's a match and pathname 
expansion produces /dev, or there's no match and quote removal produces 
/dev, so the optimisation is still valid. It'd only affect backslashes 
that come from a variable, so it's very limited in scope.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 3:14 p.m. UTC | #6
On Tue, Mar 27, 2018 at 02:57:12PM +0200, Harald van Dijk wrote:
>
> That's a good point and wouldn't have too much of an impact on performance
> of existing scripts. BTW, that means both expandmeta()'s metachars variable,
> and expmeta()'s
> 
>   if (metaflag == 0 || lstat64(expdir, &statb) >= 0)
> 
> optimisation. You already got rid of the latter in your patch to prevent
> buffer overflows.

No the purpose of this metaflag check is to bypass the lstat call.
My patch simply made it bypass the call by returning earlier.

It is not relevant to whether we include backslash as a metacharacter.

> In regular /d\ev that doesn't come from a variable, the backslash will have
> been replaced by CTLESC, but it's not necessary to treat CTLESC as a
> metacharacter: in that case, either there's a match and pathname expansion
> produces /dev, or there's no match and quote removal produces /dev, so the
> optimisation is still valid. It'd only affect backslashes that come from a
> variable, so it's very limited in scope.

For the case where the backslash came from the parser, the CTLESC
would have already been removed by preglob so there should be no
difference.

The only problem with the metacharacter approach is that glibc's
glob(3) probably won't treat it as a metacharacter and therefore
it will still behave in the same way as the current bash.

Cheers,
Harald van Dijk March 27, 2018, 3:54 p.m. UTC | #7
On 27/03/2018 17:14, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 02:57:12PM +0200, Harald van Dijk wrote:
>>
>> That's a good point and wouldn't have too much of an impact on performance
>> of existing scripts. BTW, that means both expandmeta()'s metachars variable,
>> and expmeta()'s
>>
>>    if (metaflag == 0 || lstat64(expdir, &statb) >= 0)
>>
>> optimisation. You already got rid of the latter in your patch to prevent
>> buffer overflows.
> 
> No the purpose of this metaflag check is to bypass the lstat call.
> My patch simply made it bypass the call by returning earlier.

Oh, right.

> It is not relevant to whether we include backslash as a metacharacter.

I was thinking about not making backslashes set metaflag in expmeta(): 
when the pathname component doesn't include *, ?, or [, but does include 
backslashes, then the if (metaflag == 0) block could handle that as long 
as it performs the lstat64() check unconditionally. There's no need to 
go through the opendir()/readdir()/closedir() path for that case. Since 
expmeta() is bypassed for words not containing any potentially-magic 
characters, the impact might be small enough.

Regardless of whether metaflag is set, it would mean things like 'set 
"["' would start hitting the FS unnecessarily, if I understand 
correctly: the preglob()-expanded pattern is '\[', and expmeta() can no 
longer tell this apart from $v where v='\[' where a FS check would be 
required.

Perhaps preglob() should just be avoided, and expmeta() taught to 
respect both '\\' and CTLESC. '\\' would be a metacharacter and require 
a FS hit, CTLESC wouldn't require it. This would also avoid some memory 
allocations.

>> In regular /d\ev that doesn't come from a variable, the backslash will have
>> been replaced by CTLESC, but it's not necessary to treat CTLESC as a
>> metacharacter: in that case, either there's a match and pathname expansion
>> produces /dev, or there's no match and quote removal produces /dev, so the
>> optimisation is still valid. It'd only affect backslashes that come from a
>> variable, so it's very limited in scope.
> 
> For the case where the backslash came from the parser, the CTLESC
> would have already been removed by preglob so there should be no
> difference.

     if (!strpbrk(str->text, metachars))
         goto nometa;

in expandmeta() is done before preglob() is called. Here, it is still 
not necessary to include CTLESC.

> The only problem with the metacharacter approach is that glibc's
> glob(3) probably won't treat it as a metacharacter and therefore
> it will still behave in the same way as the current bash.

This doesn't matter much yet, but yeah, it will if you decide to enable 
--enable-glob by default in the future. As long as you don't include 
GLOB_NOESCAPE, then normally, it should be taken as escaping the next 
character, but it might still trigger GLOB_NOMAGIC to return early. If 
it does, I'd take that as a glibc bug, but that doesn't help dash.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 4:04 p.m. UTC | #8
On Tue, Mar 27, 2018 at 05:54:53PM +0200, Harald van Dijk wrote:
>
> I was thinking about not making backslashes set metaflag in expmeta(): when
> the pathname component doesn't include *, ?, or [, but does include
> backslashes, then the if (metaflag == 0) block could handle that as long as
> it performs the lstat64() check unconditionally. There's no need to go
> through the opendir()/readdir()/closedir() path for that case. Since
> expmeta() is bypassed for words not containing any potentially-magic
> characters, the impact might be small enough.

Honestly such backslashes should be rare enough that I wouldn't
bother with such an optimisation.

> Regardless of whether metaflag is set, it would mean things like 'set "["'
> would start hitting the FS unnecessarily, if I understand correctly: the
> preglob()-expanded pattern is '\[', and expmeta() can no longer tell this
> apart from $v where v='\[' where a FS check would be required.

No it shouldn't.  We only mark [ is meta if there is a matching ].

> Perhaps preglob() should just be avoided, and expmeta() taught to respect
> both '\\' and CTLESC. '\\' would be a metacharacter and require a FS hit,
> CTLESC wouldn't require it. This would also avoid some memory allocations.

We need preglob for glob(3).  I want to minimise the amount of code
difference between the glob path and the expmeta path.

>     if (!strpbrk(str->text, metachars))
>         goto nometa;
> 
> in expandmeta() is done before preglob() is called. Here, it is still not
> necessary to include CTLESC.

We could move it after preglob.

> This doesn't matter much yet, but yeah, it will if you decide to enable
> --enable-glob by default in the future. As long as you don't include
> GLOB_NOESCAPE, then normally, it should be taken as escaping the next
> character, but it might still trigger GLOB_NOMAGIC to return early. If it
> does, I'd take that as a glibc bug, but that doesn't help dash.

Well at least glibc glob(3) actually works now after 20 years of
trying :)

Cheers,
Harald van Dijk March 27, 2018, 4:48 p.m. UTC | #9
On 27/03/2018 18:04, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 05:54:53PM +0200, Harald van Dijk wrote:
>>
>> I was thinking about not making backslashes set metaflag in expmeta(): when
>> the pathname component doesn't include *, ?, or [, but does include
>> backslashes, then the if (metaflag == 0) block could handle that as long as
>> it performs the lstat64() check unconditionally. There's no need to go
>> through the opendir()/readdir()/closedir() path for that case. Since
>> expmeta() is bypassed for words not containing any potentially-magic
>> characters, the impact might be small enough.
> 
> Honestly such backslashes should be rare enough that I wouldn't
> bother with such an optimisation.

Backslashes coming from parameters, sure, but backslashes introduced by 
preglob(), I'm not so sure.

>> Regardless of whether metaflag is set, it would mean things like 'set "["'
>> would start hitting the FS unnecessarily, if I understand correctly: the
>> preglob()-expanded pattern is '\[', and expmeta() can no longer tell this
>> apart from $v where v='\[' where a FS check would be required.
> 
> No it shouldn't.  We only mark [ is meta if there is a matching ].

The [ character would mark the whole string possibly-meta in 
expandmeta(), and the CTLESC wouldn't be seen there. Then, preglob() 
turns it into \[, and expmeta() wouldn't take [ as a meta character, but 
would take \ as a meta character.

>> Perhaps preglob() should just be avoided, and expmeta() taught to respect
>> both '\\' and CTLESC. '\\' would be a metacharacter and require a FS hit,
>> CTLESC wouldn't require it. This would also avoid some memory allocations.
> 
> We need preglob for glob(3).  I want to minimise the amount of code
> difference between the glob path and the expmeta path.

Then, how about moving some of expmeta()'s checks to return early 
outside of it, so that they can also be done in the glob() case?

>>      if (!strpbrk(str->text, metachars))
>>          goto nometa;
>>
>> in expandmeta() is done before preglob() is called. Here, it is still not
>> necessary to include CTLESC.
> 
> We could move it after preglob.

Assuming backslash is added to metachars, which was needed anyway:

Regardless of whether the check is done before or after preglob(), it 
would never skip expmeta() when it shouldn't. If it's kept before 
preglob(), then it will correctly skip expmeta() in more cases than if 
it's moved after it, so I don't think there's any benefit in moving it.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 5:02 p.m. UTC | #10
On Tue, Mar 27, 2018 at 06:48:45PM +0200, Harald van Dijk wrote:
> 
> Backslashes coming from parameters, sure, but backslashes introduced by
> preglob(), I'm not so sure.

Right.  I guess we could change it so that CTLESC is preserved
to distinguish between the backslashes from parameter expansion
vs. backslashes from open input.

> The [ character would mark the whole string possibly-meta in expandmeta(),
> and the CTLESC wouldn't be seen there. Then, preglob() turns it into \[, and
> expmeta() wouldn't take [ as a meta character, but would take \ as a meta
> character.

Oh I though you were talking about test/[, I didn't see the set.
But why would someone do set "["?

> Then, how about moving some of expmeta()'s checks to return early outside of
> it, so that they can also be done in the glob() case?

We could.

Cheers,
Harald van Dijk March 27, 2018, 6:01 p.m. UTC | #11
On 27/03/2018 19:02, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 06:48:45PM +0200, Harald van Dijk wrote:
>>
>> Backslashes coming from parameters, sure, but backslashes introduced by
>> preglob(), I'm not so sure.
> 
> Right.  I guess we could change it so that CTLESC is preserved
> to distinguish between the backslashes from parameter expansion
> vs. backslashes from open input.

Thinking about it some more, see below.

>> The [ character would mark the whole string possibly-meta in expandmeta(),
>> and the CTLESC wouldn't be seen there. Then, preglob() turns it into \[, and
>> expmeta() wouldn't take [ as a meta character, but would take \ as a meta
>> character.
> 
> Oh I though you were talking about test/[, I didn't see the set.
> But why would someone do set "["?

It was just the most basic command I could think of that shouldn't hit 
the FS, currently doesn't hit the FS, and would start hitting the FS if 
backslash gets taken as a metacharacter. Anything containing a quoted 
metacharacter would do. I suppose I could have used echo "[  ok  ]" 
instead for something that's more likely to pop up in a real script.

>> Then, how about moving some of expmeta()'s checks to return early outside of
>> it, so that they can also be done in the glob() case?
> 
> We could.

expandmeta() is implemented twice, once for the glob() case, once for 
the expmeta() case. There is not much code shared between them except 
for preglob(). preglob(), through _rmescapes(), already has to go over 
the whole string and check each character. If that can be made to return 
an indication somehow of whether it has seen metacharacters, that could 
be used for both cases. Perhaps an additional flag that, when set, lets 
it return NULL (and freeing the allocated string) if no metacharacters 
were seen? That means expmeta() doesn't need to learn about CTLESC, 
means there's no need to go over the string a second time in the glob() 
case to really turn CTLESC into \, and means GLOB_NOMAGIC becomes 
unnecessary and can be removed as a workaround for glibc bugs.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 27, 2018, 6:24 p.m. UTC | #12
On Tue, Mar 27, 2018 at 08:01:10PM +0200, Harald van Dijk wrote:
>
> >Right.  I guess we could change it so that CTLESC is preserved
> >to distinguish between the backslashes from parameter expansion
> >vs. backslashes from open input.
> 
> Thinking about it some more, see below.

Actually let's not go down this route because this would mean
that the glob(3) path will never have the same functionality
since it cannot possibly see CTLESC.

> It was just the most basic command I could think of that shouldn't hit the
> FS, currently doesn't hit the FS, and would start hitting the FS if
> backslash gets taken as a metacharacter. Anything containing a quoted
> metacharacter would do. I suppose I could have used echo "[  ok  ]" instead
> for something that's more likely to pop up in a real script.

My inclination is to just drop the /d\ev issue and use the bash
behaviour until such a time that bash changes or POSIX changes.

It's such an unlikely case as why would anyone knowingly put
backslash into a variable and expecting pathname expansion to
remove it without actually using real magic characters.

Cheers,
Harald van Dijk March 27, 2018, 6:38 p.m. UTC | #13
On 27/03/2018 20:24, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 08:01:10PM +0200, Harald van Dijk wrote:
>>
>>> Right.  I guess we could change it so that CTLESC is preserved
>>> to distinguish between the backslashes from parameter expansion
>>> vs. backslashes from open input.
>>
>> Thinking about it some more, see below.
> 
> Actually let's not go down this route because this would mean
> that the glob(3) path will never have the same functionality
> since it cannot possibly see CTLESC.

Yes, that's why I ended up proposing what I put below, which does allow 
them the same functionality. Neither glob() nor expmeta() would see 
CTLESC, and neither would need to, since neither would need to optimise 
for the no-metachars case.

>> It was just the most basic command I could think of that shouldn't hit the
>> FS, currently doesn't hit the FS, and would start hitting the FS if
>> backslash gets taken as a metacharacter. Anything containing a quoted
>> metacharacter would do. I suppose I could have used echo "[  ok  ]" instead
>> for something that's more likely to pop up in a real script.
> 
> My inclination is to just drop the /d\ev issue and use the bash
> behaviour until such a time that bash changes or POSIX changes.

What would need to change in POSIX to convince you to change dash to 
match what you were already arguing is the POSIX-mandated behaviour? :)

More serious response: I do think it's worth raising this as a bash bug 
too and seeing what they do.

> It's such an unlikely case as why would anyone knowingly put
> backslash into a variable and expecting pathname expansion to
> remove it without actually using real magic characters.

That's true. The least unlikely scenario I can think of is a directory 
name containing [ but not ]. A script may end up escaping the [ in that 
case just as a precaution, even though it's not strictly needed, and 
glob() may return early due to GLOB_NOMAGIC picking up on the fact that 
[ is not magic in that context.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jilles Tjoelker March 27, 2018, 9:32 p.m. UTC | #14
On Tue, Mar 27, 2018 at 08:19:19PM +0800, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 01:07:21PM +0200, Harald van Dijk wrote:

> > If POSIX specifies a result, and a shell applies an optimisation that causes
> > a different result to be produced, doesn't that inherently disallow that
> > optimisation? There may be some confusion and/or disagreement over what
> > exactly POSIX specifies and/or intends to specify, but I don't see how you
> > can argue that POSIX specifies result A, but it's okay to apply an
> > optimisation that produces result B.

> Wait, you're talking about something completely different.  The
> crux of the issue is whether POSIX allows the backslash to escape
> the character after it or does it have to be a literal backslash.

> In fact POSIX is perfectly clear on this:

> 	2.13.1 Patterns Matching a Single Character

> 	The following patterns matching a single character shall match a
> 	single character: ordinary characters, special pattern characters,
> 	and pattern bracket expressions. The pattern bracket expression
> 	also shall match a single collating element. A <backslash>
> 	character shall escape the following character. The escaping
> 	<backslash> shall be discarded. If a pattern ends with an
> 	unescaped <backslash>, it is unspecified whether the pattern
> 	does not match anything or the pattern is treated as invalid.

> Nowhere does it say that backslashes that come from the result of
> parameter expansion is always literal.

> So it's clear that any shell that treats the backslash as a literal
> backslash is already broken.

I don't think it is clear at all. Note the final paragraph of 2.13.1:

] When pattern matching is used where shell quote removal is not
] performed (such as in the argument to the find -name primary when find
] is being called using one of the exec functions as defined in the
] System Interfaces volume of POSIX.1-2008, or in the pattern argument
] to the fnmatch() function), special characters can be escaped to
] remove their special meaning by preceding them with a <backslash>
] character. This escaping <backslash> is discarded. The sequence "\\"
] represents one literal <backslash>. All of the requirements and
] effects of quoting on ordinary, shell special, and special pattern
] characters shall apply to escaping in this context.

This implies that special characters cannot be escaped using a backslash
in a context where shell quote removal is performed. Instead, special
characters can be escaped using shell quoting. As a result, the simplest
form of parameter expansion has either all or none of the generated
characters quoted (depending on whether the expansion is in
double-quotes or not).

There is also a sentence "The shell special characters always require
quoting." which seems untrue in practice if the characters come from an
expansion. Something like

  touch 'a&b'
  sh -c 'w=*\&*; printf "%s\n" $w'

works for many shells as sh. However, this could be explained away as
undefined behaviour.

Furthermore, POSIX generally intends to specify the behaviour of the
Bourne shell and ksh88. If the Bourne shell and ksh88 both agree, and do
not match an interpretation of POSIX, then it is likely that the
interpretation is wrong or the standard text has a bug.

> > Can you show me any shell other than bash that lets this
> > optimisation affect the results?

> The fact is that the other shells are not doing what they do because
> they're not doing this optimisation.  They do what they do because
> they have broken POSIX because they always treat backslashes that
> arise from parameter expansion as literal backslashes.

Let's use clear terminology: if it affects behaviour, it is by
definition not an optimization. It is a special case for words not
containing special pattern characters (for some value of "special
pattern characters").
Harald van Dijk March 27, 2018, 10:09 p.m. UTC | #15
On 3/27/18 11:32 PM, Jilles Tjoelker wrote:
> I don't think it is clear at all. Note the final paragraph of 2.13.1:
> 
> ] When pattern matching is used where shell quote removal is not
> ] performed [...]
> 
> This implies that special characters cannot be escaped using a backslash
> in a context where shell quote removal is performed.

Taken literally, it just says something about what happens when shell 
quote removal is not performed. In the cases we're talking about, shell 
quote removal is performed, so this simply wouldn't apply. Perhaps 
that's taking it more literally than intended, I don't know.

>                                                      Instead, special
> characters can be escaped using shell quoting. As a result, the simplest
> form of parameter expansion has either all or none of the generated
> characters quoted (depending on whether the expansion is in
> double-quotes or not).
> 
> There is also a sentence "The shell special characters always require
> quoting." which seems untrue in practice if the characters come from an
> expansion. Something like
> 
>    touch 'a&b'
>    sh -c 'w=*\&*; printf "%s\n" $w'
> 
> works for many shells as sh. However, this could be explained away as
> undefined behaviour.

This is what allows extensions to glob syntax, if those extensions use 
shell special characters.

   p="*(ab)"; case abab in $p) echo match ;; esac

This prints "match" in ksh93.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Harald van Dijk March 27, 2018, 10:19 p.m. UTC | #16
On 3/27/18 10:55 AM, Herbert Xu wrote:
> So going back to dash yes I think we should adopt the bash behaviour
> for pathname expansion and keep the existing case semantics.
> 
> This patch does exactly that.  Note that this patch does not work
> unless you have already applied
> 
> 	https://patchwork.kernel.org/patch/10306507/
> 
> because otherwise the optimisation mentioned above does not get
> detected correctly and we will end up doing quote removal twice.

This introduces a buffer overread. When expmeta() sees a backslash, it 
assumes it can just skip the next character, assuming the next character 
is not a forward slash. By treating expanded backslashes as unquoted, it 
becomes possible for the next character to be the terminating '\0'.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 28, 2018, 6:18 a.m. UTC | #17
On Tue, Mar 27, 2018 at 08:38:01PM +0200, Harald van Dijk wrote:
>
> >My inclination is to just drop the /d\ev issue and use the bash
> >behaviour until such a time that bash changes or POSIX changes.
> 
> What would need to change in POSIX to convince you to change dash to match
> what you were already arguing is the POSIX-mandated behaviour? :)

No the passage I quoted only mandates that backslashes be
interpreted when doing the matching.  It doesn't say anything
about whether it should be removed after matching is done.  The
only thing it does say is that:

	If the pattern does not match any existing filenames or pathnames,
	the pattern string shall be left unchanged.

IOW it's silent in the case where the pattern does match.

But I do concede that it makes a lot more sense to remove the
backslash in the matching case.

> That's true. The least unlikely scenario I can think of is a directory name
> containing [ but not ]. A script may end up escaping the [ in that case just
> as a precaution, even though it's not strictly needed, and glob() may return
> early due to GLOB_NOMAGIC picking up on the fact that [ is not magic in that
> context.

I agree.

However, the other reason this is not worth fixing is because all
those other shells that always treat the backslash as a literal
won't remove it in this case anyway.  So nor sensible script could
possibly use such a feature even if we implemented it.

$ ksh -c 'a="/usr/bin/\["; printf "%s\n" $a'
/usr/bin/\[
$

If you really wanted to do this in a script, the portable way is

$ ksh -c 'a="/usr/bin/[[]"; printf "%s\n" $a'
/usr/bin/[
$

Cheers,
Herbert Xu March 28, 2018, 6:19 a.m. UTC | #18
On Tue, Mar 27, 2018 at 11:32:36PM +0200, Jilles Tjoelker wrote:
>
> This implies that special characters cannot be escaped using a backslash
> in a context where shell quote removal is performed. Instead, special

My argument would be that in the context of parameter expansion
quote removal is not done on the portion of the word that results
from parameter expansion.

Cheers,
Herbert Xu March 28, 2018, 6:23 a.m. UTC | #19
On Wed, Mar 28, 2018 at 12:19:17AM +0200, Harald van Dijk wrote:
> 
> This introduces a buffer overread. When expmeta() sees a backslash, it
> assumes it can just skip the next character, assuming the next character is
> not a forward slash. By treating expanded backslashes as unquoted, it
> becomes possible for the next character to be the terminating '\0'.

This code has always had to deal with naked backslashes.  Can you
show me the exact pattern that results in the overread?

Thanks,
Harald van Dijk March 28, 2018, 6:44 a.m. UTC | #20
On 28/03/2018 08:23, Herbert Xu wrote:
> On Wed, Mar 28, 2018 at 12:19:17AM +0200, Harald van Dijk wrote:
>>
>> This introduces a buffer overread. When expmeta() sees a backslash, it
>> assumes it can just skip the next character, assuming the next character is
>> not a forward slash. By treating expanded backslashes as unquoted, it
>> becomes possible for the next character to be the terminating '\0'.
> 
> This code has always had to deal with naked backslashes.  Can you
> show me the exact pattern that results in the overread?

No, it hasn't, because expmeta() is not used in case patterns, and case 
patterns are currently the only case where naked backslashes can appear. 
In contexts where pathname expansion is performed, a backslash coming 
from a variable will be escaped by another backslash in currently 
released dash versions.

Test case:

   $v='*\'
   set -- $v

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Harald van Dijk March 28, 2018, 6:52 a.m. UTC | #21
On 28/03/2018 08:18, Herbert Xu wrote:
> On Tue, Mar 27, 2018 at 08:38:01PM +0200, Harald van Dijk wrote:
>>
>>> My inclination is to just drop the /d\ev issue and use the bash
>>> behaviour until such a time that bash changes or POSIX changes.
>>
>> What would need to change in POSIX to convince you to change dash to match
>> what you were already arguing is the POSIX-mandated behaviour? :)
> 
> No the passage I quoted only mandates that backslashes be
> interpreted when doing the matching.  It doesn't say anything
> about whether it should be removed after matching is done.  The
> only thing it does say is that:
> 
> 	If the pattern does not match any existing filenames or pathnames,
> 	the pattern string shall be left unchanged.
> 
> IOW it's silent in the case where the pattern does match.

Of course it doesn't say whether characters in the pattern are preserved 
in the case of a match. That's because when there's a match, pathname 
expansion produces the file name, not the original pattern.

In a directory containing foo.txt, you wouldn't argue that *.txt might 
expand to *foo.txt just because POSIX doesn't explicitly say the * is 
removed, would you?

> However, the other reason this is not worth fixing is because all
> those other shells that always treat the backslash as a literal
> won't remove it in this case anyway.

I seriously cannot understand the logic of pushing a break from 
traditional ash behaviour and from all shells apart from bash, giving 
POSIX as a reason for doing it, and then giving the behaviour of all 
those other shells as a reason for not doing it properly.

If all those other shells are a reason against doing it, then why isn't 
that a reason for just restoring the dash 0.5.4 behaviour that all those 
other shells implement, always taking expanded backslash as effectively 
quoted?

>                                       So nor sensible script could
> possibly use such a feature even if we implemented it.

No portable script could use such a feature. A sensible script certainly 
could. There are a lot of scripts that aren't meant to be portable 
across shells. I know I've written scripts for my own use that use dash 
features that don't work the same in bash.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 28, 2018, 7:31 a.m. UTC | #22
On Wed, Mar 28, 2018 at 08:52:57AM +0200, Harald van Dijk wrote:
> 
> I seriously cannot understand the logic of pushing a break from traditional
> ash behaviour and from all shells apart from bash, giving POSIX as a reason
> for doing it, and then giving the behaviour of all those other shells as a
> reason for not doing it properly.

It's logical for dash because this aligns the behaviour of pathname
expansion with case pattern matching.

Cheers,
Harald van Dijk March 28, 2018, 7:38 a.m. UTC | #23
On 28/03/2018 09:31, Herbert Xu wrote:
> On Wed, Mar 28, 2018 at 08:52:57AM +0200, Harald van Dijk wrote:
>>
>> I seriously cannot understand the logic of pushing a break from traditional
>> ash behaviour and from all shells apart from bash, giving POSIX as a reason
>> for doing it, and then giving the behaviour of all those other shells as a
>> reason for not doing it properly.
> 
> It's logical for dash because this aligns the behaviour of pathname
> expansion with case pattern matching.

Except it doesn't actually do that. It does in some cases, and not in 
others. You've made it clear that you don't care about the cases where 
it doesn't. That can be a valid decision, but it's one you should 
acknowledge.

Also, it's just as logical to restore the case pattern matching to its 
traditional behaviour to align it with pathname expansion.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 28, 2018, 8:16 a.m. UTC | #24
On Wed, Mar 28, 2018 at 09:38:04AM +0200, Harald van Dijk wrote:
>
> Also, it's just as logical to restore the case pattern matching to its
> traditional behaviour to align it with pathname expansion.

No I think that makes no sense and clearly contradicts POSIX.

Cheers,
Harald van Dijk March 28, 2018, 9:06 a.m. UTC | #25
On 28/03/2018 10:16, Herbert Xu wrote:
> On Wed, Mar 28, 2018 at 09:38:04AM +0200, Harald van Dijk wrote:
>>
>> Also, it's just as logical to restore the case pattern matching to its
>> traditional behaviour to align it with pathname expansion.
> 
> No I think that makes no sense and clearly contradicts POSIX.

That's not clear at all. As I already wrote earlier and Jilles Tjoelker 
chimed in with as well, if there isn't a single shell out there that 
actually implements what POSIX specifies (bash gets close, but isn't 
fully there either), not even the shells on which the POSIX 
specification was based, there's a fair chance it's a POSIX defect, 
regardless of whether it makes sense.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 28, 2018, 9:52 a.m. UTC | #26
On Wed, Mar 28, 2018 at 08:44:28AM +0200, Harald van Dijk wrote:
>
> Test case:
> 
>   $v='*\'
>   set -- $v

I don't see how this would cause an overrun, can you please explain
it for me?

Thanks,
Harald van Dijk March 28, 2018, 10:03 a.m. UTC | #27
On 28/03/2018 11:52, Herbert Xu wrote:
> On Wed, Mar 28, 2018 at 08:44:28AM +0200, Harald van Dijk wrote:
>>
>> Test case:
>>
>>    $v='*\'
>>    set -- $v
> 
> I don't see how this would cause an overrun, can you please explain
> it for me?

Line numbers are from 0.5.9.1.

When expanded backslashes are no longer treated as quoted, this would 
call expmeta() with the pattern *\, that is with a single unquoted 
trailing backslash, so:

expand.c:1333

                         if (*p == '\\')
                                 esc++;
                         if (p[esc] == '/') {

The first if statement will be hit and set esc to 1. p[esc] is then 
'\0', so the second if block doesn't get entered and the outer loop 
continues:

expand.c:1315

         for (p = name; esc = 0, *p; p += esc + 1) {

p += esc + 1 will increase p by 2, letting it point just past the 
terminating '\0'. The loop condition of *p now accesses the byte just 
past the pattern.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/src/expand.c b/src/expand.c
index 705fef7..07830eb 100644
--- a/src/expand.c
+++ b/src/expand.c
@@ -848,8 +848,7 @@  memtodest(const char *p, size_t len, const char *syntax, int quotes) {
 		if (c) {
 			if ((quotes & QUOTES_ESC) &&
 			    ((syntax[c] == CCTL) ||
-			     (((quotes & EXP_FULL) || syntax != BASESYNTAX) &&
-			      syntax[c] == CBACK)))
+			     (syntax != BASESYNTAX && syntax[c] == CBACK)))
 				USTPUTC(CTLESC, q);
 		} else if (!(quotes & QUOTES_KEEPNUL))
 			continue;