diff mbox series

[bpf,v6,1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

Message ID 470ffc3c76414443fc359b884080a5394dcccec3.1605560917.git.dxu@dxuuu.xyz (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series Fix bpf_probe_read_user_str() overcopying | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning CHECK: spaces preferred around that '+' (ctx:VxV)
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Daniel Xu Nov. 16, 2020, 9:17 p.m. UTC
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
Based on on-list discussion and some off-list discussion with Alexei,
I'd like to propose the v4-style patch without the `(*out & ~mask)`
bit.

We can't really zero out the rest of the buffer due to ABI issues.
The bpf docs state for bpf_probe_read_user_str():

> In case the string length is smaller than *size*, the target is not
> padded with further NUL bytes.

 lib/strncpy_from_user.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Linus Torvalds Nov. 16, 2020, 10:15 p.m. UTC | #1
On Mon, Nov 16, 2020 at 1:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Based on on-list discussion and some off-list discussion with Alexei,
> I'd like to propose the v4-style patch without the `(*out & ~mask)`
> bit.

So I've verified that at least on x86-64, this doesn't really make
code generation any worse, and I'm ok with the patch from that
standpoint.

However, this was not what the discussion actually amended at as far
as I'm concerned.

I mentioned that if BPF cares about the bytes past the end of the
string, I want a *BIG COMMENT* about it. Yes, in strncpy_from_user()
itself, but even more in the place that cares.

And no, that does not mean bpf_probe_read_user_str().  That function
clearly doesn't care at all, and doesn't access anything past the end
of the string. I want a comment in whatever code that accesses past
the end of the string.

And your ABI point is actively misleading:

> We can't really zero out the rest of the buffer due to ABI issues.
> The bpf docs state for bpf_probe_read_user_str():
>
> > In case the string length is smaller than *size*, the target is not
> > padded with further NUL bytes.

This comment is actively wrong and misleading.

The code (after the patch) clearly *does* pad a bit with "further NUL
bytes". It's just that it doesn't pad all the way to the end.

Where is the actual buffer zeroing done?

Because without the buffer zeroing, this whole patch is completely
pointless. Which is why I want that comment, and I want a pointer to
where that zeroing is done.

Really. You have two cases:

 (a) the buffer isn't zeroed before the strncpy_from_user()

 (b) the buffer is guaranteed to be zero before that

and in case (a), this patch is pointless, since the data after the
string is already undefined.

And in case (b), I want to see a comment and a pointer to the code
that actually does the zeroing.

HOWEVER. Look at bpf_probe_read_user_str_common(), and notice how it
ALREADY does the zeroing of the buffer past the end, it's just that it
only does it in the error case.

Why do you send this patch, instead of

 (a) get rid of the pointless pre-zeroing

 (b) change bpf_probe_read_user_str_common() to do

        int ret;
        u32 offset;

        ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
        offset = ret < 0 ? 0 : ret;
        memset(dst+offset, 0, size-offset);
        return ret;

which seems to be much simpler anyway. The comment you quote about
"target is not padded with further NUL bytes" is clearly wrong anyway,
since that error case *does* pad the target with NUL bytes, and always
has.

So honestly, in this whole discussion, it seems rather clear to me
that the bug has always been in bpf, not in strncpy_from_user(). The
ABI comment you quote is clearly not true, and I can point to that
existing bpf_probe_read_user_str_common() code itself:

        ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
        if (unlikely(ret < 0))
                memset(dst, 0, size);

as to why that is.

But guys, as mentioned, I'm willing to apply this patch, but only if
you add some actually *correct* comments about the odd bpf use of this
string, and point to where the pre-zeroing is done.

               Linus
Linus Torvalds Nov. 16, 2020, 10:44 p.m. UTC | #2
On Mon, Nov 16, 2020 at 2:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I've verified that at least on x86-64, this doesn't really make
> code generation any worse, and I'm ok with the patch from that
> standpoint.

.. looking closer, it will generate extra code on big-endian
architectures and on alpha, because of the added "zero_bytemask()".
But on the usual LE machines, zero_bytemask() will already be the same
as "mask", so all it adds is that "and" operation with values it
already had access to.

I don't think anybody cares about alpha and BE - traditional BE
architectures have moved to LE anyway. And looking at the alpha
word-at-a-time code, I don't even understand how it works at all.

Adding matt/rth/ivan to the cc, just so that maybe one of them can
educate me on how that odd alpha zero_bytemask() could possibly work.
The "2ul << .." part confuses me, I think it should be "1ul << ...".

I get the feeling that the alpha "2ul" constant might have come from
the tile version, but in the tile version, the " __builtin_ctzl()"
counts the leading zeroes to the top bit of any bytes in 'mask'. But
the alpha version actually uses "find_zero(mask) * 8", so rather than
have values of 7/15/23/... (for zero byte in byte 0/1/2/..
respectively), it has values 0/8/16/....

But it's entirely possible that I'm completely confused, and alpha
does it right, and I'm just not understanding the code.

It's also possible that the "2ul" vs "1ul" case doesn't matter.
because the extra bit is always going to mask the byte that is
actually zero, so being one bit off in the result is a non-event. I
think that is what may actually be going on.

                Linus
Daniel Xu Nov. 17, 2020, 4:53 a.m. UTC | #3
Hi Linus,

On Mon, Nov 16, 2020 at 02:15:52PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 1:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > Based on on-list discussion and some off-list discussion with Alexei,
> > I'd like to propose the v4-style patch without the `(*out & ~mask)`
> > bit.
> 
> So I've verified that at least on x86-64, this doesn't really make
> code generation any worse, and I'm ok with the patch from that
> standpoint.
> 
> However, this was not what the discussion actually amended at as far
> as I'm concerned.
> 
> I mentioned that if BPF cares about the bytes past the end of the
> string, I want a *BIG COMMENT* about it. Yes, in strncpy_from_user()
> itself, but even more in the place that cares.

Sure, sorry. Will send another version with comments.

> And no, that does not mean bpf_probe_read_user_str().  That function
> clearly doesn't care at all, and doesn't access anything past the end
> of the string. I want a comment in whatever code that accesses past
> the end of the string.

If I'm reading things right, that place is technically in
kernel/bpf/hashtab.c:alloc_htab_elem. In line:

    memcpy(l_new->key, key, key_size);

where `key_size` is the width of the hashtab key. The flow looks like:

    <bpf prog code>
      bpf_map_update_elem()
        htab_map_update_elem()
          alloc_htab_elem()

It feels a bit weird to me to add a comment about strings in there
because the hash table code is string-agnostic, as mentioned in the
commit message.

> And your ABI point is actively misleading:
> 
> > We can't really zero out the rest of the buffer due to ABI issues.
> > The bpf docs state for bpf_probe_read_user_str():
> >
> > > In case the string length is smaller than *size*, the target is not
> > > padded with further NUL bytes.
> 
> This comment is actively wrong and misleading.
> 
> The code (after the patch) clearly *does* pad a bit with "further NUL
> bytes". It's just that it doesn't pad all the way to the end.

Right, it's a bit ugly and perhaps misleading. But it fixes the real
problem of keying a map with potentially random memory that
strncpy_from_user() might append on. If we pad a deterministic number of
zeros it should be ok.

> Where is the actual buffer zeroing done?

Usually the bpf prog does it. I believe (could be wrong) the verifier
enforces the key is initialized in some form.

For my specific use case, it's done in the bpftrace code:
https://github.com/iovisor/bpftrace/blob/0c88a1a4711a3d13e8c9475f7d08f83a5018fdfd/src/ast/codegen_llvm.cpp#L529-L530

> Because without the buffer zeroing, this whole patch is completely
> pointless. Which is why I want that comment, and I want a pointer to
> where that zeroing is done.
> 
> Really. You have two cases:
> 
>  (a) the buffer isn't zeroed before the strncpy_from_user()
> 
>  (b) the buffer is guaranteed to be zero before that
> 
> and in case (a), this patch is pointless, since the data after the
> string is already undefined.

See above -- I think the verifier enforces that the data is initialized.

> And in case (b), I want to see a comment and a pointer to the code
> that actually does the zeroing.

See above also.

> HOWEVER. Look at bpf_probe_read_user_str_common(), and notice how it
> ALREADY does the zeroing of the buffer past the end, it's just that it
> only does it in the error case.

I don't think the error case is very relevant here. I normally wouldn't
make any assumptions about the state of a buffer after a failed function
call.

> Why do you send this patch, instead of
> 
>  (a) get rid of the pointless pre-zeroing
> 
>  (b) change bpf_probe_read_user_str_common() to do
> 
>         int ret;
>         u32 offset;
> 
>         ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
>         offset = ret < 0 ? 0 : ret;
>         memset(dst+offset, 0, size-offset);
>         return ret;
> 
> which seems to be much simpler anyway. The comment you quote about
> "target is not padded with further NUL bytes" is clearly wrong anyway,
> since that error case *does* pad the target with NUL bytes, and always
> has.

I think on the bpf side there's the same concern about performance.
You can't really dynamically size buffers in bpf land so users usually
use a larger buffer than necessary, sometimes on the order of KBs. So if
we unnecessarily zero out the rest of the buffer it could cause perf
regressions.

[...]

Thanks,
Daniel
Ivan Kokshaysky Nov. 17, 2020, 6:22 p.m. UTC | #4
On Mon, Nov 16, 2020 at 02:44:56PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 2:15 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I've verified that at least on x86-64, this doesn't really make
> > code generation any worse, and I'm ok with the patch from that
> > standpoint.
> 
> .. looking closer, it will generate extra code on big-endian
> architectures and on alpha, because of the added "zero_bytemask()".
> But on the usual LE machines, zero_bytemask() will already be the same
> as "mask", so all it adds is that "and" operation with values it
> already had access to.
> 
> I don't think anybody cares about alpha and BE - traditional BE
> architectures have moved to LE anyway. And looking at the alpha
> word-at-a-time code, I don't even understand how it works at all.
> 
> Adding matt/rth/ivan to the cc, just so that maybe one of them can
> educate me on how that odd alpha zero_bytemask() could possibly work.
> The "2ul << .." part confuses me, I think it should be "1ul << ...".
> 
> I get the feeling that the alpha "2ul" constant might have come from
> the tile version, but in the tile version, the " __builtin_ctzl()"
> counts the leading zeroes to the top bit of any bytes in 'mask'. But
> the alpha version actually uses "find_zero(mask) * 8", so rather than
> have values of 7/15/23/... (for zero byte in byte 0/1/2/..
> respectively), it has values 0/8/16/....
> 
> But it's entirely possible that I'm completely confused, and alpha
> does it right, and I'm just not understanding the code.

No, you are right, it should be "1ul". Indeed, seems like it came from
the tile version which looks incorrect either, BTW. The tile-gx ISA
(https://studylib.net/doc/18755547/tile-gx-instruction-set-architecture)
says that clz/ctz instructions count up to the first "1", not to the
last "0", so the shift values in tile's zero_bytemask() are 0/8/16/...
as well.

> It's also possible that the "2ul" vs "1ul" case doesn't matter.
> because the extra bit is always going to mask the byte that is
> actually zero, so being one bit off in the result is a non-event. I
> think that is what may actually be going on.

Yes, looks like that.

Ivan.
diff mbox series

Patch

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..de084f04e50d 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,21 @@  static inline long do_strncpy_from_user(char *dst, const char __user *src,
 		goto byte_at_a_time;
 
 	while (max >= sizeof(unsigned long)) {
-		unsigned long c, data;
+		unsigned long c, data, mask;
 
 		/* Fall back to byte-at-a-time if we get a page fault */
 		unsafe_get_user(c, (unsigned long __user *)(src+res), byte_at_a_time);
 
-		*(unsigned long *)(dst+res) = c;
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
+			mask = zero_bytemask(data);
+			*(unsigned long *)(dst+res) = c & mask;
 			return res + find_zero(data);
 		}
+
+		*(unsigned long *)(dst+res) = c;
+
 		res += sizeof(unsigned long);
 		max -= sizeof(unsigned long);
 	}