| Message ID | 470ffc3c76414443fc359b884080a5394dcccec3.1605560917.git.dxu@dxuuu.xyz (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to: | BPF |
| Series | Fix bpf_probe_read_user_str() overcopying |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | Link |
| netdev/fixes_present | success | Link |
| netdev/patch_count | success | Link |
| netdev/tree_selection | success | Clearly marked for bpf |
| netdev/subject_prefix | success | Link |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | Link |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 1 this patch: 1 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | Link |
| netdev/checkpatch | warning | CHECK: spaces preferred around that '+' (ctx:VxV) |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1 |
| netdev/header_inline | success | Link |
| netdev/stable | success | Stable not CCed |
On Mon, Nov 16, 2020 at 1:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Based on on-list discussion and some off-list discussion with Alexei,
> I'd like to propose the v4-style patch without the `(*out & ~mask)`
> bit.

So I've verified that at least on x86-64, this doesn't really make code generation any worse, and I'm ok with the patch from that standpoint.

However, this was not what the discussion actually ended at as far as I'm concerned.

I mentioned that if BPF cares about the bytes past the end of the string, I want a *BIG COMMENT* about it. Yes, in strncpy_from_user() itself, but even more in the place that cares.

And no, that does not mean bpf_probe_read_user_str(). That function clearly doesn't care at all, and doesn't access anything past the end of the string. I want a comment in whatever code that accesses past the end of the string.

And your ABI point is actively misleading:

> We can't really zero out the rest of the buffer due to ABI issues.
> The bpf docs state for bpf_probe_read_user_str():
>
> > In case the string length is smaller than *size*, the target is not
> > padded with further NUL bytes.

This comment is actively wrong and misleading. The code (after the patch) clearly *does* pad a bit with "further NUL bytes". It's just that it doesn't pad all the way to the end.

Where is the actual buffer zeroing done? Because without the buffer zeroing, this whole patch is completely pointless. Which is why I want that comment, and I want a pointer to where that zeroing is done.

Really. You have two cases:

 (a) the buffer isn't zeroed before the strncpy_from_user()

 (b) the buffer is guaranteed to be zero before that

and in case (a), this patch is pointless, since the data after the string is already undefined. And in case (b), I want to see a comment and a pointer to the code that actually does the zeroing.

HOWEVER.
Look at bpf_probe_read_user_str_common(), and notice how it ALREADY does the zeroing of the buffer past the end, it's just that it only does it in the error case.

Why do you send this patch, instead of

 (a) get rid of the pointless pre-zeroing

 (b) change bpf_probe_read_user_str_common() to do

        int ret;
        u32 offset;

        ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
        offset = ret < 0 ? 0 : ret;
        memset(dst+offset, 0, size-offset);
        return ret;

which seems to be much simpler anyway. The comment you quote about "target is not padded with further NUL bytes" is clearly wrong anyway, since that error case *does* pad the target with NUL bytes, and always has.

So honestly, in this whole discussion, it seems rather clear to me that the bug has always been in bpf, not in strncpy_from_user(). The ABI comment you quote is clearly not true, and I can point to that existing bpf_probe_read_user_str_common() code itself:

        ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
        if (unlikely(ret < 0))
                memset(dst, 0, size);

as to why that is.

But guys, as mentioned, I'm willing to apply this patch, but only if you add some actually *correct* comments about the odd bpf use of this string, and point to where the pre-zeroing is done.

            Linus
On Mon, Nov 16, 2020 at 2:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So I've verified that at least on x86-64, this doesn't really make
> code generation any worse, and I'm ok with the patch from that
> standpoint.

.. looking closer, it will generate extra code on big-endian architectures and on alpha, because of the added "zero_bytemask()". But on the usual LE machines, zero_bytemask() will already be the same as "mask", so all it adds is that "and" operation with values it already had access to.

I don't think anybody cares about alpha and BE - traditional BE architectures have moved to LE anyway. And looking at the alpha word-at-a-time code, I don't even understand how it works at all.

Adding matt/rth/ivan to the cc, just so that maybe one of them can educate me on how that odd alpha zero_bytemask() could possibly work. The "2ul << .." part confuses me, I think it should be "1ul << ...".

I get the feeling that the alpha "2ul" constant might have come from the tile version, but in the tile version, the "__builtin_ctzl()" counts the leading zeroes to the top bit of any bytes in 'mask'. But the alpha version actually uses "find_zero(mask) * 8", so rather than have values of 7/15/23/... (for zero byte in byte 0/1/2/... respectively), it has values 0/8/16/...

But it's entirely possible that I'm completely confused, and alpha does it right, and I'm just not understanding the code.

It's also possible that the "2ul" vs "1ul" case doesn't matter, because the extra bit is always going to mask the byte that is actually zero, so being one bit off in the result is a non-event. I think that is what may actually be going on.

            Linus
Hi Linus,

On Mon, Nov 16, 2020 at 02:15:52PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 1:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > Based on on-list discussion and some off-list discussion with Alexei,
> > I'd like to propose the v4-style patch without the `(*out & ~mask)`
> > bit.
>
> So I've verified that at least on x86-64, this doesn't really make
> code generation any worse, and I'm ok with the patch from that
> standpoint.
>
> However, this was not what the discussion actually ended at as far
> as I'm concerned.
>
> I mentioned that if BPF cares about the bytes past the end of the
> string, I want a *BIG COMMENT* about it. Yes, in strncpy_from_user()
> itself, but even more in the place that cares.

Sure, sorry. Will send another version with comments.

> And no, that does not mean bpf_probe_read_user_str(). That function
> clearly doesn't care at all, and doesn't access anything past the end
> of the string. I want a comment in whatever code that accesses past
> the end of the string.

If I'm reading things right, that place is technically in kernel/bpf/hashtab.c:alloc_htab_elem. In the line:

        memcpy(l_new->key, key, key_size);

where `key_size` is the width of the hashtab key. The flow looks like:

        <bpf prog code>
        bpf_map_update_elem()
          htab_map_update_elem()
            alloc_htab_elem()

It feels a bit weird to me to add a comment about strings in there because the hash table code is string-agnostic, as mentioned in the commit message.

> And your ABI point is actively misleading:
>
> > We can't really zero out the rest of the buffer due to ABI issues.
> > The bpf docs state for bpf_probe_read_user_str():
> >
> > > In case the string length is smaller than *size*, the target is not
> > > padded with further NUL bytes.
>
> This comment is actively wrong and misleading.
>
> The code (after the patch) clearly *does* pad a bit with "further NUL
> bytes". It's just that it doesn't pad all the way to the end.

Right, it's a bit ugly and perhaps misleading.
But it fixes the real problem of keying a map with potentially random memory that strncpy_from_user() might append. If we pad a deterministic number of zeros it should be ok.

> Where is the actual buffer zeroing done?

Usually the bpf prog does it. I believe (could be wrong) the verifier enforces that the key is initialized in some form. For my specific use case, it's done in the bpftrace code:

https://github.com/iovisor/bpftrace/blob/0c88a1a4711a3d13e8c9475f7d08f83a5018fdfd/src/ast/codegen_llvm.cpp#L529-L530

> Because without the buffer zeroing, this whole patch is completely
> pointless. Which is why I want that comment, and I want a pointer to
> where that zeroing is done.
>
> Really. You have two cases:
>
> (a) the buffer isn't zeroed before the strncpy_from_user()
>
> (b) the buffer is guaranteed to be zero before that
>
> and in case (a), this patch is pointless, since the data after the
> string is already undefined.

See above -- I think the verifier enforces that the data is initialized.

> And in case (b), I want to see a comment and a pointer to the code
> that actually does the zeroing.

See above also.

> HOWEVER. Look at bpf_probe_read_user_str_common(), and notice how it
> ALREADY does the zeroing of the buffer past the end, it's just that it
> only does it in the error case.

I don't think the error case is very relevant here. I normally wouldn't make any assumptions about the state of a buffer after a failed function call.

> Why do you send this patch, instead of
>
> (a) get rid of the pointless pre-zeroing
>
> (b) change bpf_probe_read_user_str_common() to do
>
>         int ret;
>         u32 offset;
>
>         ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
>         offset = ret < 0 ? 0 : ret;
>         memset(dst+offset, 0, size-offset);
>         return ret;
>
> which seems to be much simpler anyway. The comment you quote about
> "target is not padded with further NUL bytes" is clearly wrong anyway,
> since that error case *does* pad the target with NUL bytes, and always
> has.
I think on the bpf side there's the same concern about performance. You can't really dynamically size buffers in bpf land, so users usually use a larger buffer than necessary, sometimes on the order of KBs. So if we unnecessarily zero out the rest of the buffer it could cause perf regressions.

[...]

Thanks,
Daniel
On Mon, Nov 16, 2020 at 02:44:56PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 2:15 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I've verified that at least on x86-64, this doesn't really make
> > code generation any worse, and I'm ok with the patch from that
> > standpoint.
>
> .. looking closer, it will generate extra code on big-endian
> architectures and on alpha, because of the added "zero_bytemask()".
> But on the usual LE machines, zero_bytemask() will already be the same
> as "mask", so all it adds is that "and" operation with values it
> already had access to.
>
> I don't think anybody cares about alpha and BE - traditional BE
> architectures have moved to LE anyway. And looking at the alpha
> word-at-a-time code, I don't even understand how it works at all.
>
> Adding matt/rth/ivan to the cc, just so that maybe one of them can
> educate me on how that odd alpha zero_bytemask() could possibly work.
> The "2ul << .." part confuses me, I think it should be "1ul << ...".
>
> I get the feeling that the alpha "2ul" constant might have come from
> the tile version, but in the tile version, the "__builtin_ctzl()"
> counts the leading zeroes to the top bit of any bytes in 'mask'. But
> the alpha version actually uses "find_zero(mask) * 8", so rather than
> have values of 7/15/23/... (for zero byte in byte 0/1/2/...
> respectively), it has values 0/8/16/...
>
> But it's entirely possible that I'm completely confused, and alpha
> does it right, and I'm just not understanding the code.

No, you are right, it should be "1ul". Indeed, it seems to have come from the tile version, which looks incorrect as well, BTW. The tile-gx ISA (https://studylib.net/doc/18755547/tile-gx-instruction-set-architecture) says that clz/ctz instructions count up to the first "1", not to the last "0", so the shift values in tile's zero_bytemask() are 0/8/16/... as well.

> It's also possible that the "2ul" vs "1ul" case doesn't matter.
> because the extra bit is always going to mask the byte that is
> actually zero, so being one bit off in the result is a non-event. I
> think that is what may actually be going on.

Yes, looks like that.

Ivan.
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..de084f04e50d 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,21 @@ static inline long do_strncpy_from_user(char *dst, const char __user *src,
 		goto byte_at_a_time;
 
 	while (max >= sizeof(unsigned long)) {
-		unsigned long c, data;
+		unsigned long c, data, mask;
 
 		/* Fall back to byte-at-a-time if we get a page fault */
 		unsafe_get_user(c, (unsigned long __user *)(src+res), byte_at_a_time);
 
-		*(unsigned long *)(dst+res) = c;
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
+			mask = zero_bytemask(data);
+			*(unsigned long *)(dst+res) = c & mask;
 			return res + find_zero(data);
 		}
+
+		*(unsigned long *)(dst+res) = c;
+
 		res += sizeof(unsigned long);
 		max -= sizeof(unsigned long);
 	}