Message ID | 9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [GIT,PULL] Block fixes for 6.3-rc3 | expand |
On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ) Christ indeed. But I think you meant "Chris". > Side note - when doing the usual allmodconfig builds with gcc-12 and > clang before sending them out, for the latter I see this warning being > spewed with clang-15: > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc() > > Obviously not related to my changes, but mentioning it in case it has > been missed as I know you love squeaky clean builds :-). Doesn't happen > with clang-14. Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc, and only my own "normal config" builds with clang. So I don't see this particular issue and my builds are still squeaky clean. That said, when I explicitly try that allmodconfig thing with clang, I can see it too. And the reason seems to be something we've seen before: UBSAN functions being considered non-return by clang, so clang generates code like this: .... .LBB24_3: callq __sanitizer_cov_trace_pc@PLT movl $2, %esi movq $.L__unnamed_3, %rdi callq __ubsan_handle_out_of_bounds .Lfunc_end24: .size m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt ie the last thing in that m5mols_set_fmt() function is a call to __ubsan_handle_out_of_bounds, and then it "falls through" to the next function. And yes, I absolutely *detest* how clang does that. Not only does it cause objtool sanity checking issues, it fundamentally means that we can never treat UBSAN warnings as warnings. They are always fatal. This is a *huge* clang mis-feature, and I forget what we decided last that we saw it. But I suspect we need to disable UBSAN for clang, because clang gets this so *horribly* wrong. Fatal errors that cannot be recovered from are not something that the compiler is supposed to decide on. It's exactly the same issue as BUG() calls: it just results in a dead machine, and in the process the actual problem easily gets lost (because maybe this only happens while running X, and no serial console, and no way to actually see what the UBSAN warning was as a result). I really really detest this thing, and I think this is a fatal flaw, and means that as-is, UBSAN really *has* to be disabled for clang kernel builds. Maybe that will make somebody wake up and smell the roses, and stop this idiotic "undefined behavior is fatal" garbage. Nick? Do you remember what the fix was last time? Linus
On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I really really detest this thing, and I think this is a fatal flaw, > and means that as-is, UBSAN really *has* to be disabled for clang > kernel builds. Maybe that will make somebody wake up and smell the > roses, and stop this idiotic "undefined behavior is fatal" garbage. Btw, this is not at all kernel-centric, even if the kernel may be special in not necessarily being able to log things on fatal issues. Imagine you are a database vendor, and you cannot replicate something that a user sees, and the user is not willing to give you their database, and you worry that it might be some undefined thing. Are you going to ship the user a test-build that might die in the middle, and thus be unable to actually show the behavior in any half-way production setting? Would any sane user run that pile of garbage on their machines, knowing that any error would be fatal and result in their production database being broken? So any compiler person that says "undefined behavior can do anything at all according to the standard, so we're compliant with that" is an incompetent person, and shouldn't be let near a real compiler. Debug support *MUST* have an option to just continue. Sure, make "this is fatal" be an option *too*, because if you are a developer, maybe you really want the "fail hard, fail quickly" behavior. But that fatal option cannot be the *only* choice, or we cannot use said debug code in the kernel. Linus
The pull request you sent on Fri, 17 Mar 2023 11:16:02 -0600:
> git://git.kernel.dk/linux.git tags/block-6.3-2023-03-16
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8d3c682a5e3d9dfc2448ecbb22f4cd48359b9e21
Thank you!
+Sanitizer folks (BCC'd) (Top of lore thread: https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/) On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > > > - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ) > > Christ indeed. > > But I think you meant "Chris". > > > Side note - when doing the usual allmodconfig builds with gcc-12 and > > clang before sending them out, for the latter I see this warning being > > spewed with clang-15: > > > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc() > > > > Obviously not related to my changes, but mentioning it in case it has > > been missed as I know you love squeaky clean builds :-). Doesn't happen > > with clang-14. > > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc, > and only my own "normal config" builds with clang. > > So I don't see this particular issue and my builds are still squeaky clean. > > That said, when I explicitly try that allmodconfig thing with clang, I > can see it too. And the reason seems to be something we've seen > before: UBSAN functions being considered non-return by clang, so clang > generates code like this: > > .... > .LBB24_3: > callq __sanitizer_cov_trace_pc@PLT > movl $2, %esi > movq $.L__unnamed_3, %rdi > callq __ubsan_handle_out_of_bounds > .Lfunc_end24: > .size m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt > > ie the last thing in that m5mols_set_fmt() function is a call to > __ubsan_handle_out_of_bounds, and then it "falls through" to the next > function. > > And yes, I absolutely *detest* how clang does that. Not only does it > cause objtool sanity checking issues, it fundamentally means that we > can never treat UBSAN warnings as warnings. They are always fatal. > > This is a *huge* clang mis-feature, and I forget what we decided last > that we saw it. > > But I suspect we need to disable UBSAN for clang, because clang gets > this so *horribly* wrong. > > Fatal errors that cannot be recovered from are not something that the > compiler is supposed to decide on. It's exactly the same issue as > BUG() calls: it just results in a dead machine, and in the process the > actual problem easily gets lost (because maybe this only happens while > running X, and no serial console, and no way to actually see what the > UBSAN warning was as a result). > > I really really detest this thing, and I think this is a fatal flaw, > and means that as-is, UBSAN really *has* to be disabled for clang > kernel builds. Maybe that will make somebody wake up and smell the > roses, and stop this idiotic "undefined behavior is fatal" garbage. > > Nick? Do you remember what the fix was last time? > > Linus
On 3/17/23 12:35?PM, Linus Torvalds wrote: > On Fri, Mar 17, 2023 at 10:16?AM Jens Axboe <axboe@kernel.dk> wrote: >> >> - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ) > > Christ indeed. > > But I think you meant "Chris". Oops yes indeed, good catch. Though I'd love to think that if we did have a resurrection of that nature, blk-mq would be high on the list of things to hack on. >> Side note - when doing the usual allmodconfig builds with gcc-12 and >> clang before sending them out, for the latter I see this warning being >> spewed with clang-15: >> >> drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc() >> >> Obviously not related to my changes, but mentioning it in case it has >> been missed as I know you love squeaky clean builds :-). Doesn't happen >> with clang-14. > > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc, > and only my own "normal config" builds with clang. I just do both to avoid anything odd, it's just a few min each. > So I don't see this particular issue and my builds are still squeaky clean. > > That said, when I explicitly try that allmodconfig thing with clang, I > can see it too. And the reason seems to be something we've seen > before: UBSAN functions being considered non-return by clang, so clang > generates code like this: > > .... > .LBB24_3: > callq __sanitizer_cov_trace_pc@PLT > movl $2, %esi > movq $.L__unnamed_3, %rdi > callq __ubsan_handle_out_of_bounds > .Lfunc_end24: > .size m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt > > ie the last thing in that m5mols_set_fmt() function is a call to > __ubsan_handle_out_of_bounds, and then it "falls through" to the next > function. > > And yes, I absolutely *detest* how clang does that. Not only does it > cause objtool sanity checking issues, it fundamentally means that we > can never treat UBSAN warnings as warnings. They are always fatal. > > This is a *huge* clang mis-feature, and I forget what we decided last > that we saw it. Gotcha, thanks for digging into that. I must admit I didn't look any further at it, but figured it was worth reporting...
On Fri, Mar 17, 2023 at 11:59:11AM -0700, Nick Desaulniers wrote: > +Sanitizer folks (BCC'd) > (Top of lore thread: > https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/) > > On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > > Side note - when doing the usual allmodconfig builds with gcc-12 and > > > clang before sending them out, for the latter I see this warning being > > > spewed with clang-15: > > > > > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc() > > > > > > Obviously not related to my changes, but mentioning it in case it has > > > been missed as I know you love squeaky clean builds :-). Doesn't happen > > > with clang-14. > > > > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc, > > and only my own "normal config" builds with clang. > > > > So I don't see this particular issue and my builds are still squeaky clean. > > > > That said, when I explicitly try that allmodconfig thing with clang, I > > can see it too. And the reason seems to be something we've seen > > before: UBSAN functions being considered non-return by clang, so clang > > generates code like this: > > > > .... > > .LBB24_3: > > callq __sanitizer_cov_trace_pc@PLT > > movl $2, %esi > > movq $.L__unnamed_3, %rdi > > callq __ubsan_handle_out_of_bounds > > .Lfunc_end24: > > .size m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt > > > > ie the last thing in that m5mols_set_fmt() function is a call to > > __ubsan_handle_out_of_bounds, and then it "falls through" to the next > > function. > > > > And yes, I absolutely *detest* how clang does that. Not only does it > > cause objtool sanity checking issues, it fundamentally means that we > > can never treat UBSAN warnings as warnings. They are always fatal. I've hit these cases a few times too. The __ubsan_handle_* stuff is designed to be recoverable. I think there are some cases where we're tripping over Clang bugs, though. Some of the past issues have been with Clang thinking some UBSAN feature was trap-only (e.g. -fsanitizer=local-bounds), but here it actually generated the call, but decided it was no-return. *sigh* > > But I suspect we need to disable UBSAN for clang, because clang gets > > this so *horribly* wrong. Which is to say, it normally gets it right, but there are some instances where things go weird. If it was horribly wrong, there would be a LOT more objtool warnings. :) I'm not opposed to disabling UBSAN for all*config builds if we need to, but I want to get these Clang bugs found and fixed so I'd be sad to lose the coverage.
On Fri, Mar 17, 2023 at 12:50 PM Kees Cook <keescook@chromium.org> wrote: > > I've hit these cases a few times too. The __ubsan_handle_* stuff > is designed to be recoverable. I think there are some cases where > we're tripping over Clang bugs, though. Some of the past issues > have been with Clang thinking some UBSAN feature was trap-only > (e.g. -fsanitizer=local-bounds), but here it actually generated the call, > but decided it was no-return. *sigh* Hmm. The problem here is that we only get an objdump warning by pure luck. This isn't a "objdump will always catch it", it's more like "objdump will only catch it if the code happens to be moved to the end of the function". So I'm very happy to hear that it's not by design, and that's very good. But the whole "this is a compiler bug" doesn't exactly give me the warm and fuzzies either. The code generation looks *really* odd to me. This is, as far as I can tell, the code sequence that this code has: m5mols_set_fmt: .. movq %rdx, %rbp .. movl 16(%rbp), %ebx movq %rbx, %rdi .. cmpq $16385, %rbx # imm = 0x4001 jne .LBB24_1 .. .LBB24_1: cmpl $8199, %ebx # imm = 0x2007 jne .LBB24_3 .. .LBB24_3: callq __sanitizer_cov_trace_pc@PLT movl $2, %esi movq $.L__unnamed_3, %rdi callq __ubsan_handle_out_of_bounds and as far as I can tell, that "movl 16(%rbp), %ebx" is enum m5mols_restype stype = __find_restype(mf->code); in __find_resolution(). But I don't understand those odd constants it's comparing against at all. 'enum m5mols_restype' should be in the range 0..2, but it seems to use a 64-bit compare against '16385' (and then a 32-bit compare against '8199') to decide it's out-of-bounds. I must be *completely* mis-reading this, because none of that makes a whit of sense to me. It's possible that clang is just *so* terminally confused that it just generates random code, but it's more likely that I'm mis-reading things. The above is my interpretation of what I see with the current F37 clang version for a "allmodconfig" build of that file. My 'clang -v' says clang version 15.0.7 (Fedora 15.0.7-1.fc37) in case some clang person wants to try to recreate this. > I'm not opposed to disabling UBSAN for all*config builds if we need to, > but I want to get these Clang bugs found and fixed so I'd be sad to lose > the coverage. I wonder how many people actually end up having UBSAN actually enabled. I get the feeling that most of the coverage it gets is exactly just the compile-only coverage by "allmodconfig" and friends. Otherwise it would be really easy to just say depends on !COMPILE_TEST for all these run-time debug things. But exactly *because* they have caused endless amounts of build-time pain - *and* because I doubt how much real run-time testing they get - limiting it to non-build tests sounds a bit counter-productive. Linus
On Fri, Mar 17, 2023 at 7:50 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Debug support *MUST* have an option to just continue. Sure, make "this > is fatal" be an option *too*, because if you are a developer, maybe > you really want the "fail hard, fail quickly" behavior. At least for userspace it has different modes -- the default is to report and continue: -fsanitize=...: print a verbose error report and continue execution (default); -fno-sanitize-recover=...: print a verbose error report and exit the program; -fsanitize-trap=...: execute a trap instruction (doesn’t require UBSan run-time support). -fno-sanitize=...: disable any check, e.g., -fno-sanitize=alignment. (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#usage) For the kernel I assume it is meant to work due to `UBSAN_TRAP`, but the optimizer may be getting in the way. From a quick look, it seems that `__find_restype` (which gets inlined into `m5mols_set_fmt` via `__find_resolution`) has a small loop: do { if (code == m5mols_default_ffmt[type].code) return type; } while (type++ != SIZE_DEFAULT_FFMT); that I guess gets simplified into something like: if (code == first code) yield 0; if (code == second code) yield 1; out of bounds; When UBSAN is disabled, it may be assuming: if (code == first code) yield 0; yield 1; Thus no issue. Though even with UBSAN disabled, if I add a dummy opaque call inside the loop, it goes back to something like before, except the jump goes nowhere and then `objtool` complains again: .LBB24_20: callq dummydummydummy .Lfunc_end24: So it is reproducible even without UBSAN getting involved: https://godbolt.org/z/hTe91b3eG Cheers, Miguel
On Fri, Mar 17, 2023 at 12:50 PM Kees Cook <keescook@chromium.org> wrote: > > On Fri, Mar 17, 2023 at 11:59:11AM -0700, Nick Desaulniers wrote: > > +Sanitizer folks (BCC'd) > > (Top of lore thread: > > https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/) > > > > On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > > > Side note - when doing the usual allmodconfig builds with gcc-12 and > > > > clang before sending them out, for the latter I see this warning being > > > > spewed with clang-15: > > > > > > > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc() > > > > > > > > Obviously not related to my changes, but mentioning it in case it has > > > > been missed as I know you love squeaky clean builds :-). Doesn't happen > > > > with clang-14. > > > > > > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc, > > > and only my own "normal config" builds with clang. > > > > > > So I don't see this particular issue and my builds are still squeaky clean. > > > > > > That said, when I explicitly try that allmodconfig thing with clang, I > > > can see it too. And the reason seems to be something we've seen > > > before: UBSAN functions being considered non-return by clang, so clang > > > generates code like this: > > > > > > .... > > > .LBB24_3: > > > callq __sanitizer_cov_trace_pc@PLT > > > movl $2, %esi > > > movq $.L__unnamed_3, %rdi > > > callq __ubsan_handle_out_of_bounds > > > .Lfunc_end24: > > > .size m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt > > > > > > ie the last thing in that m5mols_set_fmt() function is a call to > > > __ubsan_handle_out_of_bounds, and then it "falls through" to the next > > > function. > > > > > > And yes, I absolutely *detest* how clang does that. Not only does it > > > cause objtool sanity checking issues, it fundamentally means that we > > > can never treat UBSAN warnings as warnings. They are always fatal. > > I've hit these cases a few times too. The __ubsan_handle_* stuff > is designed to be recoverable. I think there are some cases where > we're tripping over Clang bugs, though. Some of the past issues > have been with Clang thinking some UBSAN feature was trap-only > (e.g. -fsanitizer=local-bounds), but here it actually generated the call, > but decided it was no-return. *sigh* I think no-return is a red herring (or rather, I don't think noreturn is at play here at all). Looking at the IR, I don't see anything that indicated anything was deduced to be noreturn. It looks like this is coming from the loop in __find_restype() being fully unrolled. On the initial iteration, `type` == `M5MOLS_RESTYPE_MONITOR` == 0. `m5mols_default_ffmt` is a 2 element array. If we don't match we loop again, `type` == `M5MOLS_RESTYPE_CAPTURE` == 1. `SIZE_DEFAULT_FFMT` == ARRAY_SIZE(m5mols_default_ffmt) == 2, so we loop again. `type` == 2, accessing m5mols_default_ffmt out of bounds. Perhaps this code meant to simply use a for loop rather than do-while? (Or pre-increment rather than post increment for the do-while)? ``` diff --git a/drivers/media/i2c/m5mols/m5mols_core.c b/drivers/media/i2c/m5mols/m5mols_core.c index 2b01873ba0db..603b1036127e 100644 --- a/drivers/media/i2c/m5mols/m5mols_core.c +++ b/drivers/media/i2c/m5mols/m5mols_core.c @@ -485,10 +485,9 @@ static enum m5mols_restype __find_restype(u32 code) { enum m5mols_restype type = M5MOLS_RESTYPE_MONITOR; - do { + for (; type != M5MOLS_RESTYPE_MAX; ++type) if (code == m5mols_default_ffmt[type].code) return type; - } while (type++ != SIZE_DEFAULT_FFMT); return 0; } ``` > > > > But I suspect we need to disable UBSAN for clang, because clang gets > > > this so *horribly* wrong. I think Linus is thinking about, commit e5d523f1ae8f ("ubsan: disable UBSAN_DIV_ZERO for clang") I can see the loop unroller inserting a branch on "poison" which is a magic value in LLVM related to but distinct from "undef". That gets replaced with an "unreachable" instruction. I wonder if we can change loop unroller to not insert branch on poison when the sanitizers are enabled, or freeze poison. https://llvm.org/devmtg/2020-09/slides/Lee-UndefPoison.pdf Maybe Linus has thoughts he can share on this thread: https://github.com/llvm/llvm-project/issues/56289? Finally, there's always `-mllvm -trap-unreachable` though that's a last resort kind of thing; `-mllvm` flags need to be passed to the linker for LTO, and compiler for non-LTO and LTO. I do think in this case that the fallthough is bringing to our attention an issue in the source. > > Which is to say, it normally gets it right, but there are some instances > where things go weird. If it was horribly wrong, there would be a LOT > more objtool warnings. :) > > I'm not opposed to disabling UBSAN for all*config builds if we need to, > but I want to get these Clang bugs found and fixed so I'd be sad to lose > the coverage. > > -- > Kees Cook -- Thanks, ~Nick Desaulniers
On Fri, Mar 17, 2023 at 9:31 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > 'enum m5mols_restype' should be in the range 0..2, but it seems to use > a 64-bit compare against '16385' (and then a 32-bit compare against > '8199') to decide it's out-of-bounds. It is comparing against just the `.code` in the `m5mols_default_ffmt` table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and `MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see https://elixir.bootlin.com/linux/latest/source/drivers/media/i2c/m5mols/m5mols_core.c#L52. So it is basically: if (code == m5mols_default_ffmt[0].code) return type; if (type == 2) continue; ++type; if (code == m5mols_default_ffmt[1].code) return type; if (type == 2) continue; ++type; m5mols_default_ffmt[2].code -> out of bounds UB -> unreachable If the condition had `++type` instead, it would not be a problem, because the loop stops before we go into the out of bounds access thus no UB. Cheers, Miguel
On Fri, Mar 17, 2023 at 1:31 PM Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> wrote: > > From a quick look, it seems that `__find_restype` (which gets inlined > into `m5mols_set_fmt` via `__find_resolution`) has a small loop: > > do { > if (code == m5mols_default_ffmt[type].code) > return type; > } while (type++ != SIZE_DEFAULT_FFMT); > > that I guess gets simplified into something like: > > if (code == first code) > yield 0; > if (code == second code) > yield 1; > out of bounds; Ahh. That explains the odd constants I saw. I did figure out it had something to do with loading 'mf->code', but then it went off the rails. > Though even with UBSAN disabled, if I add a dummy opaque call inside > the loop, it goes back to something like before, except the jump goes > nowhere and then `objtool` complains again: > > .LBB24_20: > callq dummydummydummy > .Lfunc_end24: > > So it is reproducible even without UBSAN getting involved: > https://godbolt.org/z/hTe91b3eG That code generation is some crazy stuff. Yeah, that's not acceptable, and the bug is clearly not UBSAN, but some generic bogus clang thing. Much nicer to try to debug this as a "objtool complains about the generated code" rather than as some insane runtine problem with code falling off the end of the function. Linus
On Fri, Mar 17, 2023 at 1:42 PM Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> wrote: > > It is comparing against just the `.code` in the `m5mols_default_ffmt` > table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and > `MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see Yeah, I see what it's doing. But: > If the condition had `++type` instead, it would not be a problem, > because the loop stops before we go into the out of bounds access thus > no UB. Yeah, but clang really should have generated a proper third iteration, which calls that "out of bounds" case, and then returns, instead fo falling off the end. I do think that on the kernel side, the fix is to just change } while (type++ != SIZE_DEFAULT_FFMT); to } while (++type != SIZE_DEFAULT_FFMT); but I would *really* like clang to be fixed to not silently generate code that does insane things and would be basically impossible to debug if it ever triggers. We would have spent a *lot* of time wondering how the heck we Oopsed in m5mols_get_frame_desc(). Linus
On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > but I would *really* like clang to be fixed to not silently generate > code that does insane things and would be basically impossible to > debug if it ever triggers. Side note: the key word here is "silently". If clang notices that it generates crazy code, a warning at build-time would be preferable to the "oh, we noticed the crazy code generation because we do sanity checking that just happened to catch it". If that "fall-through due to impossible third iteration" had happened *inside* the function, rather than at the end, we'd have never noticed. So we were kind of lucky in just where things triggered this time. Maybe that's always going to be true due to some random internal clang logic ("dead code fallthrough is always at the end because of how we sort basic blocks"), but I don't see why it would be in general. Linus
On Fri, Mar 17, 2023 at 9:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yeah, I see what it's doing. Yeah, sorry, I saw your later email after sending that one. > Yeah, but clang really should have generated a proper third iteration, > which calls that "out of bounds" case, and then returns, instead fo > falling off the end. > > I do think that on the kernel side, the fix is to just change > > } while (type++ != SIZE_DEFAULT_FFMT); > > to > > } while (++type != SIZE_DEFAULT_FFMT); > > but I would *really* like clang to be fixed to not silently generate > code that does insane things and would be basically impossible to > debug if it ever triggers. Not sure how easy is for them to realize that they should do a 3rd iteration. But perhaps it would be possible that Clang/LLVM does a similar check to objtool and at least emit a warning about similar situations that would help developers diagnose this (since it should have way more information about what happened than objtool). Cheers, Miguel
On Fri, Mar 17, 2023 at 2:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > but I would *really* like clang to be fixed to not silently generate > > code that does insane things and would be basically impossible to > > debug if it ever triggers. > > Side note: the key word here is "silently". > > If clang notices that it generates crazy code, a warning at build-time > would be preferable to the "oh, we noticed the crazy code generation > because we do sanity checking that just happened to catch it". That's fair. I have something hacked up locally that can spot the fallthough from m5mols_set_fmt() as objtool did. With some polish, we can likely ship that as a compiler warning. Then we can have these checks regardless of objtool arch support. First I need to teach LLVM that __stack_chk_fail is noreturn, though I've only verified that thus far in glibc, musl, and bionic; I still need to check that's the case for the BSDs' libcs.
On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I do think that on the kernel side, the fix is to just change > > } while (type++ != SIZE_DEFAULT_FFMT); > > to > > } while (++type != SIZE_DEFAULT_FFMT); Ok, I ended up deciding to just commit that minimal change, even though it might have been better to just make it a normal for-loop (and use M5MOLS_RESTYPE_MAX instead as the end condition). So maybe it would be more legible (and less likely to have had that off-by-one) if the loop had been for (type = 0; type < M5MOLS_RESTYPE_MAX; type++) instead. But I'll leave that decision to the driver authors (now cc'd). For people brought in late, this is now commit efbcbb12ee99 ("media: m5mols: fix off-by-one loop termination error") with link to the discussion here https://lore.kernel.org/linux-block/CAHk-=wgTSdKYbmB1JYM5vmHMcD9J9UZr0mn7BOYM_LudrP+Xvw@mail.gmail.com/ so you can see the history of it (me having initially blamed UBSAN, but the problem can be triggered at least in theory without it). Linus
Em Fri, 17 Mar 2023 13:51:17 -0700 Linus Torvalds <torvalds@linux-foundation.org> escreveu: > On Fri, Mar 17, 2023 at 1:42 PM Miguel Ojeda > <miguel.ojeda.sandonis@gmail.com> wrote: > > > > It is comparing against just the `.code` in the `m5mols_default_ffmt` > > table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and > > `MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see > > Yeah, I see what it's doing. > > But: > > > If the condition had `++type` instead, it would not be a problem, > > because the loop stops before we go into the out of bounds access thus > > no UB. > > Yeah, but clang really should have generated a proper third iteration, > which calls that "out of bounds" case, and then returns, instead fo > falling off the end. > > I do think that on the kernel side, the fix is to just change > > } while (type++ != SIZE_DEFAULT_FFMT); > > to > > } while (++type != SIZE_DEFAULT_FFMT); Yeah, that seems to be the right fix to me too. Ack on such change: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=efbcbb12ee99f750c9f25c873b55ad774871de2a Regards, Mauro > > but I would *really* like clang to be fixed to not silently generate > code that does insane things and would be basically impossible to > debug if it ever triggers. > > We would have spent a *lot* of time wondering how the heck we Oopsed > in m5mols_get_frame_desc(). > > Linus > Thanks, Mauro