mbox series

[GIT,PULL] Block fixes for 6.3-rc3

Message ID 9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk (mailing list archive)
State New, archived
Headers show
Series [GIT,PULL] Block fixes for 6.3-rc3 | expand

Pull-request

git://git.kernel.dk/linux.git tags/block-6.3-2023-03-16

Message

Jens Axboe March 17, 2023, 5:16 p.m. UTC
Hi Linus,

Set of fixes for block that should go into the 6.3-rc3 release. A bit
bigger than usual, as the NVMe pull request missed last weeks
submission. In detail:

- NVMe pull request via Christoph:
	- Avoid potential UAF in nvmet_req_complete (Damien Le Moal)
	- More quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
	- Fix a memory leak in the nvme-pci probe teardown path
	  (Irvin Cote)
	- Repair the MAINTAINERS entry (Lukas Bulwahn)
	- Fix handling single range discard request (Ming Lei)
	- Show more opcode names in trace events (Minwoo Im)
	- Fix nvme-tcp timeout reporting (Sagi Grimberg)

- MD pull request via Song:
	- Two fixes for old issues (Neil)
	- Resource leak in device stopping (Xiao)

- Bio based device stats fix (Yu)

- Kill unused CONFIG_BLOCK_COMPAT (Lukas)

- sunvdc missing mdesc_grab() failure check (Liang)

- Fix for reversal of request ordering upon issue for certain cases
  (Jan)

- null_blk timeout fixes (Damien)

- Loop use-after-free fix (Bart)

- blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ)

Side note - when doing the usual allmodconfig builds with gcc-12 and
clang before sending them out, for the latter I see this warning being
spewed with clang-15:

drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()

Obviously not related to my changes, but mentioning it in case it has
been missed as I know you love squeaky clean builds :-). Doesn't happen
with clang-14.

Please pull!


The following changes since commit e2f2a39452c43b64ea3191642a2661cb8d03827a:

  block, bfq: fix uaf for 'stable_merge_bfqq' (2023-03-08 07:34:50 -0700)

are available in the Git repository at:

  git://git.kernel.dk/linux.git tags/block-6.3-2023-03-16

for you to fetch changes up to 8f0d196e4dc137470bbd5de98278d941c8002fcb:

  block: remove obsolete config BLOCK_COMPAT (2023-03-16 09:35:44 -0600)

----------------------------------------------------------------
block-6.3-2023-03-16

----------------------------------------------------------------
Bart Van Assche (1):
      loop: Fix use-after-free issues

Chris Leech (1):
      blk-mq: fix "bad unlock balance detected" on q->srcu in __blk_mq_run_dispatch_ops

Damien Le Moal (3):
      block: null_blk: Fix handling of fake timeout request
      block: null_blk: cleanup null_queue_rq()
      nvmet: avoid potential UAF in nvmet_req_complete()

Elmer Miroslav Mosher Golovin (1):
      nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000

Irvin Cote (1):
      nvme-pci: fixing memory leak in probe teardown path

Jan Kara (1):
      block: do not reverse request order when flushing plug list

Jens Axboe (2):
      Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.3
      Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3

Liang He (1):
      block: sunvdc: add check for mdesc_grab() returning NULL

Lukas Bulwahn (2):
      MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
      block: remove obsolete config BLOCK_COMPAT

Ming Lei (1):
      nvme: fix handling single range discard request

Minwoo Im (1):
      nvme-trace: show more opcode names

NeilBrown (2):
      md: avoid signed overflow in slot_store()
      md: select BLOCK_LEGACY_AUTOLOAD

Philipp Geulen (1):
      nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620

Sagi Grimberg (2):
      nvme-tcp: fix opcode reporting in the timeout handler
      nvme-tcp: add nvme-tcp pdu size build protection

Xiao Ni (1):
      md: Free resources in __md_stop

Yu Kuai (1):
      block: count 'ios' and 'sectors' when io is done for bio-based device

 MAINTAINERS                   |  8 ++++----
 block/Kconfig                 |  3 ---
 block/blk-core.c              | 16 ++++++----------
 block/blk-mq.c                |  5 +++--
 block/blk-mq.h                |  5 +++--
 drivers/block/loop.c          | 25 +++++++++++++++++--------
 drivers/block/null_blk/main.c | 31 +++++++++++++++----------------
 drivers/block/sunvdc.c        |  2 ++
 drivers/md/Kconfig            |  4 ++++
 drivers/md/dm.c               |  6 +++---
 drivers/md/md.c               | 17 ++++++++---------
 drivers/nvme/host/core.c      | 28 +++++++++++++++++++---------
 drivers/nvme/host/multipath.c |  8 ++++----
 drivers/nvme/host/pci.c       |  5 +++++
 drivers/nvme/host/tcp.c       | 33 +++++++++++++++++++++++++++------
 drivers/nvme/target/core.c    |  4 +++-
 include/linux/blk-mq.h        |  6 ++++++
 include/linux/blkdev.h        |  5 ++---
 include/linux/nvme.h          |  5 +++++
 19 files changed, 136 insertions(+), 80 deletions(-)

Comments

Linus Torvalds March 17, 2023, 6:35 p.m. UTC | #1
On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ)

Christ indeed.

But I think you meant "Chris".

> Side note - when doing the usual allmodconfig builds with gcc-12 and
> clang before sending them out, for the latter I see this warning being
> spewed with clang-15:
>
> drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()
>
> Obviously not related to my changes, but mentioning it in case it has
> been missed as I know you love squeaky clean builds :-). Doesn't happen
> with clang-14.

Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc,
and only my own "normal config" builds with clang.

So I don't see this particular issue and my builds are still squeaky clean.

That said, when I explicitly try that allmodconfig thing with clang, I
can see it too. And the reason seems to be something we've seen
before: UBSAN functions being considered non-return by clang, so clang
generates code like this:

   ....
.LBB24_3:
        callq   __sanitizer_cov_trace_pc@PLT
        movl    $2, %esi
        movq    $.L__unnamed_3, %rdi
        callq   __ubsan_handle_out_of_bounds
.Lfunc_end24:
        .size   m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt

ie the last thing in that m5mols_set_fmt() function is a call to
__ubsan_handle_out_of_bounds, and then it "falls through" to the next
function.

And yes, I absolutely *detest* how clang does that. Not only does it
cause objtool sanity checking issues, it fundamentally means that we
can never treat UBSAN warnings as warnings. They are always fatal.

This is a *huge* clang mis-feature, and I forget what we decided last
that we saw it.

But I suspect we need to disable UBSAN for clang, because clang gets
this so *horribly* wrong.

Fatal errors that cannot be recovered from are not something that the
compiler is supposed to decide on. It's exactly the same issue as
BUG() calls: it just results in a dead machine, and in the process the
actual problem easily gets lost (because maybe this only happens while
running X, and no serial console, and no way to actually see what the
UBSAN warning was as a result).

I really really detest this thing, and I think this is a fatal flaw,
and means that as-is, UBSAN really *has* to be disabled for clang
kernel builds. Maybe that will make somebody wake up and smell the
roses, and stop this idiotic "undefined behavior is fatal" garbage.

Nick? Do you remember what the fix was last time?

               Linus
Linus Torvalds March 17, 2023, 6:50 p.m. UTC | #2
On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I really really detest this thing, and I think this is a fatal flaw,
> and means that as-is, UBSAN really *has* to be disabled for clang
> kernel builds. Maybe that will make somebody wake up and smell the
> roses, and stop this idiotic "undefined behavior is fatal" garbage.

Btw, this is not at all kernel-centric, even if the kernel may be
special in not necessarily being able to log things on fatal issues.

Imagine you are a database vendor, and you cannot replicate something
that a user sees, and the user is not willing to give you their
database, and you worry that it might be some undefined thing.

Are you going to ship the user a test-build that might die in the
middle, and thus be unable to actually show the behavior in any
half-way production setting? Would any sane user run that pile of
garbage on their machines, knowing that any error would be fatal and
result in their production database being broken?

So any compiler person that says "undefined behavior can do anything
at all according to the standard, so we're compliant with that" is an
incompetent person, and shouldn't be let near a real compiler.

Debug support *MUST* have an option to just continue. Sure, make "this
is fatal" be an option *too*, because if you are a developer, maybe
you really want the "fail hard, fail quickly" behavior.

But that fatal option cannot be the *only* choice, or we cannot use
said debug code in the kernel.

             Linus
pr-tracker-bot@kernel.org March 17, 2023, 6:52 p.m. UTC | #3
The pull request you sent on Fri, 17 Mar 2023 11:16:02 -0600:

> git://git.kernel.dk/linux.git tags/block-6.3-2023-03-16

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8d3c682a5e3d9dfc2448ecbb22f4cd48359b9e21

Thank you!
Nick Desaulniers March 17, 2023, 6:59 p.m. UTC | #4
+Sanitizer folks (BCC'd)
(Top of lore thread:
https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/)

On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ)
>
> Christ indeed.
>
> But I think you meant "Chris".
>
> > Side note - when doing the usual allmodconfig builds with gcc-12 and
> > clang before sending them out, for the latter I see this warning being
> > spewed with clang-15:
> >
> > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()
> >
> > Obviously not related to my changes, but mentioning it in case it has
> > been missed as I know you love squeaky clean builds :-). Doesn't happen
> > with clang-14.
>
> Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc,
> and only my own "normal config" builds with clang.
>
> So I don't see this particular issue and my builds are still squeaky clean.
>
> That said, when I explicitly try that allmodconfig thing with clang, I
> can see it too. And the reason seems to be something we've seen
> before: UBSAN functions being considered non-return by clang, so clang
> generates code like this:
>
>    ....
> .LBB24_3:
>         callq   __sanitizer_cov_trace_pc@PLT
>         movl    $2, %esi
>         movq    $.L__unnamed_3, %rdi
>         callq   __ubsan_handle_out_of_bounds
> .Lfunc_end24:
>         .size   m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt
>
> ie the last thing in that m5mols_set_fmt() function is a call to
> __ubsan_handle_out_of_bounds, and then it "falls through" to the next
> function.
>
> And yes, I absolutely *detest* how clang does that. Not only does it
> cause objtool sanity checking issues, it fundamentally means that we
> can never treat UBSAN warnings as warnings. They are always fatal.
>
> This is a *huge* clang mis-feature, and I forget what we decided last
> that we saw it.
>
> But I suspect we need to disable UBSAN for clang, because clang gets
> this so *horribly* wrong.
>
> Fatal errors that cannot be recovered from are not something that the
> compiler is supposed to decide on. It's exactly the same issue as
> BUG() calls: it just results in a dead machine, and in the process the
> actual problem easily gets lost (because maybe this only happens while
> running X, and no serial console, and no way to actually see what the
> UBSAN warning was as a result).
>
> I really really detest this thing, and I think this is a fatal flaw,
> and means that as-is, UBSAN really *has* to be disabled for clang
> kernel builds. Maybe that will make somebody wake up and smell the
> roses, and stop this idiotic "undefined behavior is fatal" garbage.
>
> Nick? Do you remember what the fix was last time?
>
>                Linus
Jens Axboe March 17, 2023, 7:44 p.m. UTC | #5
On 3/17/23 12:35?PM, Linus Torvalds wrote:
> On Fri, Mar 17, 2023 at 10:16?AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> - blk-mq SRCU fix for BLK_MQ_F_BLOCKING devices (Christ)
> 
> Christ indeed.
> 
> But I think you meant "Chris".

Oops yes indeed, good catch. Though I'd love to think that if we did
have a resurrection of that nature, blk-mq would be high on the list of
things to hack on.

>> Side note - when doing the usual allmodconfig builds with gcc-12 and
>> clang before sending them out, for the latter I see this warning being
>> spewed with clang-15:
>>
>> drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()
>>
>> Obviously not related to my changes, but mentioning it in case it has
>> been missed as I know you love squeaky clean builds :-). Doesn't happen
>> with clang-14.
> 
> Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc,
> and only my own "normal config" builds with clang.

I just do both to avoid anything odd, it's just a few min each.

> So I don't see this particular issue and my builds are still squeaky clean.
> 
> That said, when I explicitly try that allmodconfig thing with clang, I
> can see it too. And the reason seems to be something we've seen
> before: UBSAN functions being considered non-return by clang, so clang
> generates code like this:
> 
>    ....
> .LBB24_3:
>         callq   __sanitizer_cov_trace_pc@PLT
>         movl    $2, %esi
>         movq    $.L__unnamed_3, %rdi
>         callq   __ubsan_handle_out_of_bounds
> .Lfunc_end24:
>         .size   m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt
> 
> ie the last thing in that m5mols_set_fmt() function is a call to
> __ubsan_handle_out_of_bounds, and then it "falls through" to the next
> function.
> 
> And yes, I absolutely *detest* how clang does that. Not only does it
> cause objtool sanity checking issues, it fundamentally means that we
> can never treat UBSAN warnings as warnings. They are always fatal.
> 
> This is a *huge* clang mis-feature, and I forget what we decided last
> that we saw it.

Gotcha, thanks for digging into that. I must admit I didn't look any
further at it, but figured it was worth reporting...
Kees Cook March 17, 2023, 7:50 p.m. UTC | #6
On Fri, Mar 17, 2023 at 11:59:11AM -0700, Nick Desaulniers wrote:
> +Sanitizer folks (BCC'd)
> (Top of lore thread:
> https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/)
> 
> On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > Side note - when doing the usual allmodconfig builds with gcc-12 and
> > > clang before sending them out, for the latter I see this warning being
> > > spewed with clang-15:
> > >
> > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()
> > >
> > > Obviously not related to my changes, but mentioning it in case it has
> > > been missed as I know you love squeaky clean builds :-). Doesn't happen
> > > with clang-14.
> >
> > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc,
> > and only my own "normal config" builds with clang.
> >
> > So I don't see this particular issue and my builds are still squeaky clean.
> >
> > That said, when I explicitly try that allmodconfig thing with clang, I
> > can see it too. And the reason seems to be something we've seen
> > before: UBSAN functions being considered non-return by clang, so clang
> > generates code like this:
> >
> >    ....
> > .LBB24_3:
> >         callq   __sanitizer_cov_trace_pc@PLT
> >         movl    $2, %esi
> >         movq    $.L__unnamed_3, %rdi
> >         callq   __ubsan_handle_out_of_bounds
> > .Lfunc_end24:
> >         .size   m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt
> >
> > ie the last thing in that m5mols_set_fmt() function is a call to
> > __ubsan_handle_out_of_bounds, and then it "falls through" to the next
> > function.
> >
> > And yes, I absolutely *detest* how clang does that. Not only does it
> > cause objtool sanity checking issues, it fundamentally means that we
> > can never treat UBSAN warnings as warnings. They are always fatal.

I've hit these cases a few times too. The __ubsan_handle_* stuff
is designed to be recoverable. I think there are some cases where
we're tripping over Clang bugs, though. Some of the past issues
have been with Clang thinking some UBSAN feature was trap-only
(e.g. -fsanitizer=local-bounds), but here it actually generated the call,
but decided it was no-return. *sigh*

> > But I suspect we need to disable UBSAN for clang, because clang gets
> > this so *horribly* wrong.

Which is to say, it normally gets it right, but there are some instances
where things go weird. If it was horribly wrong, there would be a LOT
more objtool warnings. :)

I'm not opposed to disabling UBSAN for all*config builds if we need to,
but I want to get these Clang bugs found and fixed so I'd be sad to lose
the coverage.
Linus Torvalds March 17, 2023, 8:30 p.m. UTC | #7
On Fri, Mar 17, 2023 at 12:50 PM Kees Cook <keescook@chromium.org> wrote:
>
> I've hit these cases a few times too. The __ubsan_handle_* stuff
> is designed to be recoverable. I think there are some cases where
> we're tripping over Clang bugs, though. Some of the past issues
> have been with Clang thinking some UBSAN feature was trap-only
> (e.g. -fsanitizer=local-bounds), but here it actually generated the call,
> but decided it was no-return. *sigh*

Hmm. The problem here is that we only get an objdump warning by pure luck.

This isn't a "objdump will always catch it", it's more like "objdump
will only catch it if the code happens to be moved to the end of the
function".

So I'm very happy to hear that it's not by design, and that's very good.

But the whole "this is a compiler bug" doesn't exactly give me the
warm and fuzzies either.

The code generation looks *really* odd to me. This is, as far as I can
tell, the code sequence that this code has:

m5mols_set_fmt:
  ..
        movq    %rdx, %rbp
  ..
        movl    16(%rbp), %ebx
        movq    %rbx, %rdi
  ..
        cmpq    $16385, %rbx                    # imm = 0x4001
        jne     .LBB24_1
  ..
.LBB24_1:
        cmpl    $8199, %ebx                     # imm = 0x2007
        jne     .LBB24_3
  ..
.LBB24_3:
        callq   __sanitizer_cov_trace_pc@PLT
        movl    $2, %esi
        movq    $.L__unnamed_3, %rdi
        callq   __ubsan_handle_out_of_bounds

and as far as I can tell, that "movl 16(%rbp), %ebx" is

        enum m5mols_restype stype = __find_restype(mf->code);

in __find_resolution(). But I don't understand those odd constants
it's comparing against at all.

'enum m5mols_restype' should be in the range 0..2, but it seems to use
a 64-bit compare against '16385' (and then a 32-bit compare against
'8199') to decide it's out-of-bounds.

I must be *completely* mis-reading this, because none of that makes a
whit of sense to me.

It's possible that clang is just *so* terminally confused that it just
generates random code, but it's more likely that I'm mis-reading
things.

The above is my interpretation of what I see with the current F37
clang version for a "allmodconfig" build of that file.

My 'clang -v' says

    clang version 15.0.7 (Fedora 15.0.7-1.fc37)

in case some clang person wants to try to recreate this.

> I'm not opposed to disabling UBSAN for all*config builds if we need to,
> but I want to get these Clang bugs found and fixed so I'd be sad to lose
> the coverage.

I wonder how many people actually end up having UBSAN actually
enabled. I get the feeling that most of the coverage it gets is
exactly just the compile-only coverage by "allmodconfig" and friends.

Otherwise it would be really easy to just say

        depends on !COMPILE_TEST

for all these run-time debug things. But exactly *because* they have
caused endless amounts of build-time pain - *and* because I doubt how
much real run-time testing they get - limiting it to non-build tests
sounds a bit counter-productive.

              Linus
Miguel Ojeda March 17, 2023, 8:31 p.m. UTC | #8
On Fri, Mar 17, 2023 at 7:50 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Debug support *MUST* have an option to just continue. Sure, make "this
> is fatal" be an option *too*, because if you are a developer, maybe
> you really want the "fail hard, fail quickly" behavior.

At least for userspace it has different modes -- the default is to
report and continue:

    -fsanitize=...: print a verbose error report and continue
execution (default);
    -fno-sanitize-recover=...: print a verbose error report and exit
the program;
    -fsanitize-trap=...: execute a trap instruction (doesn’t require
UBSan run-time support).
    -fno-sanitize=...: disable any check, e.g., -fno-sanitize=alignment.

    (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#usage)

For the kernel I assume it is meant to work due to `UBSAN_TRAP`, but
the optimizer may be getting in the way.

From a quick look, it seems that `__find_restype` (which gets inlined
into `m5mols_set_fmt` via `__find_resolution`) has a small loop:

    do {
        if (code == m5mols_default_ffmt[type].code)
            return type;
    } while (type++ != SIZE_DEFAULT_FFMT);

that I guess gets simplified into something like:

    if (code == first code)
        yield 0;
    if (code == second code)
        yield 1;
    out of bounds;

When UBSAN is disabled, it may be assuming:

    if (code == first code)
        yield 0;
    yield 1;

Thus no issue.

Though even with UBSAN disabled, if I add a dummy opaque call inside
the loop, it goes back to something like before, except the jump goes
nowhere and then `objtool` complains again:

    .LBB24_20:
        callq    dummydummydummy
    .Lfunc_end24:

So it is reproducible even without UBSAN getting involved:
https://godbolt.org/z/hTe91b3eG

Cheers,
Miguel
Nick Desaulniers March 17, 2023, 8:35 p.m. UTC | #9
On Fri, Mar 17, 2023 at 12:50 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Mar 17, 2023 at 11:59:11AM -0700, Nick Desaulniers wrote:
> > +Sanitizer folks (BCC'd)
> > (Top of lore thread:
> > https://lore.kernel.org/linux-block/9d0ef355-f430-e8e2-c844-b34cfcf60d88@kernel.dk/)
> >
> > On Fri, Mar 17, 2023 at 11:35 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Fri, Mar 17, 2023 at 10:16 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > > Side note - when doing the usual allmodconfig builds with gcc-12 and
> > > > clang before sending them out, for the latter I see this warning being
> > > > spewed with clang-15:
> > > >
> > > > drivers/media/i2c/m5mols/m5mols.o: warning: objtool: m5mols_set_fmt() falls through to next function m5mols_get_frame_desc()
> > > >
> > > > Obviously not related to my changes, but mentioning it in case it has
> > > > been missed as I know you love squeaky clean builds :-). Doesn't happen
> > > > with clang-14.
> > >
> > > Hmm. I have clang-15 too, but I do the allmodconfig builds with gcc,
> > > and only my own "normal config" builds with clang.
> > >
> > > So I don't see this particular issue and my builds are still squeaky clean.
> > >
> > > That said, when I explicitly try that allmodconfig thing with clang, I
> > > can see it too. And the reason seems to be something we've seen
> > > before: UBSAN functions being considered non-return by clang, so clang
> > > generates code like this:
> > >
> > >    ....
> > > .LBB24_3:
> > >         callq   __sanitizer_cov_trace_pc@PLT
> > >         movl    $2, %esi
> > >         movq    $.L__unnamed_3, %rdi
> > >         callq   __ubsan_handle_out_of_bounds
> > > .Lfunc_end24:
> > >         .size   m5mols_set_fmt, .Lfunc_end24-m5mols_set_fmt
> > >
> > > ie the last thing in that m5mols_set_fmt() function is a call to
> > > __ubsan_handle_out_of_bounds, and then it "falls through" to the next
> > > function.
> > >
> > > And yes, I absolutely *detest* how clang does that. Not only does it
> > > cause objtool sanity checking issues, it fundamentally means that we
> > > can never treat UBSAN warnings as warnings. They are always fatal.
>
> I've hit these cases a few times too. The __ubsan_handle_* stuff
> is designed to be recoverable. I think there are some cases where
> we're tripping over Clang bugs, though. Some of the past issues
> have been with Clang thinking some UBSAN feature was trap-only
> (e.g. -fsanitizer=local-bounds), but here it actually generated the call,
> but decided it was no-return. *sigh*

I think no-return is a red herring (or rather, I don't think noreturn
is at play here at all). Looking at the IR, I don't see anything that
indicated anything was deduced to be noreturn.

It looks like this is coming from the loop in __find_restype() being
fully unrolled.

On the initial iteration, `type` == `M5MOLS_RESTYPE_MONITOR` == 0.
`m5mols_default_ffmt` is a 2 element array.
If we don't match we loop again, `type` == `M5MOLS_RESTYPE_CAPTURE` == 1.
`SIZE_DEFAULT_FFMT` == ARRAY_SIZE(m5mols_default_ffmt) == 2, so we loop again.
`type` == 2, accessing m5mols_default_ffmt out of bounds.

Perhaps this code meant to simply use a for loop rather than do-while?
(Or pre-increment rather than post increment for the do-while)?

```
diff --git a/drivers/media/i2c/m5mols/m5mols_core.c
b/drivers/media/i2c/m5mols/m5mols_core.c
index 2b01873ba0db..603b1036127e 100644
--- a/drivers/media/i2c/m5mols/m5mols_core.c
+++ b/drivers/media/i2c/m5mols/m5mols_core.c
@@ -485,10 +485,9 @@ static enum m5mols_restype __find_restype(u32 code)
 {
        enum m5mols_restype type = M5MOLS_RESTYPE_MONITOR;

-       do {
+       for (; type != M5MOLS_RESTYPE_MAX; ++type)
                if (code == m5mols_default_ffmt[type].code)
                        return type;
-       } while (type++ != SIZE_DEFAULT_FFMT);

        return 0;
 }
```

>
> > > But I suspect we need to disable UBSAN for clang, because clang gets
> > > this so *horribly* wrong.

I think Linus is thinking about,
commit e5d523f1ae8f ("ubsan: disable UBSAN_DIV_ZERO for clang")

I can see the loop unroller inserting a branch on "poison" which is a
magic value in LLVM related to but distinct from "undef".  That gets
replaced with an "unreachable" instruction.
I wonder if we can change loop unroller to not insert branch on poison
when the sanitizers are enabled, or freeze poison.
https://llvm.org/devmtg/2020-09/slides/Lee-UndefPoison.pdf

Maybe Linus has thoughts he can share on this thread:
https://github.com/llvm/llvm-project/issues/56289?

Finally, there's always `-mllvm -trap-unreachable` though that's a
last resort kind of thing; `-mllvm` flags need to be passed to the
linker for LTO, and compiler for non-LTO and LTO.  I do think in this
case that the fallthough is bringing to our attention an issue in the
source.

>
> Which is to say, it normally gets it right, but there are some instances
> where things go weird. If it was horribly wrong, there would be a LOT
> more objtool warnings. :)
>
> I'm not opposed to disabling UBSAN for all*config builds if we need to,
> but I want to get these Clang bugs found and fixed so I'd be sad to lose
> the coverage.
>
> --
> Kees Cook



--
Thanks,
~Nick Desaulniers
Miguel Ojeda March 17, 2023, 8:42 p.m. UTC | #10
On Fri, Mar 17, 2023 at 9:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> 'enum m5mols_restype' should be in the range 0..2, but it seems to use
> a 64-bit compare against '16385' (and then a 32-bit compare against
> '8199') to decide it's out-of-bounds.

It is comparing against just the `.code` in the `m5mols_default_ffmt`
table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and
`MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see
https://elixir.bootlin.com/linux/latest/source/drivers/media/i2c/m5mols/m5mols_core.c#L52.

So it is basically:

    if (code == m5mols_default_ffmt[0].code)
        return type;
    if (type == 2)
        continue;
    ++type;

    if (code == m5mols_default_ffmt[1].code)
        return type;
    if (type == 2)
        continue;
    ++type;

    m5mols_default_ffmt[2].code -> out of bounds UB -> unreachable

If the condition had `++type` instead, it would not be a problem,
because the loop stops before we go into the out of bounds access thus
no UB.

Cheers,
Miguel
Linus Torvalds March 17, 2023, 8:45 p.m. UTC | #11
On Fri, Mar 17, 2023 at 1:31 PM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> From a quick look, it seems that `__find_restype` (which gets inlined
> into `m5mols_set_fmt` via `__find_resolution`) has a small loop:
>
>     do {
>         if (code == m5mols_default_ffmt[type].code)
>             return type;
>     } while (type++ != SIZE_DEFAULT_FFMT);
>
> that I guess gets simplified into something like:
>
>     if (code == first code)
>         yield 0;
>     if (code == second code)
>         yield 1;
>     out of bounds;

Ahh. That explains the odd constants I saw. I did figure out it had
something to do with loading 'mf->code', but then it went off the
rails.

> Though even with UBSAN disabled, if I add a dummy opaque call inside
> the loop, it goes back to something like before, except the jump goes
> nowhere and then `objtool` complains again:
>
>     .LBB24_20:
>         callq    dummydummydummy
>     .Lfunc_end24:
>
> So it is reproducible even without UBSAN getting involved:
> https://godbolt.org/z/hTe91b3eG

That code generation is some crazy stuff.

Yeah, that's not acceptable, and the bug is clearly not UBSAN, but
some generic bogus clang thing.

Much nicer to try to debug this as a "objtool complains about the
generated code" rather than as some insane runtine problem with code
falling off the end of the function.

                Linus
Linus Torvalds March 17, 2023, 8:51 p.m. UTC | #12
On Fri, Mar 17, 2023 at 1:42 PM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> It is comparing against just the `.code` in the `m5mols_default_ffmt`
> table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and
> `MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see

Yeah, I see what it's doing.

But:

> If the condition had `++type` instead, it would not be a problem,
> because the loop stops before we go into the out of bounds access thus
> no UB.

Yeah, but clang really should have generated a proper third iteration,
which calls that "out of bounds" case, and then returns, instead fo
falling off the end.

I do think that on the kernel side, the fix is to just change

        } while (type++ != SIZE_DEFAULT_FFMT);

to

        } while (++type != SIZE_DEFAULT_FFMT);

but I would *really* like clang to be fixed to not silently generate
code that does insane things and would be basically impossible to
debug if it ever triggers.

We would have spent a *lot* of time wondering how the heck we Oopsed
in m5mols_get_frame_desc().

             Linus
Linus Torvalds March 17, 2023, 9 p.m. UTC | #13
On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> but I would *really* like clang to be fixed to not silently generate
> code that does insane things and would be basically impossible to
> debug if it ever triggers.

Side note: the key word here is "silently".

If clang notices that it generates crazy code, a warning at build-time
would be preferable to the "oh, we noticed the crazy code generation
because we do sanity checking that just happened to catch it".

If that "fall-through due to impossible third iteration" had happened
*inside* the function, rather than at the end, we'd have never
noticed.

So we were kind of lucky in just where things triggered this time.
Maybe that's always going to be true due to some random internal clang
logic ("dead code fallthrough is always at the end because of how we
sort basic blocks"), but I don't see why it would be in general.

                      Linus
Miguel Ojeda March 17, 2023, 9:01 p.m. UTC | #14
On Fri, Mar 17, 2023 at 9:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, I see what it's doing.

Yeah, sorry, I saw your later email after sending that one.

> Yeah, but clang really should have generated a proper third iteration,
> which calls that "out of bounds" case, and then returns, instead fo
> falling off the end.
>
> I do think that on the kernel side, the fix is to just change
>
>         } while (type++ != SIZE_DEFAULT_FFMT);
>
> to
>
>         } while (++type != SIZE_DEFAULT_FFMT);
>
> but I would *really* like clang to be fixed to not silently generate
> code that does insane things and would be basically impossible to
> debug if it ever triggers.

Not sure how easy is for them to realize that they should do a 3rd
iteration. But perhaps it would be possible that Clang/LLVM does a
similar check to objtool and at least emit a warning about similar
situations that would help developers diagnose this (since it should
have way more information about what happened than objtool).

Cheers,
Miguel
Nick Desaulniers March 17, 2023, 10:56 p.m. UTC | #15
On Fri, Mar 17, 2023 at 2:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > but I would *really* like clang to be fixed to not silently generate
> > code that does insane things and would be basically impossible to
> > debug if it ever triggers.
>
> Side note: the key word here is "silently".
>
> If clang notices that it generates crazy code, a warning at build-time
> would be preferable to the "oh, we noticed the crazy code generation
> because we do sanity checking that just happened to catch it".

That's fair.  I have something hacked up locally that can spot the
fallthough from m5mols_set_fmt() as objtool did. With some polish, we
can likely ship that as a compiler warning.  Then we can have these
checks regardless of objtool arch support.

First I need to teach LLVM that __stack_chk_fail is noreturn, though
I've only verified that thus far in glibc, musl, and bionic; I still
need to check that's the case for the BSDs' libcs.
Linus Torvalds March 18, 2023, 6:20 p.m. UTC | #16
On Fri, Mar 17, 2023 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I do think that on the kernel side, the fix is to just change
>
>         } while (type++ != SIZE_DEFAULT_FFMT);
>
> to
>
>         } while (++type != SIZE_DEFAULT_FFMT);

Ok, I ended up deciding to just commit that minimal change, even
though it might have been better to just make it a normal for-loop
(and use M5MOLS_RESTYPE_MAX instead as the end condition).

So maybe it would be more legible (and less likely to have had that
off-by-one) if the loop had been

        for (type = 0; type < M5MOLS_RESTYPE_MAX; type++)

instead. But I'll leave that decision to the driver authors (now cc'd).

For people brought in late, this is now commit efbcbb12ee99 ("media:
m5mols: fix off-by-one loop termination error") with link to the
discussion here

    https://lore.kernel.org/linux-block/CAHk-=wgTSdKYbmB1JYM5vmHMcD9J9UZr0mn7BOYM_LudrP+Xvw@mail.gmail.com/

so you can see the history of it (me having initially blamed UBSAN,
but the problem can be triggered at least in theory without it).

                  Linus
Mauro Carvalho Chehab March 19, 2023, 12:48 a.m. UTC | #17
Em Fri, 17 Mar 2023 13:51:17 -0700
Linus Torvalds <torvalds@linux-foundation.org> escreveu:

> On Fri, Mar 17, 2023 at 1:42 PM Miguel Ojeda
> <miguel.ojeda.sandonis@gmail.com> wrote:
> >
> > It is comparing against just the `.code` in the `m5mols_default_ffmt`
> > table, i.e. the `MEDIA_BUS_FMT_VYUY8_2X8` (8199 = 0x2007) and
> > `MEDIA_BUS_FMT_JPEG_1X8` (16385 = 0x4001), see  
> 
> Yeah, I see what it's doing.
> 
> But:
> 
> > If the condition had `++type` instead, it would not be a problem,
> > because the loop stops before we go into the out of bounds access thus
> > no UB.  
> 
> Yeah, but clang really should have generated a proper third iteration,
> which calls that "out of bounds" case, and then returns, instead fo
> falling off the end.
> 
> I do think that on the kernel side, the fix is to just change
> 
>         } while (type++ != SIZE_DEFAULT_FFMT);
> 
> to
> 
>         } while (++type != SIZE_DEFAULT_FFMT);

Yeah, that seems to be the right fix to me too.

Ack on such change: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=efbcbb12ee99f750c9f25c873b55ad774871de2a

Regards,
Mauro


> 
> but I would *really* like clang to be fixed to not silently generate
> code that does insane things and would be basically impossible to
> debug if it ever triggers.
> 
> We would have spent a *lot* of time wondering how the heck we Oopsed
> in m5mols_get_frame_desc().
> 
>              Linus
> 



Thanks,
Mauro