
usercopy whitelist woe in scsi_sense_cache

Message ID CAGXu5jLQuzpjFcfjpT=MvgOxBizFNMjnfmY+E7eCHkDAV5swHg@mail.gmail.com (mailing list archive)
State Not Applicable

Commit Message

Kees Cook April 17, 2018, 3:12 a.m. UTC
On Mon, Apr 16, 2018 at 1:44 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote:
>>> After fixing up some build issues in the middle of the 4.16 cycle, I
>>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
>>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
>>> to switch to doing several shorter test boots per kernel and see if
>>> that helps. One more bisect coming...
>>
>> Okay, so I can confirm the bisect points at the _merge_ itself, not a
>> specific patch. I'm not sure how to proceed here. It looks like some
>> kind of interaction between separate trees? Jens, do you have
>> suggestions on how to track this down?
>
> Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report:
>
> [   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
> [   38.274841] Read of size 22 at addr ffff8800122b8c4b by task smartctl/1064
> [   38.275630]
> [   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266
> [   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   38.277690] Call Trace:
> [   38.277988]  dump_stack+0x71/0xab
> [   38.278397]  ? _copy_to_user+0x42/0x60
> [   38.278833]  print_address_description+0x6a/0x270
> [   38.279368]  ? _copy_to_user+0x42/0x60
> [   38.279800]  kasan_report+0x243/0x360
> [   38.280221]  _copy_to_user+0x42/0x60
> [   38.280635]  sg_io+0x459/0x660
> ...
>
> Though we get slightly more details (some we already knew):
>
> [   38.301330] Allocated by task 329:
> [   38.301734]  kmem_cache_alloc_node+0xca/0x220
> [   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
> [   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
> [   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
> [   38.303820]  blk_mq_init_sched+0xf0/0x220
> [   38.304268]  elevator_switch+0x17a/0x2c0
> [   38.304705]  elv_iosched_store+0x173/0x220
> [   38.305171]  queue_attr_store+0x72/0xb0
> [   38.305602]  kernfs_fop_write+0x188/0x220
> [   38.306049]  __vfs_write+0xb6/0x330
> [   38.306436]  vfs_write+0xe9/0x240
> [   38.306804]  ksys_write+0x98/0x110
> [   38.307181]  do_syscall_64+0x6d/0x1d0
> [   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   38.308142]
> [   38.308316] Freed by task 0:
> [   38.308652] (stack is not available)
> [   38.309060]
> [   38.309243] The buggy address belongs to the object at ffff8800122b8c00
> [   38.309243]  which belongs to the cache scsi_sense_cache of size 96
> [   38.310625] The buggy address is located 75 bytes inside of
> [   38.310625]  96-byte region [ffff8800122b8c00, ffff8800122b8c60)

With a hardware watchpoint, I've isolated the corruption to here:

bfq_dispatch_request+0x2be/0x1610:
__bfq_dispatch_request at block/bfq-iosched.c:3902
3900            if (rq) {
3901    inc_in_driver_start_rq:
3902                    bfqd->rq_in_driver++;
3903    start_rq:
3904                    rq->rq_flags |= RQF_STARTED;
3905            }

Through some race condition(?), rq_in_driver ends up overlapping
sense_buffer, and it can get incremented. Specifically, I am doing:

        /* flush rq in flush machinery need to be dispatched directly */
        if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
@@ -461,6 +485,9 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 run:
        if (run_queue)
                blk_mq_run_hw_queue(hctx, async);
+
+       if (ok_to_go)
+               unregister_wide_hw_breakpoint(sample_hbp);
 }

 void blk_mq_sched_insert_requests(struct request_queue *q,

Where "ok_to_go" is wired up to a sysctl so I don't trip other things
(?) at boot-time.
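
For reference, a minimal sketch of how such a sysctl could be wired up
(hypothetical: the knob name and placement below are assumptions, since
the actual wiring isn't shown in the patch at the end of this thread):

#include <linux/init.h>
#include <linux/sysctl.h>

extern int ok_to_go;	/* defined in blk-mq-sched.c by the debug patch */

static struct ctl_table watch_table[] = {
	{
		.procname	= "watch_ok_to_go",
		.data		= &ok_to_go,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

static int __init watch_sysctl_init(void)
{
	/* shows up as /proc/sys/debug/watch_ok_to_go */
	register_sysctl("debug", watch_table);
	return 0;
}
late_initcall(watch_sysctl_init);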

I still haven't figured this out, though... anyone have a moment to look at this?

-Kees

Comments

Oleksandr Natalenko April 17, 2018, 9:19 a.m. UTC | #1
Hi.

17.04.2018 05:12, Kees Cook wrote:
>> Turning off HARDENED_USERCOPY and turning on KASAN, I see the same 
>> report:
>> 
>> [   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
>> [   38.274841] Read of size 22 at addr ffff8800122b8c4b by task smartctl/1064
>> [   38.275630]
>> [   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266
>> [   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> [   38.277690] Call Trace:
>> [   38.277988]  dump_stack+0x71/0xab
>> [   38.278397]  ? _copy_to_user+0x42/0x60
>> [   38.278833]  print_address_description+0x6a/0x270
>> [   38.279368]  ? _copy_to_user+0x42/0x60
>> [   38.279800]  kasan_report+0x243/0x360
>> [   38.280221]  _copy_to_user+0x42/0x60
>> [   38.280635]  sg_io+0x459/0x660
>> ...
>> 
>> Though we get slightly more details (some we already knew):
>> 
>> [   38.301330] Allocated by task 329:
>> [   38.301734]  kmem_cache_alloc_node+0xca/0x220
>> [   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
>> [   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
>> [   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
>> [   38.303820]  blk_mq_init_sched+0xf0/0x220
>> [   38.304268]  elevator_switch+0x17a/0x2c0
>> [   38.304705]  elv_iosched_store+0x173/0x220
>> [   38.305171]  queue_attr_store+0x72/0xb0
>> [   38.305602]  kernfs_fop_write+0x188/0x220
>> [   38.306049]  __vfs_write+0xb6/0x330
>> [   38.306436]  vfs_write+0xe9/0x240
>> [   38.306804]  ksys_write+0x98/0x110
>> [   38.307181]  do_syscall_64+0x6d/0x1d0
>> [   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [   38.308142]
>> [   38.308316] Freed by task 0:
>> [   38.308652] (stack is not available)
>> [   38.309060]
>> [   38.309243] The buggy address belongs to the object at ffff8800122b8c00
>> [   38.309243]  which belongs to the cache scsi_sense_cache of size 96
>> [   38.310625] The buggy address is located 75 bytes inside of
>> [   38.310625]  96-byte region [ffff8800122b8c00, ffff8800122b8c60)
> 
> With a hardware watchpoint, I've isolated the corruption to here:
> 
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900            if (rq) {
> 3901    inc_in_driver_start_rq:
> 3902                    bfqd->rq_in_driver++;
> 3903    start_rq:
> 3904                    rq->rq_flags |= RQF_STARTED;
> 3905            }
> 
> Through some race condition(?), rq_in_driver ends up overlapping
> sense_buffer, and it can get incremented.
> …
> I still haven't figured this out, though... anyone have a moment to
> look at this?

By any chance, have you tried to simplify the reproducer environment, or
does it still need my complex layout to trigger things even with KASAN?

Regards,
   Oleksandr
James Bottomley April 17, 2018, 10:02 a.m. UTC | #2
On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote:
> I still haven't figured this out, though... anyone have a moment to
> look at this?

Just to let you know you're not alone ... but I can't make any sense of
this either.  The bfqd is the elevator_data, which is initialised when
the scheduler is attached, so it shouldn't change.  Is it possible to
set a data breakpoint on elevator_data after it's initialised and see
if it got changed by something?
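
For reference, such a data watchpoint could be registered with the
kernel's hw_breakpoint API roughly as below (a sketch only: the handler
name and the "q" queue variable are assumed, and the calls mirror the
debug patch at the end of this thread):

static void elv_data_wp_handler(struct perf_event *bp,
				struct perf_sample_data *data,
				struct pt_regs *regs)
{
	pr_info("elevator_data was written\n");
	dump_stack();
}

/* ... after the scheduler has been attached to request_queue q: */
struct perf_event_attr attr;

hw_breakpoint_init(&attr);
attr.bp_addr = (uintptr_t)&q->elevator->elevator_data;
attr.bp_len  = HW_BREAKPOINT_LEN_8;
attr.bp_type = HW_BREAKPOINT_W;
register_wide_hw_breakpoint(&attr, elv_data_wp_handler, NULL);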

James
Kees Cook April 17, 2018, 4:25 p.m. UTC | #3
On Tue, Apr 17, 2018 at 2:19 AM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> By any chance, have you tried to simplify the reproducer environment, or
> does it still need my complex layout to trigger things even with KASAN?

I haven't tried minimizing the reproducer yet, no. Now that I have a
specific place to watch in the kernel for the corruption, though, that
might help. If I get stuck again today, I'll try it.

-Kees
Kees Cook April 17, 2018, 4:30 p.m. UTC | #4
On Tue, Apr 17, 2018 at 3:02 AM, James Bottomley
<jejb@linux.vnet.ibm.com> wrote:
> On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote:
>> I still haven't figured this out, though... anyone have a moment to
>> look at this?
>
> Just to let you know you're not alone ... but I can't make any sense of
> this either.  The bfqd is the elevator_data, which is initialised when
> the scheduler is attached, so it shouldn't change.  Is it possible to
> set a data breakpoint on elevator_data after it's initialised and see
> if it got changed by something?

Yeah, it seems like some pointer chain is getting overwritten outside
of a lock or RCU section(?). I don't know this code well enough to
guess at where to check, though. What I find so strange is that the
structure offsets are different between bfqd's rq_in_driver field and
scsi_request's sense field, so even THAT doesn't look to be clear-cut
either:

struct bfq_data {
        struct request_queue *     queue;                /*     0     8 */
        struct list_head           dispatch;             /*     8    16 */
        struct bfq_group *         root_group;           /*    24     8 */
        struct rb_root             queue_weights_tree;   /*    32     8 */
        struct rb_root             group_weights_tree;   /*    40     8 */
        int                        busy_queues;          /*    48     4 */
        int                        wr_busy_queues;       /*    52     4 */
        int                        queued;               /*    56     4 */
        int                        rq_in_driver;         /*    60     4 */
...

struct scsi_request {
        unsigned char              __cmd[16];            /*     0    16 */
        unsigned char *            cmd;                  /*    16     8 */
        short unsigned int         cmd_len;              /*    24     2 */

        /* XXX 2 bytes hole, try to pack */

        int                        result;               /*    28     4 */
        unsigned int               sense_len;            /*    32     4 */
        unsigned int               resid_len;            /*    36     4 */
        int                        retries;              /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        void *                     sense;                /*    48     8 */
...

This is _so_ weird. :P

-Kees
Kees Cook April 17, 2018, 4:42 p.m. UTC | #5
On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote:
> With a hardware watchpoint, I've isolated the corruption to here:
>
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900            if (rq) {
> 3901    inc_in_driver_start_rq:
> 3902                    bfqd->rq_in_driver++;
> 3903    start_rq:
> 3904                    rq->rq_flags |= RQF_STARTED;
> 3905            }

FWIW, the stacktrace here (removing the ? lines) is:

[   34.311980] RIP: 0010:bfq_dispatch_request+0x2be/0x1610
[   34.452491]  blk_mq_do_dispatch_sched+0x1d9/0x260
[   34.454561]  blk_mq_sched_dispatch_requests+0x3da/0x4b0
[   34.458789]  __blk_mq_run_hw_queue+0xae/0x130
[   34.460001]  __blk_mq_delay_run_hw_queue+0x192/0x280
[   34.460823]  blk_mq_run_hw_queue+0x10b/0x1b0
[   34.463240]  blk_mq_sched_insert_request+0x3bd/0x4d0
[   34.467342]  blk_execute_rq+0xcf/0x140
[   34.468483]  sg_io+0x2f7/0x730

Can anyone tell me more about the memory allocation layout of the
various variables here? It looks like struct request is a header in
front of struct scsi_request? How do struct elevator_queue, struct
blk_mq_ctx, and struct blk_mq_hw_ctx overlap these?

Regardless, I'll check for elevator data changing too...

-Kees
Jens Axboe April 17, 2018, 4:46 p.m. UTC | #6
On 4/17/18 10:42 AM, Kees Cook wrote:
> On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote:
>> With a hardware watchpoint, I've isolated the corruption to here:
>>
>> bfq_dispatch_request+0x2be/0x1610:
>> __bfq_dispatch_request at block/bfq-iosched.c:3902
>> 3900            if (rq) {
>> 3901    inc_in_driver_start_rq:
>> 3902                    bfqd->rq_in_driver++;
>> 3903    start_rq:
>> 3904                    rq->rq_flags |= RQF_STARTED;
>> 3905            }
> 
> FWIW, the stacktrace here (removing the ? lines) is:
> 
> [   34.311980] RIP: 0010:bfq_dispatch_request+0x2be/0x1610
> [   34.452491]  blk_mq_do_dispatch_sched+0x1d9/0x260
> [   34.454561]  blk_mq_sched_dispatch_requests+0x3da/0x4b0
> [   34.458789]  __blk_mq_run_hw_queue+0xae/0x130
> [   34.460001]  __blk_mq_delay_run_hw_queue+0x192/0x280
> [   34.460823]  blk_mq_run_hw_queue+0x10b/0x1b0
> [   34.463240]  blk_mq_sched_insert_request+0x3bd/0x4d0
> [   34.467342]  blk_execute_rq+0xcf/0x140
> [   34.468483]  sg_io+0x2f7/0x730
> 
> Can anyone tell me more about the memory allocation layout of the
> various variables here? It looks like struct request is a header in
> front of struct scsi_request? How do struct elevator_queue, struct
> blk_mq_ctx, and struct blk_mq_hw_ctx overlap these?

The scsi_request is a payload item for the block request, it's
located right after the request in memory. These are persistent
allocations, we don't allocate/free them per IO.

blk_mq_ctx are the blk-mq software queues, they are percpu and
allocated when the queue is setup.

blk_mq_hw_ctx is the hardware queue. You probably have just one,
it's allocated when the queue is setup.

struct elevator_queue is allocated when the scheduler is attached
to the queue. This can get freed and allocated if you switch
the scheduler on a queue, otherwise it persists until the queue
is torn down (and the scheduler data is freed).

> Regardless, I'll check for elevator data changing too...

It should not change unless you switch IO schedulers. If you're
using BFQ and not switching, then it won't change.
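
Concretely, the "located right after the request" relationship looks
roughly like this (paraphrased from the block and SCSI headers of that
era; a sketch, not a verbatim quote):

/* the per-request payload sits immediately after struct request
 * in the same allocation */
static inline void *blk_mq_rq_to_pdu(struct request *rq)
{
	return rq + 1;
}

/* for SCSI that payload is the scsi_request, so scsi_req(rq)->sense
 * lives directly behind the request it belongs to */
static inline struct scsi_request *scsi_req(struct request *rq)
{
	return blk_mq_rq_to_pdu(rq);
}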
Kees Cook April 17, 2018, 8:03 p.m. UTC | #7
On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote:
> With a hardware watchpoint, I've isolated the corruption to here:
>
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900            if (rq) {
> 3901    inc_in_driver_start_rq:
> 3902                    bfqd->rq_in_driver++;
> 3903    start_rq:
> 3904                    rq->rq_flags |= RQF_STARTED;
> 3905            }

This state continues to appear to be "impossible".

[   68.845979] watchpoint triggered
[   68.846462] sense before:ffff8b8f9f6aae00 after:ffff8b8f9f6aae01
[   68.847196] elevator before:ffff8b8f9a2c2000 after:ffff8b8f9a2c2000
[   68.847905] elevator_data before:ffff8b8f9a2c0400 after:ffff8b8f9a2c0400
...
[   68.856925] RIP: 0010:bfq_dispatch_request+0x99/0xad0
[   68.857553] RSP: 0018:ffff900280c63a58 EFLAGS: 00000082
[   68.858253] RAX: ffff8b8f9aefbe80 RBX: ffff8b8f9a2c0400 RCX: ffff8b8f9a2c0408
[   68.859201] RDX: ffff8b8f9a2c0408 RSI: ffff900280c63b34 RDI: 0000000000000001
[   68.860147] RBP: 0000000000000000 R08: 0000000f00000204 R09: 0000000000000000
[   68.861122] R10: ffff900280c63af0 R11: 0000000000000040 R12: ffff8b8f9aefbe00
[   68.862089] R13: ffff8b8f9a221950 R14: 0000000000000000 R15: ffff8b8f9a2c0770

Here we can see that the sense buffer has, once again, been
incremented. However, the "before" values for elevator and
elevator_data still match their expected values. As such, this should
be "impossible", since:

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
...
        if (rq) {
inc_in_driver_start_rq:
                bfqd->rq_in_driver++;
start_rq:
                rq->rq_flags |= RQF_STARTED;
        }
exit:
        return rq;
}

For rq_in_driver++ to touch sense, bfqd must be equal to scsi_request
- 12 bytes (rq_in_driver is at a 60-byte offset into struct bfq_data,
and sense is at a 48-byte offset into struct scsi_request).
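
(Restating that arithmetic as a hypothetical compile-time check, with
the offsets taken from the pahole output above for this particular
64-bit build:)

BUILD_BUG_ON(offsetof(struct bfq_data, rq_in_driver) != 60);
BUILD_BUG_ON(offsetof(struct scsi_request, sense)    != 48);

/*
 * &bfqd->rq_in_driver == &sreq->sense would require
 * (char *)bfqd + 60 == (char *)sreq + 48, i.e.
 * bfqd == (char *)sreq - 12
 */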

The above bfq_dispatch_request+0x99/0xad0 is still
__bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
removed. 0x99 is 153 decimal:

(gdb) disass bfq_dispatch_request
Dump of assembler code for function bfq_dispatch_request:
...
   0xffffffff8134b2ad <+141>:   test   %rax,%rax
   0xffffffff8134b2b0 <+144>:   je     0xffffffff8134b2bd <bfq_dispatch_request+157>
   0xffffffff8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
   0xffffffff8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
   0xffffffff8134b2bd <+157>:   orl    $0x2,0x18(%r12)
   0xffffffff8134b2c3 <+163>:   test   %ebp,%ebp
   0xffffffff8134b2c5 <+165>:   je     0xffffffff8134b2ce <bfq_dispatch_request+174>
   0xffffffff8134b2c7 <+167>:   mov    0x108(%r14),%rax
   0xffffffff8134b2ce <+174>:   mov    %r15,%rdi
   0xffffffff8134b2d1 <+177>:   callq  0xffffffff81706f90 <_raw_spin_unlock_irq>

Just as a sanity-check: at +157, %r12 should be rq, rq_flags is at
offset 0x18 from it, and $0x2 is RQF_STARTED, so that maps to
"rq->rq_flags |= RQF_STARTED", the next C statement. I don't know what
+146 is, though? An increment of something at a 0x100 (256) byte
offset? There's a lot of inlining and reordering happening here, so
I'm ignoring that for the moment.

So at +153 %rbx should be bfqd (i.e. elevator_data), but this is the
tripped instruction. The watchpoint dump shows RBX as ffff8b8f9a2c0400
... which matches elevator_data.

So, what can this be? Some sort of cache issue? By the time the
watchpoint handler captures the register information, it's already
masked the problem? I'm stumped again. :(

-Kees
Kees Cook April 17, 2018, 8:20 p.m. UTC | #8
On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote:
> The above bfq_dispatch_request+0x99/0xad0 is still
> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
> removed. 0x99 is 153 decimal:
>
> (gdb) disass bfq_dispatch_request
> Dump of assembler code for function bfq_dispatch_request:
> ...
>    0xffffffff8134b2ad <+141>:   test   %rax,%rax
>    0xffffffff8134b2b0 <+144>:   je     0xffffffff8134b2bd <bfq_dispatch_request+157>
>    0xffffffff8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
>    0xffffffff8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
>    0xffffffff8134b2bd <+157>:   orl    $0x2,0x18(%r12)
>    0xffffffff8134b2c3 <+163>:   test   %ebp,%ebp
>    0xffffffff8134b2c5 <+165>:   je     0xffffffff8134b2ce <bfq_dispatch_request+174>
>    0xffffffff8134b2c7 <+167>:   mov    0x108(%r14),%rax
>    0xffffffff8134b2ce <+174>:   mov    %r15,%rdi
>    0xffffffff8134b2d1 <+177>:   callq  0xffffffff81706f90 <_raw_spin_unlock_irq>
>
> Just as a sanity-check: at +157, %r12 should be rq, rq_flags is at
> offset 0x18 from it, and $0x2 is RQF_STARTED, so that maps to
> "rq->rq_flags |= RQF_STARTED", the next C statement. I don't know what
> +146 is, though? An increment of something at a 0x100 (256) byte
> offset? There's a lot of inlining and reordering happening here, so
> I'm ignoring that for the moment.

No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so
+146 is the offender.

[   29.284746] watchpoint @ ffff95d41a0fe580 triggered
[   29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a0fe580)
[   29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00
[   29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00
...
[   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
[   29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002
[   29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08

RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580,
exactly 0x100 away.

WTF is this addl?

-Kees
Kees Cook April 17, 2018, 8:25 p.m. UTC | #9
On Tue, Apr 17, 2018 at 1:20 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote:
>> The above bfq_dispatch_request+0x99/0xad0 is still
>> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
>> removed. 0x99 is 153 decimal:
>>
>> (gdb) disass bfq_dispatch_request
>> Dump of assembler code for function bfq_dispatch_request:
>> ...
>>    0xffffffff8134b2ad <+141>:   test   %rax,%rax
>>    0xffffffff8134b2b0 <+144>:   je     0xffffffff8134b2bd <bfq_dispatch_request+157>
>>    0xffffffff8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
>>    0xffffffff8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
>>    0xffffffff8134b2bd <+157>:   orl    $0x2,0x18(%r12)
>>    0xffffffff8134b2c3 <+163>:   test   %ebp,%ebp
>>    0xffffffff8134b2c5 <+165>:   je     0xffffffff8134b2ce <bfq_dispatch_request+174>
>>    0xffffffff8134b2c7 <+167>:   mov    0x108(%r14),%rax
>>    0xffffffff8134b2ce <+174>:   mov    %r15,%rdi
>>    0xffffffff8134b2d1 <+177>:   callq  0xffffffff81706f90 <_raw_spin_unlock_irq>
>>
>> Just as a sanity-check: at +157, %r12 should be rq, rq_flags is at
>> offset 0x18 from it, and $0x2 is RQF_STARTED, so that maps to
>> "rq->rq_flags |= RQF_STARTED", the next C statement. I don't know what
>> +146 is, though? An increment of something at a 0x100 (256) byte
>> offset? There's a lot of inlining and reordering happening here, so
>> I'm ignoring that for the moment.
>
> No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so
> +146 is the offender.
>
> [   29.284746] watchpoint @ ffff95d41a0fe580 triggered
> [   29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a0fe580)
> [   29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00
> [   29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00
> ...
> [   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
> [   29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002
> [   29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08
>
> RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580,
> exactly 0x100 away.
>
> WTF is this addl?

What are the chances? :P Two ++ statements in a row separated by a
collapsed goto. FML. :)

...
                        bfqq->dispatched++;
                        goto inc_in_driver_start_rq;
...
inc_in_driver_start_rq:
                bfqd->rq_in_driver++;
...

And there's the 0x100 (256):

struct bfq_queue {
...
        int                        dispatched;           /*   256     4 */

So bfqq is corrupted somewhere... I'll keep digging. I hope you're all
enjoying my live debugging transcript. ;)

-Kees
Jens Axboe April 17, 2018, 8:28 p.m. UTC | #10
On 4/17/18 2:25 PM, Kees Cook wrote:
> On Tue, Apr 17, 2018 at 1:20 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote:
>>> The above bfq_dispatch_request+0x99/0xad0 is still
>>> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
>>> removed. 0x99 is 153 decimal:
>>>
>>> (gdb) disass bfq_dispatch_request
>>> Dump of assembler code for function bfq_dispatch_request:
>>> ...
>>>    0xffffffff8134b2ad <+141>:   test   %rax,%rax
>>>    0xffffffff8134b2b0 <+144>:   je     0xffffffff8134b2bd <bfq_dispatch_request+157>
>>>    0xffffffff8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
>>>    0xffffffff8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
>>>    0xffffffff8134b2bd <+157>:   orl    $0x2,0x18(%r12)
>>>    0xffffffff8134b2c3 <+163>:   test   %ebp,%ebp
>>>    0xffffffff8134b2c5 <+165>:   je     0xffffffff8134b2ce <bfq_dispatch_request+174>
>>>    0xffffffff8134b2c7 <+167>:   mov    0x108(%r14),%rax
>>>    0xffffffff8134b2ce <+174>:   mov    %r15,%rdi
>>>    0xffffffff8134b2d1 <+177>:   callq  0xffffffff81706f90 <_raw_spin_unlock_irq>
>>>
>>> Just as a sanity-check: at +157, %r12 should be rq, rq_flags is at
>>> offset 0x18 from it, and $0x2 is RQF_STARTED, so that maps to
>>> "rq->rq_flags |= RQF_STARTED", the next C statement. I don't know what
>>> +146 is, though? An increment of something at a 0x100 (256) byte
>>> offset? There's a lot of inlining and reordering happening here, so
>>> I'm ignoring that for the moment.
>>
>> No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so
>> +146 is the offender.
>>
>> [   29.284746] watchpoint @ ffff95d41a0fe580 triggered
>> [   29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a0fe580)
>> [   29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00
>> [   29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00
>> ...
>> [   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
>> [   29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002
>> [   29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08
>>
>> RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580,
>> exactly 0x100 away.
>>
>> WTF is this addl?
> 
> What are the chances? :P Two ++ statements in a row separated by a
> collapsed goto. FML. :)
> 
> ...
>                         bfqq->dispatched++;
>                         goto inc_in_driver_start_rq;
> ...
> inc_in_driver_start_rq:
>                 bfqd->rq_in_driver++;
> ...
> 
> And there's the 0x100 (256):
> 
> struct bfq_queue {
> ...
>         int                        dispatched;           /*   256     4 */
> 
> So bfqq is corrupted somewhere... I'll keep digging. I hope you're all
> enjoying my live debugging transcript. ;)

It has to be the latter bfqq->dispatched increment, as those are
transient (and bfqd is not).

Adding Paolo.
Kees Cook April 17, 2018, 8:46 p.m. UTC | #11
On Tue, Apr 17, 2018 at 1:28 PM, Jens Axboe <axboe@kernel.dk> wrote:
> It has to be the latter bfqq->dispatched increment, as those are
> transient (and bfqd is not).

Yeah, and I see a lot of comments around the lifetime of rq and bfqq,
so I assume something is not being locked correctly.

#define RQ_BFQQ(rq)             ((rq)->elv.priv[1])

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
        struct request *rq = NULL;
        struct bfq_queue *bfqq = NULL;

        if (!list_empty(&bfqd->dispatch)) {
                rq = list_first_entry(&bfqd->dispatch, struct request,
                                      queuelist);
                list_del_init(&rq->queuelist);

                bfqq = RQ_BFQQ(rq);

                if (bfqq) {
                        /*
                         * Increment counters here, because this
                         * dispatch does not follow the standard
                         * dispatch flow (where counters are
                         * incremented)
                         */
                        bfqq->dispatched++;
...

I see elv.priv[1] assignments made in a few places -- is it possible
there is some kind of uninitialized-but-not-NULL state that can leak
in there?

bfq_prepare_request() assigns elv.priv[1], and bfq_insert_request()
only checks that it's non-NULL (if at all) in one case. Can
bfq_insert_request() get called without bfq_prepare_request() being
called first?
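
(One way to test that would be a throwaway assertion in the insert
path -- hypothetical instrumentation, not a proposed fix:)

static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
			       bool at_head)
{
...
	/* catch requests whose elv.priv[] was never set up by
	 * bfq_prepare_request() */
	WARN_ONCE(!RQ_BFQQ(rq), "bfq: inserting rq with NULL RQ_BFQQ\n");
...
}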

-Kees

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 25c14c58385c..f50d5053d732 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -6,6 +6,8 @@ 
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/blk-mq.h>
+#include <scsi/scsi_request.h>
+#include <linux/hw_breakpoint.h>

 #include <trace/events/block.h>

@@ -428,6 +430,18 @@  void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
        }
 }

+static void sample_hbp_handler(struct perf_event *bp,
+                               struct perf_sample_data *data,
+                               struct pt_regs *regs)
+{
+        printk(KERN_INFO "sense_buffer value is changed\n");
+        dump_stack();
+        printk(KERN_INFO "Dump stack from sample_hbp_handler\n");
+}
+
+struct perf_event * __percpu *sample_hbp;
+int ok_to_go;
+
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
                                 bool run_queue, bool async)
 {
@@ -435,6 +449,16 @@  void blk_mq_sched_insert_request(struct request *rq, bool at_head,
        struct elevator_queue *e = q->elevator;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+       struct perf_event_attr attr;
+       struct scsi_request *req = scsi_req(rq);
+
+       if (ok_to_go) {
+               hw_breakpoint_init(&attr);
+               attr.bp_addr = (uintptr_t)&(req->sense);
+               attr.bp_len = HW_BREAKPOINT_LEN_8;
+               attr.bp_type = HW_BREAKPOINT_W;
+               sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
+       }