Message ID: 20240606062655.2185006-1-yebin@huaweicloud.com (mailing list archive)
State: New, archived
Series: block: bio-integrity: fix potential null-ptr-deref in bio_integrity_free
What kernel is this on? As of Linux 6.9 we are now always freezing the queue while updating the logical_block_size in the nvme driver, so there should be no inflight I/O while it is changing.
On 2024/6/6 14:44, Christoph Hellwig wrote:
> What kernel is this on? As of Linux 6.9 we are now always freezing
v4.18
> the queue while updating the logical_block_size in the nvme driver,
> so there should be no inflight I/O while it is changing.
>
The root cause of the problem is that there is no concurrency protection
between issuing DIO checks in __blkdev_direct_IO_simple() and updating
logical block sizes, resulting in the block layer being able to see DIOs
that are not aligned with logical blocks.
On Thu, Jun 06, 2024 at 07:34:14PM +0800, yebin wrote:
> On 2024/6/6 14:44, Christoph Hellwig wrote:
> > What kernel is this on? As of Linux 6.9 we are now always freezing
> v4.18
> > the queue while updating the logical_block_size in the nvme driver,
> > so there should be no inflight I/O while it is changing.
>
> The root cause of the problem is that there is no concurrency protection
> between issuing DIO checks in __blkdev_direct_IO_simple() and updating
> logical block sizes, resulting in the block layer being able to see DIOs
> that are not aligned with logical blocks.

Yeah, that is one area queue freezing can't cover: logical block size
change. But I'd suggest putting the logical bs check into submit_bio(), or
at least into the slow path of __bio_queue_enter().

BTW, Yi has one reproducer, and the slab is corrupted just like in this
report when running 'nvme format' & IO on partitions. I am not sure if
this kind of change can avoid the issue completely; anyway, Yi and I can
test it and see whether it works.

My concern is that nvme format is started without draining IO, so IO can
be submitted to hardware while the nvme FW is handling the format. I am
not sure if the nvme FW can deal with this situation correctly. Ewan
suggested running 'nvme format' with an exclusive nvme disk open, which
needs an nvme-cli change.

Thanks,
Ming
On Thu, Jun 06, 2024 at 10:30:06PM +0800, Ming Lei wrote:
> Yeah, that is one area queue freezing can't cover: logical block size
> change. But I'd suggest putting the logical bs check into submit_bio(), or
> at least into the slow path of __bio_queue_enter().

We really need an alignment check in submit_bio anyway, so doing it
under the freeze protection would also help with this.

> My concern is that nvme format is started without draining IO, so IO can
> be submitted to hardware while the nvme FW is handling the format. I am
> not sure if the nvme FW can deal with this situation correctly. Ewan
> suggested running 'nvme format' with an exclusive nvme disk open, which
> needs an nvme-cli change.

.. and doesn't protect against someone using a different tool anyway.

That being said, nvme_passthru_start actually freezes all queues based
on the Commands Supported and Effects log, and
nvme_init_known_nvm_effects should force this even for controllers not
supporting the log or reporting bogus information. So in general the
queue should be frozen during the actual format.
On Thu, Jun 06, 2024 at 07:52:51AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 10:30:06PM +0800, Ming Lei wrote:
> > [...]
>
> We really need an alignment check in submit_bio anyway, so doing it
> under the freeze protection would also help with this.
>
> [...]
>
> .. and doesn't protect against someone using a different tool anyway.

It also doesn't work if the format is sent from a different host. If you
have a shared namespace, anyone could change the format without
coordinating with anyone else. That's just an unfixable gap in the
current spec. The driver tries to react to the Namespace Notice AEN for
this, but that can be too late, or not even enabled.

> That being said, nvme_passthru_start actually freezes all queues based
> on the Commands Supported and Effects log, and
> nvme_init_known_nvm_effects should force this even for controllers not
> supporting the log or reporting bogus information. So in general the
> queue should be frozen during the actual format.

Just for the single-host consideration: our current nvme freeze does
ensure IO is drained before dispatching the format, and should work for
most format changes.
On Thu, Jun 06, 2024 at 07:52:51AM -0700, Christoph Hellwig wrote:
> [...]
> That being said, nvme_passthru_start actually freezes all queues based
> on the Commands Supported and Effects log, and
> nvme_init_known_nvm_effects should force this even for controllers not
> supporting the log or reporting bogus information. So in general the
> queue should be frozen during the actual format.

That is something I missed, thanks for sharing the story.

Thanks,
Ming
On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> There's an issue as follows when doing an NVMe format with IO in flight:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> PGD 101727f067 P4D 1011fae067 PUD fbed78067 PMD 0
> Oops: 0000 [#1] SMP NOPTI
> RIP: 0010:kfree+0x4f/0x160
> RSP: 0018:ff705a800912b910 EFLAGS: 00010247
> RAX: 0000000000000000 RBX: 0d06d30000000000 RCX: ff4fb320260ad990
> RDX: ff4fb30ee7acba40 RSI: 0000000000000000 RDI: 00b04cff80000000
> RBP: ff4fb30ee7acba40 R08: 0000000000000200 R09: ff705a800912bb60
> R10: 0000000000000000 R11: ff4fb3103b67c750 R12: ffffffff9a62d566
> R13: ff4fb30aa0530000 R14: 0000000000000000 R15: 000000000000000a
> FS:  00007f4399b6b700(0000) GS:ff4fb31040140000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 0000001014cd4002 CR4: 0000000000761ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  bio_integrity_free+0xa6/0xb0
>  __bio_integrity_endio+0x8c/0xa0
>  bio_endio+0x2b/0x130
>  blk_update_request+0x78/0x2b0
>  blk_mq_end_request+0x1a/0x140
>  blk_mq_try_issue_directly+0x5d/0xc0
>  blk_mq_make_request+0x46b/0x540
>  generic_make_request+0x121/0x300
>  submit_bio+0x6c/0x140
>  __blkdev_direct_IO_simple+0x1ca/0x3a0
>  blkdev_direct_IO+0x3d9/0x460
>  generic_file_read_iter+0xb4/0xc60
>  new_sync_read+0x121/0x170
>  vfs_read+0x89/0x130
>  ksys_read+0x52/0xc0
>  do_syscall_64+0x5d/0x1d0
>  entry_SYSCALL_64_after_hwframe+0x65/0xca
>
> Assume a 512-byte direct IO is issued while the block device's logical
> block size is initially 512 bytes and is then changed to 4096 bytes.
> The above issue may happen as follows:
>
> Direct read                              format NVMe
> __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>  if ((pos | iov_iter_alignment(iter)) & (bdev_logical_block_size(bdev) - 1))
>    -->The logical block size is 512, and the IO issued is 512 bytes,
>       so the check passes
>    return -EINVAL;
>  submit_bio(&bio);
>                                          nvme_dev_ioctl
>                                           case NVME_IOCTL_RESCAN:
>                                            nvme_queue_scan(ctrl);
>                                            ...
>                                            nvme_update_disk_info(disk, ns, id);
>                                             blk_queue_logical_block_size(disk->queue, bs);
>                                             --> 512->4096
>  blk_queue_enter(q, flags)
>  blk_mq_make_request(q, bio)
>   bio_integrity_prep(bio)
>    len = bio_integrity_bytes(bi, bio_sectors(bio));
>    -->At this point, because the logical block size has increased to
>       4096 bytes, the calculated 'len' here is 0
>    buf = kmalloc(len, GFP_NOIO | q->bounce_gfp);
>    -->Passed len=0 and returned buf=16
>    end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>    start = ((unsigned long) buf) >> PAGE_SHIFT;
>    nr_pages = end - start;  -->nr_pages == 1
>    bip->bip_flags |= BIP_BLOCK_INTEGRITY;
>    for (i = 0 ; i < nr_pages ; i++) {
>      if (len <= 0)
>        -->The bip_vec of the bio_integrity payload is never initialized,
>           which results in a NULL pointer access on the subsequent free.
>           Even if it were initialized, the free would still access an
>           invalid pointer because the buffer address is bogus.
>        break;
>
> Firstly, it is unreasonable to format an NVMe device while IO is in
> flight. However, with this kind of concurrency the block layer can also
> see IO smaller than the logical block size. Such IO is expected to fail
> on the device, so exception handling should be added for it to prevent a
> NULL pointer access from crashing the system.

Actually, unaligned IO handling is a mess for nvme hardware. Yes, the IO
may fail, but it has been observed that the meta buffer is overwritten by
DMA during a read IO.

Ye and Yi, can you test the following patch in your 'nvme format' & IO
workload?

diff --git a/block/blk-core.c b/block/blk-core.c
index 82c3ae22d76d..a41ab4a3a398 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -336,6 +336,19 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	return 0;
 }
 
+static bool bio_unaligned(struct bio *bio)
+{
+	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
+
+	if (bio->bi_iter.bi_size & (bs - 1))
+		return true;
+
+	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
+		return true;
+
+	return false;
+}
+
 int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 {
 	while (!blk_try_enter_queue(q, false)) {
@@ -362,6 +375,15 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 				test_bit(GD_DEAD, &disk->state));
 		if (test_bit(GD_DEAD, &disk->state))
 			goto dead;
+		/*
+		 * Not like other queue limits, logical block size is one
+		 * fundamental limit which can't be covered by bio split.
+		 *
+		 * Device reconfiguration may happen and logical block size
+		 * is changed, so fail the IO if that is true.
+		 */
+		if (bio_unaligned(bio))
+			goto dead;
 	}
 
 	return 0;

Thanks,
Ming
On 2024/6/7 8:13, Ming Lei wrote:
> On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:
>> From: Ye Bin <yebin10@huawei.com>
>>
>> [...]
> Actually, unaligned IO handling is a mess for nvme hardware. Yes, the IO
> may fail, but it has been observed that the meta buffer is overwritten by
> DMA during a read IO.
>
> Ye and Yi, can you test the following patch in your 'nvme format' & IO
> workload?
>
> [...]
> +static bool bio_unaligned(struct bio *bio)
> +{
> +	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
> +
> +	if (bio->bi_iter.bi_size & (bs - 1))
> +		return true;
> +
> +	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
> +		return true;
> +
> +	return false;
> +}
I think this judgment is a bit incorrect: it should not be sufficient to
only check whether the length and the starting sector are aligned to the
logical block size.
> [...]
On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote:
> On 2024/6/7 8:13, Ming Lei wrote:
> > On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:
> > > [...]
> > [...]
> > +static bool bio_unaligned(struct bio *bio)
> > +{
> > +	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
> > +
> > +	if (bio->bi_iter.bi_size & (bs - 1))
> > +		return true;
> > +
> > +	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
> > +		return true;
> > +
> > +	return false;
> > +}
> I think this judgment is a bit incorrect: it should not be sufficient to
> only check whether the length and the starting sector are aligned to the
> logical block size.

Can you explain why the two are not enough? Other limits should be
handled by bio split.

Thanks,
Ming
On 2024/6/7 9:35, Ming Lei wrote:
> On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote:
>> On 2024/6/7 8:13, Ming Lei wrote:
>>> [...]
>>> +static bool bio_unaligned(struct bio *bio)
>>> +{
>>> +	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
>>> +
>>> +	if (bio->bi_iter.bi_size & (bs - 1))
>>> +		return true;
>>> +
>>> +	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
>>> +		return true;
>>> +
>>> +	return false;
>>> +}
>> I think this judgment is a bit incorrect: it should not be sufficient to
>> only check whether the length and the starting sector are aligned to the
>> logical block size.
> Can you explain why the two are not enough? Other limits should be
> handled by bio split.
If the logical block size is 512 bytes and a BIO has 8 segments, each
segment 512 bytes long, with bio->bi_iter.bi_sector == 0, then after the
logical block size changes to 4096 bytes, bio_unaligned() will still
return false. I'm not sure if the example I gave is appropriate?
On Tue, Jun 11, 2024 at 10:48:13AM +0800, yebin (H) wrote: > > > On 2024/6/7 9:35, Ming Lei wrote: > > On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote: > > > > > > On 2024/6/7 8:13, Ming Lei wrote: > > > > On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote: > > > > > From: Ye Bin <yebin10@huawei.com> > > > > > > > > > > There's a issue as follows when do format NVME with IO: > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 > > > > > PGD 101727f067 P4D 1011fae067 PUD fbed78067 PMD 0 > > > > > Oops: 0000 [#1] SMP NOPTI > > > > > RIP: 0010:kfree+0x4f/0x160 > > > > > RSP: 0018:ff705a800912b910 EFLAGS: 00010247 > > > > > RAX: 0000000000000000 RBX: 0d06d30000000000 RCX: ff4fb320260ad990 > > > > > RDX: ff4fb30ee7acba40 RSI: 0000000000000000 RDI: 00b04cff80000000 > > > > > RBP: ff4fb30ee7acba40 R08: 0000000000000200 R09: ff705a800912bb60 > > > > > R10: 0000000000000000 R11: ff4fb3103b67c750 R12: ffffffff9a62d566 > > > > > R13: ff4fb30aa0530000 R14: 0000000000000000 R15: 000000000000000a > > > > > FS: 00007f4399b6b700(0000) GS:ff4fb31040140000(0000) knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > CR2: 0000000000000008 CR3: 0000001014cd4002 CR4: 0000000000761ee0 > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 > > > > > PKRU: 55555554 > > > > > Call Trace: > > > > > bio_integrity_free+0xa6/0xb0 > > > > > __bio_integrity_endio+0x8c/0xa0 > > > > > bio_endio+0x2b/0x130 > > > > > blk_update_request+0x78/0x2b0 > > > > > blk_mq_end_request+0x1a/0x140 > > > > > blk_mq_try_issue_directly+0x5d/0xc0 > > > > > blk_mq_make_request+0x46b/0x540 > > > > > generic_make_request+0x121/0x300 > > > > > submit_bio+0x6c/0x140 > > > > > __blkdev_direct_IO_simple+0x1ca/0x3a0 > > > > > blkdev_direct_IO+0x3d9/0x460 > > > > > generic_file_read_iter+0xb4/0xc60 > > > > > new_sync_read+0x121/0x170 > > > > > 
vfs_read+0x89/0x130 > > > > > ksys_read+0x52/0xc0 > > > > > do_syscall_64+0x5d/0x1d0 > > > > > entry_SYSCALL_64_after_hwframe+0x65/0xca > > > > > > > > > > Assuming a 512 byte directIO is issued, the initial logical block size of > > > > > the state block device is 512 bytes, and then modified to 4096 bytes. > > > > > Above issue may happen as follows: > > > > > Direct read format NVME > > > > > __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > > > > if ((pos | iov_iter_alignment(iter)) & (bdev_logical_block_size(bdev) - 1)) > > > > > -->The logical block size is 512, and the IO issued is 512 bytes, > > > > > which can be checked > > > > > return -EINVAL; > > > > > submit_bio(&bio); > > > > > nvme_dev_ioctl > > > > > case NVME_IOCTL_RESCAN: > > > > > nvme_queue_scan(ctrl); > > > > > ... > > > > > nvme_update_disk_info(disk, ns, id); > > > > > blk_queue_logical_block_size(disk->queue, bs); > > > > > --> 512->4096 > > > > > blk_queue_enter(q, flags) > > > > > blk_mq_make_request(q, bio) > > > > > bio_integrity_prep(bio) > > > > > len = bio_integrity_bytes(bi, bio_sectors(bio)); > > > > > -->At this point, because the logical block size has increased to > > > > > 4096 bytes, the calculated 'len' here is 0 > > > > > buf = kmalloc(len, GFP_NOIO | q->bounce_gfp); > > > > > -->Passed in len=0 and returned buf=16 > > > > > end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT; > > > > > start = ((unsigned long) buf) >> PAGE_SHIFT; > > > > > nr_pages = end - start; -->nr_pages == 1 > > > > > bip->bip_flags |= BIP_BLOCK_INTEGRITY; > > > > > for (i = 0 ; i < nr_pages ; i++) { > > > > > if (len <= 0) > > > > > -->Not initializing the bip_vec of bio_integrity, will result > > > > > in null pointer access during subsequent releases. Even if > > > > > initialized, it will still cause subsequent releases access > > > > > null pointer because the buffer address is incorrect. 
> > > > > break; > > > > > > > > > > Firstly, it is unreasonable to format NVME in the presence of IO. It is also > > > > > possible to see IO smaller than the logical block size in the block layer for > > > > > this type of concurrency. It is expected that this type of IO device will > > > > > return an error, so exception handling should also be done for this type of > > > > > IO to prevent null pointer access from causing system crashes. > > > > Actually unaligned IO handling is one mess for nvme hardware. Yes, IO may fail, > > > > but it is observed that meta buffer is overwrite by DMA in read IO. > > > > > > > > Ye and Yi, can you test the following patch in your 'nvme format' & IO workload? > > > > > > > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > > > index 82c3ae22d76d..a41ab4a3a398 100644 > > > > --- a/block/blk-core.c > > > > +++ b/block/blk-core.c > > > > @@ -336,6 +336,19 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) > > > > return 0; > > > > } > > > > +static bool bio_unaligned(struct bio *bio) > > > > +{ > > > > + unsigned int bs = bdev_logical_block_size(bio->bi_bdev); > > > > + > > > > + if (bio->bi_iter.bi_size & (bs - 1)) > > > > + return true; > > > > + > > > > + if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1)) > > > > + return true; > > > > + > > > > + return false; > > > > +} > > > I think this judgment is a bit incorrect. It should not be sufficient to > > > only determine whether > > > the length and starting sector are logically block aligned. > > Can you explain why the two are not enough? Other limits should be handled > > by bio split. > If logical block size is 512 bytes, BIO has 4 segments, each segment length > is 512 bytes, > bio->bi_iter.bi_sector == 0. If logical block size change to 4096 bytes, > bio_unaligned() will > return false. Yes, this IO is still 4096 aligned in block size level. 
It is just that each bvec buffer isn't page-aligned; for nvme, if
virt_boundary is set, this bio will be split. However, we don't add a
logical block size check in submit_bio_noacct() yet, so a 512-byte bio can
still be sent to the device.

> I'm not sure if the example I gave is appropriate?

Absolutely, it is a good example.

BTW, Yi has tested both your patch and my patch which checks the lbs in
blk_queue_enter(), and it looks like the slab corruption can still be
triggered with either one.

Yi, can you test the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index 82c3ae22d76d..c47e69795c86 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -336,6 +336,19 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	return 0;
 }
 
+static inline bool bio_unaligned(struct bio *bio)
+{
+	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
+
+	if (bio->bi_iter.bi_size & (bs - 1))
+		return true;
+
+	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
+		return true;
+
+	return false;
+}
+
 int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 {
 	while (!blk_try_enter_queue(q, false)) {
@@ -362,6 +375,15 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 				test_bit(GD_DEAD, &disk->state));
 		if (test_bit(GD_DEAD, &disk->state))
 			goto dead;
+		/*
+		 * Not like other queue limits, logical block size is one
+		 * fundamental limit which can't be covered by bio split.
+		 *
+		 * Device reconfiguration may happen and logical block size
+		 * is changed, so fail the IO if that is true.
+		 */
+		if (bio_unaligned(bio))
+			goto dead;
 	}
 
 	return 0;
@@ -765,6 +787,8 @@ void submit_bio_noacct(struct bio *bio)
 
 	if (should_fail_bio(bio))
 		goto end_io;
+	if (bio->bi_iter.bi_size && bio_unaligned(bio))
+		goto end_io;
 	bio_check_ro(bio);
 	if (!bio_flagged(bio, BIO_REMAPPED)) {
 		if (unlikely(bio_check_eod(bio)))

Thanks,
Ming
…
> The root cause of this issue is the concurrency between the write process
> and the block size update process. However, this concurrency does not
> exist in actual production environments. To solve the above issue, verify
> that the segments of the BIO are aligned with the integrity intervals.

* Please improve the change description with an imperative wording.
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc2#n94

* Would you like to add the tag “Fixes” accordingly?

* What do you think about using the summary phrase
  “Avoid null pointer dereference in bio_integrity_free()”?

Regards,
Markus
On 2024/6/11 11:29, Ming Lei wrote:
> On Tue, Jun 11, 2024 at 10:48:13AM +0800, yebin (H) wrote:
>>
>> On 2024/6/7 9:35, Ming Lei wrote:
>>> On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote:
>>>> On 2024/6/7 8:13, Ming Lei wrote:
>>>>> On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:

[...]

>> I'm not sure if the example I gave is appropriate?
> Absolutely it is one good example.
> BTW, Yi has tested both your patch and my patch which checks the lbs in
> blk_queue_enter(), and it looks like the slab corruption can still be
> triggered with either one.

Yes, my patch only fixes the integrity path at a single point, so as to
avoid affecting the normal IO path. I am not sure whether other paths have
similar issues.

> Yi, can you test the following patch?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 82c3ae22d76d..c47e69795c86 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c

[...]

> @@ -765,6 +787,8 @@ void submit_bio_noacct(struct bio *bio)
>
>  	if (should_fail_bio(bio))
>  		goto end_io;
> +	if (bio->bi_iter.bi_size && bio_unaligned(bio))
> +		goto end_io;

I think this check should be added after the bio_queue_enter() call;
checking before entering the queue is unreliable, since the logical block
size can still change afterwards.

>  	bio_check_ro(bio);
>  	if (!bio_flagged(bio, BIO_REMAPPED)) {
>  		if (unlikely(bio_check_eod(bio)))
>
> Thanks,
> Ming
On Mon, Jun 17, 2024 at 11:49:59AM +0800, yebin (H) wrote:
>
> On 2024/6/11 11:29, Ming Lei wrote:
> > On Tue, Jun 11, 2024 at 10:48:13AM +0800, yebin (H) wrote:
> > >
> > > On 2024/6/7 9:35, Ming Lei wrote:
> > > > On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote:
> > > > > On 2024/6/7 8:13, Ming Lei wrote:
> > > > > > On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:

[...]

> > BTW, Yi have tested both your patch and my patch which checks lbs in
> > blk_queue_enter(), looks slab corruption still can be triggered with
> > either one.
> Yes, my patch only solves the integrity process in a single point to avoid
> affecting the normal IO process.
> I am not sure if other processes will have similar issues.
>
> > Yi, can you test the following patch?
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 82c3ae22d76d..c47e69795c86 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c

[...]

> > @@ -765,6 +787,8 @@ void submit_bio_noacct(struct bio *bio)
> >
> >  	if (should_fail_bio(bio))
> >  		goto end_io;
> > +	if (bio->bi_iter.bi_size && bio_unaligned(bio))
> > +		goto end_io;
> I think this check should be added after the bio_queue_enter() call. Early
> judgment should be unreliable.

The test patch checks bio alignment in both __bio_queue_enter() and
submit_bio_noacct(), and it should solve this trouble if that is the
reason. The final fix can then merge the two checks into a single one in
blk_try_enter_queue() or __bio_queue_enter(), which may not add extra cost
if both ->chunk_sectors and ->logical_block_size are placed in a single
cache line.
Thanks,
Ming
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2e3e8e04961e..00a0d1bafe06 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -431,7 +431,7 @@ bool bio_integrity_prep(struct bio *bio)
 	void *buf;
 	unsigned long start, end;
 	unsigned int len, nr_pages;
-	unsigned int bytes, offset, i;
+	unsigned int bytes, offset, i, intervals;
 
 	if (!bi)
 		return true;
@@ -457,7 +457,13 @@ bool bio_integrity_prep(struct bio *bio)
 	}
 
 	/* Allocate kernel buffer for protection data */
-	len = bio_integrity_bytes(bi, bio_sectors(bio));
+	intervals = bio_integrity_intervals(bi, bio_sectors(bio));
+	if (unlikely((bio->bi_vcnt && intervals < bio->bi_vcnt) ||
+		     (!bio->bi_vcnt && intervals < bio_segments(bio)))) {
+		printk(KERN_ERR "BIO segments are not aligned according to integrity interval\n");
+		goto err_end_io;
+	}
+	len = intervals * bi->tuple_size;
 	buf = kmalloc(len, GFP_NOIO);
 	if (unlikely(buf == NULL)) {
 		printk(KERN_ERR "could not allocate integrity buffer\n");