| Message ID | 20190125021107.4595-1-zhangxiaoxu5@huawei.com (mailing list archive) |
|---|---|
| State | Not Applicable, archived |
| Delegated to | Mike Snitzer |
| Series | [v2] block: Fix a WRITE SAME BUG_ON |
Hi. I have read a bit of DM code and spent an hour reviewing this... I
didn't get to the point of knowing what the right fix for the problem
is, and I may have a wrong understanding, but I have two thoughts
about the patch:

I don't think this is the right solution, for two reasons:

In the first place, if it's an LVM-only issue, we should fix it only
for device-mapper devices. If this is the right way to fix it,
possibly the way to do that would be to change DM's calls to
blk_queue_max_write_same_sectors() to set the max sectors to more
than 0 if and only if the logical block sizes match.

In the second place, I don't think this is the problem. The line of
code that's calling BUG_ON is, I think:

    BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

This is because write_same bios are supposed to have a single sector
of data at the beginning of a page in their bio_iovec(bio) [I think,
based on a commit message I've read] -- in other words, bio_iovec(bio)
is supposed to say something like { .page = something, .offset = 0,
.len = 1 native sector }. But clearly, since the BUG_ON is being
called, one of these is not true. Have you added a log statement right
before the BUG_ON() to print out bio_offset(bio) and
bio_iovec(bio).bv_len to see which value defies expectations?

I would be happy to help you trace through this or attempt to
reproduce it myself -- what stack of devices can you reproduce this
on? Is this a dm-linear device on top of a disk? Does it have a
filesystem on top, and if so, what filesystem?

Thank you!

John Dorminy

On Fri, Jan 25, 2019 at 9:53 AM Zhang Xiaoxu <zhangxiaoxu5@huawei.com> wrote:
>
> If the lvm is stacked by different logical_block_size disks,
> when WRITE SAME on it, will bug_on:
>
> kernel BUG at drivers/scsi/sd.c:968!
> invalid opcode: 0000 [#1] SMP PTI
> CPU: 11 PID: 525 Comm: kworker/11:1H Tainted: G O 5.0.0-rc3+ #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:sd_init_command+0x7aa/0xdb0
> Code: 30 75 00 00 c7 85 44 01 00 00 18 00 00 00 0f 85 fa 04 00 00 48 83 c4 40 48
> 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b6 ca fe ff <0f> 0b 41 bc 09
> RSP: 0018:ffffb55f80ddbca0 EFLAGS: 00010206
> RAX: 0000000000001000 RBX: ffff9ed23fb927a0 RCX: 000000000000f000
> RDX: ffff9ed23f0a8400 RSI: ffff9ed27bc79800 RDI: 0000000000000000
> RBP: ffff9ed23fb92680 R08: ffff9ed27c8c0000 R09: ffff9ed23fb927d8
> R10: 0000000000000000 R11: fefefefefefefeff R12: ffff9ed27bc79800
> R13: ffff9ed2787a0000 R14: ffff9ed27bdf3400 R15: ffff9ed23fb927a0
> FS: 0000000000000000(0000) GS:ffff9ed27c8c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f6b14cf9341 CR3: 0000000069058000 CR4: 00000000000006e0
> Call Trace:
>  ? vp_notify+0x12/0x20
>  scsi_queue_rq+0x525/0xa30
>  blk_mq_dispatch_rq_list+0x8d/0x580
>  ? syscall_return_via_sysret+0x10/0x7f
>  ? elv_rb_del+0x1f/0x30
>  ? deadline_remove_request+0x55/0xc0
>  blk_mq_do_dispatch_sched+0x76/0x110
>  blk_mq_sched_dispatch_requests+0xf9/0x170
>  __blk_mq_run_hw_queue+0x51/0xd0
>  process_one_work+0x195/0x380
>  worker_thread+0x30/0x390
>  ? process_one_work+0x380/0x380
>  kthread+0x113/0x130
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40
> Modules linked in: alloc(O+)
> ---[ end trace dc92ddeb2e6d1fe5 ]---
>
> The logical_block_size of the LVM is the max value of the sub disks,
> it maybe different with one of the sub disk. when WRITE SAME on the
> disk, it will BUG_ON when setup WRITE SAME cmd.
>
> Close WRITE_SAME feature on the LVM if it was stacked by different
> logical_block_size disk.
>
> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
> ---
>  block/blk-settings.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 3e7038e475ee..e4664280edb4 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -497,8 +497,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
>  	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
>  	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
> -	t->max_write_same_sectors = min(t->max_write_same_sectors,
> -					b->max_write_same_sectors);
>  	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
>  					  b->max_write_zeroes_sectors);
>  	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
> @@ -537,6 +535,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  		}
>  	}
>
> +	/* If the logical block size is different, forbid write same */
> +	if (t->logical_block_size != b->logical_block_size &&
> +	    t->max_write_same_sectors != UINT_MAX)
> +		t->max_write_same_sectors = 0;
> +	else
> +		t->max_write_same_sectors = min(t->max_write_same_sectors,
> +						b->max_write_same_sectors);
> +
>  	t->logical_block_size = max(t->logical_block_size,
>  				    b->logical_block_size);
>
> --
> 2.14.4

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
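The invariant John describes -- exactly one logical block of payload, starting at offset 0 of its page -- can be modeled with a tiny standalone sketch (plain Python, purely illustrative; the function name is made up and this is not kernel code):

```python
# Illustrative model (not kernel code) of the sd.c check quoted above:
# BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size)
def write_same_payload_ok(bio_offset: int, bv_len: int, sector_size: int) -> bool:
    """A WRITE SAME bio must carry exactly one logical block of
    payload, starting at offset 0 of its page."""
    return bio_offset == 0 and bv_len == sector_size

# A 512-byte payload at offset 0 is fine on a 512-byte-LBS disk...
assert write_same_payload_ok(0, 512, 512)
# ...but a 4096-byte payload, sized for a stacked device's rounded-up
# logical block size, trips the check on a 512-byte component device.
assert not write_same_payload_ok(0, 4096, 512)
```

Under the stacked-LVM scenario reported in this thread, bv_len is the volume's (maximum) logical block size, so it is the second case that fires.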
Hi,

Thanks for your reply.

The BUG_ON fires because 'bio_iovec(bio).bv_len' is not the same as
'sdp->sector_size'. In the reproducer below, the VG 'golden' is stacked
from the disks 'sda' (logical_block_size=512) and 'sdb'
(logical_block_size=4096). Calling 'blkdev_issue_write_same' on it hits
the BUG_ON because bio->bi_io_vec->bv_len =
bdev_logical_block_size('golden'), and bdev_logical_block_size('golden')
= max_logical_block_size(sda, sdb).

There are two possible solutions to the problem:

1. Disable WRITE SAME for that situation, as this patch does.
2. Improve WRITE SAME to handle that situation: reorganize the
   WRITE SAME bio for the disks whose logical_block_size is smaller
   than the logical volume's.

For now, I think we should disable WRITE SAME for this situation.

Reproducer:

1. Start qemu with the parameters:

   -drive file=/home/qemu/disk/512k.img,if=none,format=raw,cache=none,id=data.1,discard=on \
   -device scsi-hd,drive=data.1,id=vdata.1 \
   -drive file=/home/qemu/disk/4096k.img,if=none,format=raw,cache=none,id=data.2,discard=on \
   -device scsi-hd,drive=data.2,id=vdata.2,logical_block_size=4096,physical_block_size=4096 \

2. Create the LV:

   vgremove golden -y
   vgcreate golden /dev/sda /dev/sdb
   lvcreate -L1G -n testvol -i 2 golden -y

3.
Call 'blkdev_issue_write_same' on the device:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/blkdev.h>

static int __init hello_module_init(void)
{
	struct block_device *dev;
	int err;

	printk(KERN_ERR "/dev/golden/testvol\n");
	dev = lookup_bdev("/dev/golden/testvol");
	if (IS_ERR(dev)) {
		pr_warn("lookup blk error!\n");
		return 0;
	}

	err = blkdev_get(dev, FMODE_WRITE | FMODE_READ | FMODE_EXCL,
			 (void *)&err);
	if (err) {
		pr_warn("get blk error!\n");
		return 0;
	}

	err = blkdev_issue_write_same(dev, 8, 0xf8, GFP_NOIO, ZERO_PAGE(0));
	printk(KERN_ERR "blkdev_issue_write_same return %d\n", err);
	return 0;
}

static void __exit hello_module_exit(void)
{
	pr_warn("\n");
}

module_init(hello_module_init);
module_exit(hello_module_exit);

MODULE_DESCRIPTION("hello module");
MODULE_LICENSE("GPL");

On 1/26/2019 7:17 PM, John Dorminy wrote:
> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
>
> I don't think this is the right solution for two reasons:
>
> In the first place, if it's an LVM-only issue, we should fix it only
> for device-mapper devices. If this is the right way to fix it,
> possibly the way to do that would be to change DM calls to
> blk_queue_max_write_same_sectors() to only set the max sectors to more
> than 0 if and only if the logical block sizes match.
>
> In the second place, I don't think this is the problem.
> The line of code that's calling BUG_ON is, I think,
> BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
>
> This is because write_same bios are supposed to have a single sector
> of data at the beginning of a page in their bio_iovec(bio) [I think,
> based on a commit message I've read] -- in other words, bio_iovec(bio)
> is supposed to say something like { .page = something, .offset = 0,
> .len = 1 native sector }. But clearly, since the BUG_ON is being
> called, one of these is not true. Have you added a log statement right
> before the BUG_ON() to print out bio_offset(bio) and
> bio_iovec(bio).bv_len to see which value defies expectations?
>
> I would be happy to help you trace through this or attempt to
> reproduce it myself -- what stack of devices can you reproduce this
> on? Is this a dm-linear device on top of a disk? Does it have a
> filesystem on top, and if so, what filesystem?
>
> Thank you!
>
> John Dorminy
>
> On Fri, Jan 25, 2019 at 9:53 AM Zhang Xiaoxu <zhangxiaoxu5@huawei.com> wrote:
>>
>> If the lvm is stacked by different logical_block_size disks,
>> when WRITE SAME on it, will bug_on:
>>
>> kernel BUG at drivers/scsi/sd.c:968!
>> invalid opcode: 0000 [#1] SMP PTI
>> [...]
>> ---[ end trace dc92ddeb2e6d1fe5 ]---
>>
>> The logical_block_size of the LVM is the max value of the sub disks,
>> it maybe different with one of the sub disk. when WRITE SAME on the
>> disk, it will BUG_ON when setup WRITE SAME cmd.
>>
>> Close WRITE_SAME feature on the LVM if it was stacked by different
>> logical_block_size disk.
>>
>> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
>> ---
>>  block/blk-settings.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> [...]
>>
>> --
>> 2.14.4
On Sat, Jan 26 2019 at 6:17am -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
>
> I don't think this is the right solution for two reasons:
>
> In the first place, if it's an LVM-only issue, we should fix it only
> for device-mapper devices. If this is the right way to fix it,
> possibly the way to do that would be to change DM calls to
> blk_queue_max_write_same_sectors() to only set the max sectors to more
> than 0 if and only if the logical block sizes match.

There is no way this is specific to lvm (or DM). It may _seem_ that way
because lvm/dm are in the business of creating stacked devices --
thereby exposing users to blk_stack_limits().

I'll have a closer look at this issue, hopefully tomorrow, but Zhang
Xiaoxu's proposed fix looks bogus to me. Not disputing there is an
issue; it just feels like a different fix is needed.

Mike
Mike,

>> In the first place, if it's an LVM-only issue, we should fix it only
>> for device-mapper devices. If this is the right way to fix it,
>> possibly the way to do that would be to change DM calls to
>> blk_queue_max_write_same_sectors() to only set the max sectors to
>> more than 0 if and only if the logical block sizes match.
>
> There is no way this is specific to lvm (or DM). It may _seem_ that way
> because lvm/dm are in the business of creating stacked devices --
> whereby exposing users to blk_stack_limits().
>
> I'll have a closer look at this issue, hopefully tomorrow, but Zhang
> Xiaoxu's proposed fix looks bogus to me. Not disputing there is an
> issue, just feels like a different fix is needed.

It's caused by a remnant of the old bio payload hack in sd.c:

	BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

We rounded up the LBS when we created the DM device, and therefore the
bv_len coming down is 4K. But one of the component devices has an LBS
of 512 and fails this check.

At first glance one could argue we should just nuke the BUG_ON since the
sd code no longer relies on bv_len. However, the semantics for WRITE
SAME are particularly challenging in this scenario. Say the filesystem
wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
component device only has a 512-byte LBS, we would end up writing zeroes
to the entire 4K block on that component device instead of the correct
pattern. Not good.

So disallowing WRITE SAME unless all component devices have the same LBS
is the correct fix.

That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
non-zero payload. The target code ends up manually iterating, if I
remember correctly...
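The corruption scenario Martin describes can be made concrete with a small model (plain Python, purely illustrative; the helper below is hypothetical, not a kernel function): a device replicates exactly one of *its own* logical blocks, so a component device with a smaller LBS replicates only the first slice of the 4K pattern.

```python
SECTOR = 512

def write_same(payload: bytes, device_lbs: int, length: int) -> bytes:
    """Model of WRITE SAME semantics: the device replicates one of
    its own logical blocks of payload across the whole range."""
    block = payload[:device_lbs]
    return block * (length // device_lbs)

# A 4K payload: 512 bytes of zeroes, then ones, then twos, etc.
payload = b"".join(bytes([i]) * SECTOR for i in range(8))

# On a 4Kn device the whole pattern survives.
assert write_same(payload, 4096, 4096) == payload

# On a 512-byte-LBS component device, only the first 512 bytes (all
# zeroes) get replicated -- the rest of the pattern is silently lost.
assert write_same(payload, 512, 4096) == bytes(4096)
```

The last assertion is exactly the "zeroes to the entire 4K block" outcome: same command, same payload, different data on disk depending on the component device's logical block size.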
On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
> non-zero payload. The target code ends up manually iterating, if I
> remember correctly...

I had a series to remove it a while ago, but you wanted to keep it at
that point. I can resurrect it.
On 1/29/2019 4:52 PM, Christoph Hellwig wrote:
> On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
>> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
>> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
>> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
>> non-zero payload. The target code ends up manually iterating, if I
>> remember correctly...
>
> I had a series to remove it a while ago, but you wanted to keep it at
> that point. I can resurrect it.

Do you mean to remove the `blkdev_issue_write_same` interface?

I think we should fix the problem first, because other users may also
encounter the same problem.
On Mon, Jan 28, 2019 at 11:54 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> We rounded up LBS when we created the DM device. And therefore the
> bv_len coming down is 4K. But one of the component devices has a LBS of
> 512 and fails this check.
>
> At first glance one could argue we should just nuke the BUG_ON since the
> sd code no longer relies on bv_len. However, the semantics for WRITE
> SAME are particularly challenging in this scenario. Say the filesystem
> wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
> followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
> component device only has a 512-byte LBS, we would end up writing zeroes
> to the entire 4K block on that component device instead of the correct
> pattern. Not good.
>
> So disallowing WRITE SAME unless all component devices have the same LBS
> is the correct fix.

Alternately, could WRITE_SAME bios possibly be accepted with the
minimum sector size of the stack rather than the max, e.g. 512 in this
example rather than 4k? They'd need to have a granularity of the
larger sector size, though, presumably necessitating new queue limits
write_same_{granularity,block_size}, which might be too much work. For
devices with bigger sectors, the block layer or DM would need to
expand the small-sector payload to an appropriate larger-sector
payload, but it would preserve the ability to use WRITE_SAME with
non-zero payloads.

(I use WRITE_SAME to fill devices with a particular pattern in order
to catch failures to initialize disk structures appropriately,
personally, but it's just for convenience/speed.)
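The expansion step John mentions could, in principle, look like the following sketch (plain Python, illustrative only; the helper name is hypothetical, and the write_same_{granularity,block_size} limits are his proposal, not existing code): a small-sector payload is tiled up to a bigger-LBS component's block size before submission, which preserves WRITE SAME semantics only when the pattern genuinely repeats at the smaller granularity.

```python
def expand_payload(payload: bytes, small_lbs: int, large_lbs: int) -> bytes:
    """Tile a small-sector WRITE SAME payload up to a larger LBS.
    Only meaningful when large_lbs is a multiple of small_lbs and the
    pattern repeats every small_lbs bytes."""
    if large_lbs % small_lbs != 0:
        raise ValueError("incompatible logical block sizes")
    return payload[:small_lbs] * (large_lbs // small_lbs)

# A 512-byte pattern destined for a 4Kn component device becomes a
# 4096-byte payload repeating the same pattern eight times.
pat = bytes(range(256)) * 2                  # 512 bytes
expanded = expand_payload(pat, 512, 4096)
assert len(expanded) == 4096
assert expanded == pat * 8
```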
On 1/30/2019 10:08 PM, John Dorminy wrote:
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumabily necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work. For
> devices with bigger sectors, the block layer or DM would need to
> expand the small-sector payload to an appropriate larger-sector
> payload, but it would preserve the ability to use WRITE_SAME with
> non-zero payloads.
>
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

I think two LBSs would produce ambiguity. For reference, the spec says:

Information technology - SCSI Block Commands - 4 (SBC-4)
ISO/IEC 14776-324:201x BSR INCITS 506:201x

5.50 WRITE SAME (10) command

The WRITE SAME (10) command (see table 145) requests that the device
server **transfer a single logical block** from the Data-Out Buffer
and for each LBA in the specified range of LBAs:
John,

>> So disallowing WRITE SAME unless all component devices have the same LBS
>> is the correct fix.
>
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumabily necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work.

I don't have a problem restricting the buffer contents to be consistent
within a page. Or even changing the upper layer semantics to specify the
buffer contents using a single byte (0x00..0xff). But the issue of head
and tail remains if there is a block size mismatch, so it's important
that we keep scaling the logical block size up when stacking and reject
any bio that can't be honored on a 4Kn device.

> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

The intent was for stuff like MD to use it to initialize parity disks,
etc. But adoption has been pretty slow. I don't have any problem keeping
WRITE_SAME around if people are actually using it. It just seemed like
most active users only cared about writing zeroes.
On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

How do you use it? We don't have a user interface to generate
REQ_OP_WRITE_SAME requests.
On Thu, Jan 31, 2019 at 5:39 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > (I use WRITE_SAME to fill devices with a particular pattern in order
> > to catch failures to initialize disk structures appropriately,
> > personally, but it's just for convenience/speed.)
>
> How do you use it? We don't have a user interface to generate
> REQ_OP_WRITE_SAME requests.

A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
exposing a sysfs node to trigger filling a block device with a test
pattern.
On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
> > How do you use it? We don't have a user interface to generate
> > REQ_OP_WRITE_SAME requests.
>
> A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
> exposing a sysfs node to trigger filling a block device with a test
> pattern.

Any reason you don't just use SCSI/NVMe passthrough directly from
userspace for that?
I didn't know such a thing existed... does it work on any block
device? Where do I read more about this?

On Fri, Feb 1, 2019 at 2:35 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
> > A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
> > exposing a sysfs node to trigger filling a block device with a test
> > pattern.
>
> Any reason you don't just use SCSI/NVMe passthrough directly from
> userspace for that?
On 2/1/19 3:09 PM, John Dorminy wrote:
> I didn't know such a thing existed... does it work on any block
> device? Where do I read more about this?

Use sg_write_same(8) from the sg3_utils package. For instance,

    sg_write_same --in=foobarfile --lba=0 --num=20000 --xferlen=512 /dev/sdwhatever

will read the pattern to write from the file 'foobarfile', with the
length explicitly set to 512 bytes (rather than derived from
foobarfile's size), and write it 20000 times starting at LBA 0 to
/dev/sdwhatever.
On Fri, Feb 01, 2019 at 05:03:40PM +0100, Heinz Mauelshagen wrote:
> Use sg_write_same(8) from package sg3_utils.
>
> For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000
> --xferlen=512 /dev/sdwhatever'
>
> will read the pattern to write same from file 'foobarfile' with length
> explicitly set to 512 bytes (rather than derived from foobarfile's
> size) and write it 20000 times starting at LBA 0 to /dev/sdwhatever.

Yeah. Note that this will only work on SCSI disks (and maybe ATA for a
very specific corner case). But for actual devices, and not remappers,
the same is true of REQ_OP_WRITE_SAME.
Any progress on the problem?

Should we disable write same when stacking disks with different logical
block sizes?

On 2/2/2019 12:18 AM, Christoph Hellwig wrote:
> Yeah. Note that this will only work on SCSI disks (and maybe ATA
> for a very specific corner case). But for actual devices, and not
> remappers, the same is true of REQ_OP_WRITE_SAME.
zhangxiaoxu,

> Any progress on the problem?
> Should we disable write same when stacking disks with different
> logical block sizes?

Yes, please.
Any comments about the patch?

On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
>
> zhangxiaoxu,
>
>> Any progress on the problem?
>> Should we disable write same when stacking disks with different
>> logical block sizes?
>
> Yes, please.
Ping. Can anyone help merge this patch? Or is there any other solution?

On 2/14/2019 5:36 PM, zhangxiaoxu (A) wrote:
> Any comments about the patch?
>
> On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
>>
>> zhangxiaoxu,
>>
>>> Any progress on the problem?
>>> Should we disable write same when stacking disks with different
>>> logical block sizes?
>>
>> Yes, please.
Hi Zhang,

> Ping.
>
> Can anyone help merge this patch? Or is there any other solution?

+	/* If the logical block size is different, forbid write same */
+	if (t->logical_block_size != b->logical_block_size &&
+	    t->max_write_same_sectors != UINT_MAX)
+		t->max_write_same_sectors = 0;
+	else
+		t->max_write_same_sectors = min(t->max_write_same_sectors,
+						b->max_write_same_sectors);
+

I am not particularly keen on this UINT_MAX magic. I would prefer to
have the stacking driver's default for the lbs be set to 0, so the
stacking function could avoid special-casing the first iteration. But I
am not sure whether that would break any assumptions in DM/MD wrt. the
logical block size being non-zero prior to calling the stacking
function.

Mike? Any comments?

If we stick with the UINT_MAX check, the comment should at least point
out why it's there.
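The alternative Martin sketches -- no UINT_MAX magic, with the top device's logical_block_size defaulting to 0 until something has been stacked -- might look like this (plain Python, purely illustrative of the control flow, not a kernel patch):

```python
def stack_write_same(t_lbs, t_ws, b_lbs, b_ws):
    """Sketch of the suggested stacking rule: if the top device has no
    LBS yet (0 means nothing stacked so far), just adopt the bottom's
    limits; disable WRITE SAME only on a genuine LBS mismatch."""
    if t_lbs and t_lbs != b_lbs:
        ws = 0                          # mixed LBS: forbid WRITE SAME
    else:
        ws = min(t_ws, b_ws)
    return max(t_lbs, b_lbs), ws

UINT_MAX = (1 << 32) - 1                # stand-in for the kernel default

# First component stacked: the top starts at lbs=0, so no
# special-casing of the first iteration is needed.
lbs, ws = stack_write_same(0, UINT_MAX, 512, 2048)
assert (lbs, ws) == (512, 2048)

# A second component with a different LBS disables WRITE SAME.
lbs, ws = stack_write_same(lbs, ws, 4096, 4096)
assert ws == 0 and lbs == 4096
```

Whether DM/MD can tolerate a zero logical_block_size before stacking is exactly the open question raised above; this sketch only shows the control flow that would fall out if they can.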
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3e7038e475ee..e4664280edb4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -497,8 +497,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
-	t->max_write_same_sectors = min(t->max_write_same_sectors,
-					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					  b->max_write_zeroes_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
@@ -537,6 +535,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		}
 	}

+	/* If the logical block size is different, forbid write same */
+	if (t->logical_block_size != b->logical_block_size &&
+	    t->max_write_same_sectors != UINT_MAX)
+		t->max_write_same_sectors = 0;
+	else
+		t->max_write_same_sectors = min(t->max_write_same_sectors,
+						b->max_write_same_sectors);
+
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);
If the lvm is stacked from disks with different logical_block_size
values, a WRITE SAME on it will hit a BUG_ON:

kernel BUG at drivers/scsi/sd.c:968!
invalid opcode: 0000 [#1] SMP PTI
CPU: 11 PID: 525 Comm: kworker/11:1H Tainted: G O 5.0.0-rc3+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:sd_init_command+0x7aa/0xdb0
Code: 30 75 00 00 c7 85 44 01 00 00 18 00 00 00 0f 85 fa 04 00 00 48 83 c4 40 48
89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b6 ca fe ff <0f> 0b 41 bc 09
RSP: 0018:ffffb55f80ddbca0 EFLAGS: 00010206
RAX: 0000000000001000 RBX: ffff9ed23fb927a0 RCX: 000000000000f000
RDX: ffff9ed23f0a8400 RSI: ffff9ed27bc79800 RDI: 0000000000000000
RBP: ffff9ed23fb92680 R08: ffff9ed27c8c0000 R09: ffff9ed23fb927d8
R10: 0000000000000000 R11: fefefefefefefeff R12: ffff9ed27bc79800
R13: ffff9ed2787a0000 R14: ffff9ed27bdf3400 R15: ffff9ed23fb927a0
FS: 0000000000000000(0000) GS:ffff9ed27c8c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b14cf9341 CR3: 0000000069058000 CR4: 00000000000006e0
Call Trace:
 ? vp_notify+0x12/0x20
 scsi_queue_rq+0x525/0xa30
 blk_mq_dispatch_rq_list+0x8d/0x580
 ? syscall_return_via_sysret+0x10/0x7f
 ? elv_rb_del+0x1f/0x30
 ? deadline_remove_request+0x55/0xc0
 blk_mq_do_dispatch_sched+0x76/0x110
 blk_mq_sched_dispatch_requests+0xf9/0x170
 __blk_mq_run_hw_queue+0x51/0xd0
 process_one_work+0x195/0x380
 worker_thread+0x30/0x390
 ? process_one_work+0x380/0x380
 kthread+0x113/0x130
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
Modules linked in: alloc(O+)
---[ end trace dc92ddeb2e6d1fe5 ]---

The logical_block_size of the LVM volume is the max value across the
sub-disks, so it may differ from that of an individual sub-disk. A WRITE
SAME on the volume then hits the BUG_ON when the WRITE SAME command is
set up.

Disable the WRITE_SAME feature on the LVM volume if it is stacked from
disks with different logical_block_size values.
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
---
 block/blk-settings.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)