Message ID | 580e4e3-b6b3-e291-282e-b57be178cec1@redhat.com (mailing list archive) |
---|---|
State | Accepted, archived |
Delegated to: | Mike Snitzer |
Headers | show |
Series | dm-integrity: align the outgoing bio in integrity_recheck | expand |
On Thu, Mar 21, 2024 at 05:48:45PM +0100, Mikulas Patocka wrote: > It may be possible to set up dm-integrity with smaller sector size than > the logical sector size of the underlying device. In this situation, > dm-integrity guarantees that the outgoing bios have the same alignment as > incoming bios (so, if you create a filesystem with 4k block size, > dm-integrity would send 4k-aligned bios to the underlying device). > > This guarantee was broken when integrity_recheck was implemented. > integrity_recheck sends bio that is aligned to ic->sectors_per_block. So > if we set up integrity with 512-byte sector size on a device with logical > block size 4k, we would be sending unaligned bio. This triggered a bug in > one of our internal tests. > > This commit fixes it - it determines what's the actual alignment of the > incoming bio and then makes sure that the outgoing bio in > integrity_recheck has the same alignment. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") > Cc: stable@vger.kernel.org > > --- > drivers/md/dm-integrity.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > Index: linux-2.6/drivers/md/dm-integrity.c > =================================================================== > --- linux-2.6.orig/drivers/md/dm-integrity.c 2024-03-21 14:25:45.000000000 +0100 > +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100 > @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s > struct bio_vec bv; > sector_t sector, logical_sector, area, offset; > struct page *page; > - void *buffer; > > get_area_and_offset(ic, dio->range.logical_sector, &area, &offset); > dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset, > @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s > logical_sector = dio->range.logical_sector; > > page = mempool_alloc(&ic->recheck_pool, GFP_NOIO); > - buffer = page_to_virt(page); > > __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) { > unsigned pos = 0; > > do { > + sector_t alignment; > char *mem; > + char *buffer = page_to_virt(page); > int r; > struct dm_io_request io_req; > struct dm_io_region io_loc; > @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s > io_loc.sector = sector; > io_loc.count = ic->sectors_per_block; > > + /* Align the bio to logical block size */ > + alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT); > + alignment &= -alignment; The above is less readable, :-( > + io_loc.sector = round_down(io_loc.sector, alignment); > + io_loc.count += sector - io_loc.sector; > + buffer += (sector - io_loc.sector) << SECTOR_SHIFT; > + io_loc.count = round_up(io_loc.count, alignment); I feel the above code isn't very reliable, what we need actually is to make sure that io's sector & size is aligned with dm's bdev_logical_block_size(bdev). Yeah, so far the max logical block size is 4k, but it may be increased in future and you can see the recent lsfmm proposal, so can we force it to be aligned with bdev_logical_block_size(bdev) here? Also can the above change work efficiently in case of 64K PAGE_SIZE? Thanks, Ming
On Fri, 22 Mar 2024, Ming Lei wrote: > On Thu, Mar 21, 2024 at 05:48:45PM +0100, Mikulas Patocka wrote: > > It may be possible to set up dm-integrity with smaller sector size than > > the logical sector size of the underlying device. In this situation, > > dm-integrity guarantees that the outgoing bios have the same alignment as > > incoming bios (so, if you create a filesystem with 4k block size, > > dm-integrity would send 4k-aligned bios to the underlying device). > > > > This guarantee was broken when integrity_recheck was implemented. > > integrity_recheck sends bio that is aligned to ic->sectors_per_block. So > > if we set up integrity with 512-byte sector size on a device with logical > > block size 4k, we would be sending unaligned bio. This triggered a bug in > > one of our internal tests. > > > > This commit fixes it - it determines what's the actual alignment of the > > incoming bio and then makes sure that the outgoing bio in > > integrity_recheck has the same alignment. > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") > > Cc: stable@vger.kernel.org > > > > --- > > drivers/md/dm-integrity.c | 12 ++++++++++-- > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > Index: linux-2.6/drivers/md/dm-integrity.c > > =================================================================== > > --- linux-2.6.orig/drivers/md/dm-integrity.c 2024-03-21 14:25:45.000000000 +0100 > > +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100 > > @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s > > struct bio_vec bv; > > sector_t sector, logical_sector, area, offset; > > struct page *page; > > - void *buffer; > > > > get_area_and_offset(ic, dio->range.logical_sector, &area, &offset); > > dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset, > > @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s > > logical_sector = dio->range.logical_sector; > > > > page = mempool_alloc(&ic->recheck_pool, GFP_NOIO); > > - buffer = page_to_virt(page); > > > > __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) { > > unsigned pos = 0; > > > > do { > > + sector_t alignment; > > char *mem; > > + char *buffer = page_to_virt(page); > > int r; > > struct dm_io_request io_req; > > struct dm_io_region io_loc; > > @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s > > io_loc.sector = sector; > > io_loc.count = ic->sectors_per_block; > > > > + /* Align the bio to logical block size */ > > + alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT); > > + alignment &= -alignment; > > The above is less readable, :-( It isolates the lowest bit from dio->range.logical_sector, bio_sectors(bio) and (PAGE_SIZE >> SECTOR_SHIFT). See for example this https://www.felixcloutier.com/x86/blsi > > + io_loc.sector = round_down(io_loc.sector, alignment); > > + io_loc.count += sector - io_loc.sector; > > + buffer += (sector - io_loc.sector) << SECTOR_SHIFT; > > + io_loc.count = round_up(io_loc.count, alignment); > > I feel the above code isn't very reliable, what we need actually is to > make sure that io's sector & size is aligned with dm's > bdev_logical_block_size(bdev). I thought about using bdev_logical_block_size. But it may be wrong if the device stack is reconfigured. So, I concluded that taking the alignment from the bio would be better. > Yeah, so far the max logical block size is 4k, but it may be increased > in future and you can see the recent lsfmm proposal, so can we force it to be > aligned with bdev_logical_block_size(bdev) here? > > Also can the above change work efficiently in case of 64K PAGE_SIZE? It doesn't work efficiently at all - this piece of code is only run in a pathological case where the user writes into a buffer while reading it (or when he reads multiple blocks into the same buffer), so I optimized it for size, not for performance. But yes, it works with 64K PAGE_SIZE. > Thanks, > Ming Mikulas
On Fri, Mar 22, 2024 at 11:30:33AM +0100, Mikulas Patocka wrote: > > > On Fri, 22 Mar 2024, Ming Lei wrote: > > > On Thu, Mar 21, 2024 at 05:48:45PM +0100, Mikulas Patocka wrote: > > > It may be possible to set up dm-integrity with smaller sector size than > > > the logical sector size of the underlying device. In this situation, > > > dm-integrity guarantees that the outgoing bios have the same alignment as > > > incoming bios (so, if you create a filesystem with 4k block size, > > > dm-integrity would send 4k-aligned bios to the underlying device). > > > > > > This guarantee was broken when integrity_recheck was implemented. > > > integrity_recheck sends bio that is aligned to ic->sectors_per_block. So > > > if we set up integrity with 512-byte sector size on a device with logical > > > block size 4k, we would be sending unaligned bio. This triggered a bug in > > > one of our internal tests. > > > > > > This commit fixes it - it determines what's the actual alignment of the > > > incoming bio and then makes sure that the outgoing bio in > > > integrity_recheck has the same alignment. > > > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > > Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") > > > Cc: stable@vger.kernel.org > > > > > > --- > > > drivers/md/dm-integrity.c | 12 ++++++++++-- > > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > > > Index: linux-2.6/drivers/md/dm-integrity.c > > > =================================================================== > > > --- linux-2.6.orig/drivers/md/dm-integrity.c 2024-03-21 14:25:45.000000000 +0100 > > > +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100 > > > @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s > > > struct bio_vec bv; > > > sector_t sector, logical_sector, area, offset; > > > struct page *page; > > > - void *buffer; > > > > > > get_area_and_offset(ic, dio->range.logical_sector, &area, &offset); > > > dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset, > > > @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s > > > logical_sector = dio->range.logical_sector; > > > > > > page = mempool_alloc(&ic->recheck_pool, GFP_NOIO); > > > - buffer = page_to_virt(page); > > > > > > __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) { > > > unsigned pos = 0; > > > > > > do { > > > + sector_t alignment; > > > char *mem; > > > + char *buffer = page_to_virt(page); > > > int r; > > > struct dm_io_request io_req; > > > struct dm_io_region io_loc; > > > @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s > > > io_loc.sector = sector; > > > io_loc.count = ic->sectors_per_block; > > > > > > + /* Align the bio to logical block size */ > > > + alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT); > > > + alignment &= -alignment; > > > > The above is less readable, :-( > > It isolates the lowest bit from dio->range.logical_sector, > bio_sectors(bio) and (PAGE_SIZE >> SECTOR_SHIFT). > > See for example this https://www.felixcloutier.com/x86/blsi Fine, but I have to say such usage isn't popular. > > > > + io_loc.sector = round_down(io_loc.sector, alignment); > > > + io_loc.count += sector - io_loc.sector; > > > + buffer += (sector - io_loc.sector) << SECTOR_SHIFT; > > > + io_loc.count = round_up(io_loc.count, alignment); > > > > I feel the above code isn't very reliable, what we need actually is to > > make sure that io's sector & size is aligned with dm's > > bdev_logical_block_size(bdev). > > I thought about using bdev_logical_block_size. But it may be wrong if the > device stack is reconfigured. So, I concluded that taking the alignment > from the bio would be better. If logical block becomes mismatched by reconfiguration, the whole DM stack can't work: - at the beginning, DM is over NVMe(512 bs), DM & NVMe lbs is 512 - later, nvme is reconfigured and its lbs becomes 4k, but DM's lbs can't be updated - then unaligned IO is submitted to NVMe So DM _never_ works with mis-matched logical block size because of reconfigure, and same with MD. > > > Yeah, so far the max logical block size is 4k, but it may be increased > > in future and you can see the recent lsfmm proposal, so can we force it to be > > aligned with bdev_logical_block_size(bdev) here? > > > > Also can the above change work efficiently in case of 64K PAGE_SIZE? > > It doesn't work efficiently at all - this piece of code is only run in a > pathological case where the user writes into a buffer while reading it (or > when he reads multiple blocks into the same buffer), so I optimized it for > size, not for performance. > > But yes, it works with 64K PAGE_SIZE. Fine, but I still think PAGE_SIZE is hard to follow than logical block size. Thanks, Ming
On Fri, 22 Mar 2024, Ming Lei wrote: > If logical block becomes mismatched by reconfiguration, the whole DM stack can't work: > > - at the beginning, DM is over NVMe(512 bs), DM & NVMe lbs is 512 > - later, nvme is reconfigured and its lbs becomes 4k, but DM's lbs can't > be updated > - then unaligned IO is submitted to NVMe > > So DM _never_ works with mis-matched logical block size because of > reconfigure, and same with MD. It can work. If a filesystem has 4k block size and it is used on a block device with misconfigured with 512-byte sector size, it will work. The buffer cache also prefers using page-sized blocks rather than logical block size, so it will work too. > > But yes, it works with 64K PAGE_SIZE. > > Fine, but I still think PAGE_SIZE is hard to follow than logical block > size. > > Thanks, > Ming The temporary buffer has one page, so I restricted the alignment to PAGE_SIZE. logical block size is always <= PAGE_SIZE. Mikulas
On Fri, Mar 22, 2024 at 01:32:30PM +0100, Mikulas Patocka wrote: > > > On Fri, 22 Mar 2024, Ming Lei wrote: > > > If logical block becomes mismatched by reconfiguration, the whole DM stack can't work: > > > > - at the beginning, DM is over NVMe(512 bs), DM & NVMe lbs is 512 > > - later, nvme is reconfigured and its lbs becomes 4k, but DM's lbs can't > > be updated > > - then unaligned IO is submitted to NVMe > > > > So DM _never_ works with mis-matched logical block size because of > > reconfigure, and same with MD. > > It can work. If a filesystem has 4k block size and it is used on a block Here 'work' means it supports any allowed block size, instead of 4KB only or if balabala... The DM device advertises its lbs as 512, application & FS should be free to read/write 512 aligned data, right? Thanks, Ming
On Fri, Mar 22 2024 at 8:03P -0400, Ming Lei <ming.lei@redhat.com> wrote: > On Fri, Mar 22, 2024 at 11:30:33AM +0100, Mikulas Patocka wrote: > > > > > > On Fri, 22 Mar 2024, Ming Lei wrote: > > > > > On Thu, Mar 21, 2024 at 05:48:45PM +0100, Mikulas Patocka wrote: > > > > It may be possible to set up dm-integrity with smaller sector size than > > > > the logical sector size of the underlying device. In this situation, > > > > dm-integrity guarantees that the outgoing bios have the same alignment as > > > > incoming bios (so, if you create a filesystem with 4k block size, > > > > dm-integrity would send 4k-aligned bios to the underlying device). > > > > > > > > This guarantee was broken when integrity_recheck was implemented. > > > > integrity_recheck sends bio that is aligned to ic->sectors_per_block. So > > > > if we set up integrity with 512-byte sector size on a device with logical > > > > block size 4k, we would be sending unaligned bio. This triggered a bug in > > > > one of our internal tests. > > > > > > > > This commit fixes it - it determines what's the actual alignment of the > > > > incoming bio and then makes sure that the outgoing bio in > > > > integrity_recheck has the same alignment. > > > > > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > > > Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") > > > > Cc: stable@vger.kernel.org > > > > > > > > --- > > > > drivers/md/dm-integrity.c | 12 ++++++++++-- > > > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > > > > > Index: linux-2.6/drivers/md/dm-integrity.c > > > > =================================================================== > > > > --- linux-2.6.orig/drivers/md/dm-integrity.c 2024-03-21 14:25:45.000000000 +0100 > > > > +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100 > > > > @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s > > > > struct bio_vec bv; > > > > sector_t sector, logical_sector, area, offset; > > > > struct page *page; > > > > - void *buffer; > > > > > > > > get_area_and_offset(ic, dio->range.logical_sector, &area, &offset); > > > > dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset, > > > > @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s > > > > logical_sector = dio->range.logical_sector; > > > > > > > > page = mempool_alloc(&ic->recheck_pool, GFP_NOIO); > > > > - buffer = page_to_virt(page); > > > > > > > > __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) { > > > > unsigned pos = 0; > > > > > > > > do { > > > > + sector_t alignment; > > > > char *mem; > > > > + char *buffer = page_to_virt(page); > > > > int r; > > > > struct dm_io_request io_req; > > > > struct dm_io_region io_loc; > > > > @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s > > > > io_loc.sector = sector; > > > > io_loc.count = ic->sectors_per_block; > > > > > > > > + /* Align the bio to logical block size */ > > > > + alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT); > > > > + alignment &= -alignment; > > > > > > The above is less readable, :-( > > > > It isolates the lowest bit from dio->range.logical_sector, > > bio_sectors(bio) and (PAGE_SIZE >> SECTOR_SHIFT). > > > > See for example this https://www.felixcloutier.com/x86/blsi > > Fine, but I have to say such usage isn't popular. Yeah, at a minimum it should have a comment explaining the optimization of combining and then getting lsbit. The non-ffs() optimized gcd() uses the same but comments it: /* Isolate lsbit of r */ r &= -r; > > > > + io_loc.sector = round_down(io_loc.sector, alignment); > > > > + io_loc.count += sector - io_loc.sector; > > > > + buffer += (sector - io_loc.sector) << SECTOR_SHIFT; > > > > + io_loc.count = round_up(io_loc.count, alignment); > > > > > > I feel the above code isn't very reliable, what we need actually is to > > > make sure that io's sector & size is aligned with dm's > > > bdev_logical_block_size(bdev). > > > > I thought about using bdev_logical_block_size. But it may be wrong if the > > device stack is reconfigured. So, I concluded that taking the alignment > > from the bio would be better. > > If logical block becomes mismatched by reconfiguration, the whole DM stack can't work: > > - at the beginning, DM is over NVMe(512 bs), DM & NVMe lbs is 512 > - later, nvme is reconfigured and its lbs becomes 4k, but DM's lbs can't > be updated > - then unaligned IO is submitted to NVMe > > So DM _never_ works with mis-matched logical block size because of > reconfigure, and same with MD. At some point we need to trust the queue_limits and DM takes considerable pain to validate the alignment when a dm-table is (re)loaded. But we could get into problems with deep(er) device stacks where an underlying DM device is reloaded but the upper level devices' queue_limits aren't restacked. Thankfully, in practice that generally doesn't occur! If it were to become a prevalent issue DM would need to grow validation that DM devices aren't changing their logic_block_size and overall alignment during runtime. > > > Yeah, so far the max logical block size is 4k, but it may be increased > > > in future and you can see the recent lsfmm proposal, so can we force it to be > > > aligned with bdev_logical_block_size(bdev) here? > > > > > > Also can the above change work efficiently in case of 64K PAGE_SIZE? > > > > It doesn't work efficiently at all - this piece of code is only run in a > > pathological case where the user writes into a buffer while reading it (or > > when he reads multiple blocks into the same buffer), so I optimized it for > > size, not for performance. > > > > But yes, it works with 64K PAGE_SIZE. > > Fine, but I still think PAGE_SIZE is hard to follow than logical block > size. Thanks for your review. I shared many of your review concerns (the math isn't apporachable, and why not just use logical_block_size in queue_limits?). That said, I'm OK with the code as-is because it has been tested to fix the reported misalignment issue. BUT, I would like to see follow-on cleanup in a separate commit, at a minimum there should be some helpful comments (to address the math and assumptions made, e.g. this recheck code is not fast path). Mike
Index: linux-2.6/drivers/md/dm-integrity.c =================================================================== --- linux-2.6.orig/drivers/md/dm-integrity.c 2024-03-21 14:25:45.000000000 +0100 +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100 @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s struct bio_vec bv; sector_t sector, logical_sector, area, offset; struct page *page; - void *buffer; get_area_and_offset(ic, dio->range.logical_sector, &area, &offset); dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset, @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s logical_sector = dio->range.logical_sector; page = mempool_alloc(&ic->recheck_pool, GFP_NOIO); - buffer = page_to_virt(page); __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) { unsigned pos = 0; do { + sector_t alignment; char *mem; + char *buffer = page_to_virt(page); int r; struct dm_io_request io_req; struct dm_io_region io_loc; @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s io_loc.sector = sector; io_loc.count = ic->sectors_per_block; + /* Align the bio to logical block size */ + alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT); + alignment &= -alignment; + io_loc.sector = round_down(io_loc.sector, alignment); + io_loc.count += sector - io_loc.sector; + buffer += (sector - io_loc.sector) << SECTOR_SHIFT; + io_loc.count = round_up(io_loc.count, alignment); + r = dm_io(&io_req, 1, &io_loc, NULL); if (unlikely(r)) { dio->bi_status = errno_to_blk_status(r);
It may be possible to set up dm-integrity with smaller sector size than the logical sector size of the underlying device. In this situation, dm-integrity guarantees that the outgoing bios have the same alignment as incoming bios (so, if you create a filesystem with 4k block size, dm-integrity would send 4k-aligned bios to the underlying device). This guarantee was broken when integrity_recheck was implemented. integrity_recheck sends bio that is aligned to ic->sectors_per_block. So if we set up integrity with 512-byte sector size on a device with logical block size 4k, we would be sending unaligned bio. This triggered a bug in one of our internal tests. This commit fixes it - it determines what's the actual alignment of the incoming bio and then makes sure that the outgoing bio in integrity_recheck has the same alignment. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") Cc: stable@vger.kernel.org --- drivers/md/dm-integrity.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)