Message ID | 20201201160709.31748-1-snitzer@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] block: use gcd() to fix chunk_sectors limit stacking
On Tue, Dec 1, 2020 at 11:07 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> commit 22ada802ede8 ("block: use lcm_not_zero() when stacking
> chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
> reflect the most limited of all devices in the IO stack.
>
> Otherwise malformed IO may result. E.g.: prior to this fix,
> ->chunk_sectors = lcm_not_zero(8, 128) would result in
> blk_max_size_offset() splitting IO at 128 sectors rather than the
> required more restrictive 8 sectors.
>
> And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be
> non-power-of-2") care must be taken to properly stack chunk_sectors to
> be compatible with the possibility that a non-power-of-2 chunk_sectors
> may be stacked. This is why gcd() is used instead of reverting back
> to using min_not_zero().
>
> Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
> Fixes: 07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
> Cc: stable@vger.kernel.org
> Reported-by: John Dorminy <jdorminy@redhat.com>
> Reported-by: Bruce Johnston <bjohnsto@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-settings.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> v2: use gcd(), instead of min_not_zero(), as suggested by John Dorminy
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 9741d1d83e98..659cdb8a07fe 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -547,7 +547,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>
>  	t->io_min = max(t->io_min, b->io_min);
>  	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> -	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
> +
> +	/* Set non-power-of-2 compatible chunk_sectors boundary */
> +	if (b->chunk_sectors)
> +		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>
>  	/* Physical block size a multiple of the logical block size? */
>  	if (t->physical_block_size & (t->logical_block_size - 1)) {
> --
> 2.15.0

Reviewed-by: John Dorminy <jdorminy@redhat.com>

Thanks!
On 12/1/20 9:07 AM, Mike Snitzer wrote: > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > reflect the most limited of all devices in the IO stack. > > Otherwise malformed IO may result. E.g.: prior to this fix, > ->chunk_sectors = lcm_not_zero(8, 128) would result in > blk_max_size_offset() splitting IO at 128 sectors rather than the > required more restrictive 8 sectors. > > And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be > non-power-of-2") care must be taken to properly stack chunk_sectors to > be compatible with the possibility that a non-power-of-2 chunk_sectors > may be stacked. This is why gcd() is used instead of reverting back > to using min_not_zero(). Applied for 5.10, thanks.
Mike, > And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be > non-power-of-2") care must be taken to properly stack chunk_sectors to > be compatible with the possibility that a non-power-of-2 chunk_sectors > may be stacked. This is why gcd() is used instead of reverting back > to using min_not_zero(). This approach looks fine to me. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > reflect the most limited of all devices in the IO stack. > > Otherwise malformed IO may result. E.g.: prior to this fix, > ->chunk_sectors = lcm_not_zero(8, 128) would result in > blk_max_size_offset() splitting IO at 128 sectors rather than the > required more restrictive 8 sectors. What is the user-visible result of splitting IO at 128 sectors? I understand it isn't related with correctness, because the underlying queue can split by its own chunk_sectors limit further. So is the issue too many further-splitting on queue with chunk_sectors 8? then CPU utilization is increased? Or other issue? > > And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be > non-power-of-2") care must be taken to properly stack chunk_sectors to > be compatible with the possibility that a non-power-of-2 chunk_sectors > may be stacked. This is why gcd() is used instead of reverting back > to using min_not_zero(). I guess gcd() won't be better because gcd(a,b) is <= min(a, b), so bio size is decreased much with gcd(a, b), and IO performance should be affected. Maybe worse than min_not_zero(a, b) which is often > gcd(a, b). Thanks, Ming
On Wed, Dec 02 2020 at 10:26pm -0500, Ming Lei <ming.lei@redhat.com> wrote: > On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > > reflect the most limited of all devices in the IO stack. > > > > Otherwise malformed IO may result. E.g.: prior to this fix, > > ->chunk_sectors = lcm_not_zero(8, 128) would result in > > blk_max_size_offset() splitting IO at 128 sectors rather than the > > required more restrictive 8 sectors. > > What is the user-visible result of splitting IO at 128 sectors? The VDO dm target fails because it requires IO it receives to be split as it advertised (8 sectors). > I understand it isn't related with correctness, because the underlying > queue can split by its own chunk_sectors limit further. So is the issue > too many further-splitting on queue with chunk_sectors 8? then CPU > utilization is increased? Or other issue? No, this is all about correctness. Seems you're confining the definition of the possible stacking so that the top-level device isn't allowed to have its own hard requirements on IO sizes it sends to its internal implementation. Just because the underlying device can split further doesn't mean that the top-level virtual driver can service larger IO sizes (not if the chunk_sectors stacking throws away the hint the virtual driver provided because it used lcm_not_zero). > > And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be > > non-power-of-2") care must be taken to properly stack chunk_sectors to > > be compatible with the possibility that a non-power-of-2 chunk_sectors > > may be stacked. This is why gcd() is used instead of reverting back > > to using min_not_zero(). > > I guess gcd() won't be better because gcd(a,b) is <= min(a, b), so bio > size is decreased much with gcd(a, b), and IO performance should be affected. 
> Maybe worse than min_not_zero(a, b) which is often > gcd(a, b). Doesn't matter, it is about correctness. We cannot stack up a chunk_sectors that violates requirements of a given layer. Mike
On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > On Wed, Dec 02 2020 at 10:26pm -0500, > Ming Lei <ming.lei@redhat.com> wrote: > > > I understand it isn't related with correctness, because the underlying > > queue can split by its own chunk_sectors limit further. So is the issue > > too many further-splitting on queue with chunk_sectors 8? then CPU > > utilization is increased? Or other issue? > > No, this is all about correctness. > > Seems you're confining the definition of the possible stacking so that > the top-level device isn't allowed to have its own hard requirements on > IO sizes it sends to its internal implementation. Just because the > underlying device can split further doesn't mean that the top-level > virtual driver can service larger IO sizes (not if the chunk_sectors > stacking throws away the hint the virtual driver provided because it > used lcm_not_zero). I may be missing something obvious here, but if the lower layers split to their desired boundary already, why does this limit need to stack? Won't it also work if each layer sets their desired chunk_sectors without considering their lower layers? The commit that initially stacked chunk_sectors doesn't provide any explanation.
On Thu, Dec 03 2020 at 11:27am -0500, Keith Busch <kbusch@kernel.org> wrote: > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > On Wed, Dec 02 2020 at 10:26pm -0500, > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > I understand it isn't related with correctness, because the underlying > > > queue can split by its own chunk_sectors limit further. So is the issue > > > too many further-splitting on queue with chunk_sectors 8? then CPU > > > utilization is increased? Or other issue? > > > > No, this is all about correctness. > > > > Seems you're confining the definition of the possible stacking so that > > the top-level device isn't allowed to have its own hard requirements on > > IO sizes it sends to its internal implementation. Just because the > > underlying device can split further doesn't mean that the top-level > > virtual driver can service larger IO sizes (not if the chunk_sectors > > stacking throws away the hint the virtual driver provided because it > > used lcm_not_zero). > > I may be missing something obvious here, but if the lower layers split > to their desired boundary already, why does this limit need to stack? The problematic scenario is when the topmost layer, or layers, are the more constrained. _That_ is why the top-level's chunk_sectors limit cannot be relaxed. 
For example (in extreme where chunk_sectors is stacked via gcd):

dm VDO target (chunk_sectors=4K)
on dm-thin (ideally chunk_sectors=1280K, reality chunk_sectors=128K)
on 10+2 RAID6 (chunk_sectors=128K, io_opt=1280K)
on raid members (chunk_sectors=0)

Results in the following bottom-up blk_stack_limits() stacking:

gcd(128K, 0) = 128K -> but MD just sets chunk_sectors, no stacking is done afaik
gcd(1280K, 128K) = 128K -> this one hurts dm-thin, needless splitting
gcd(4K, 128K) = 4K -> vdo _must_ receive 4K IOs, hurts but "this is the way" ;)

So this is one extreme that shows stacking chunk_sectors is _not_ helpful (if the resulting chunk_sectors were actually used as basis for splitting). Better for each layer to just impose its own chunk_sectors without concern for the layers below.

Think I'd be fine with block core removing the chunk_sectors stacking from blk_stack_limits()... (and as you see below, I've been forced to revert to _not_ using stacked chunk_sectors based splitting in DM)

> Won't it also work if each layer sets their desired chunk_sectors
> without considering their lower layers? The commit that initially
> stacked chunk_sectors doesn't provide any explanation.

Yes, I think it would work. The current stacking doesn't have the luxury of knowing which layer a blk_stack_limits() maps to. BUT within a layer chunk_sectors really does need to be compatible/symbiotic. So it is unfortunately all or nothing as you build up the stack. 
And that all-or-nothing stacking of chunk_sectors is why I've now (just last night, based on further review by jdorminy) had to punt on using stacked chunk_sectors and revert DM back to doing its own fine-grained (and varied) splitting on a per DM target basis, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.10-rcX&id=6bb38bcc33bf3093c08bd1b71e4f20c82bb60dd1

Kind of depressing that I went so far down the rabbit hole, of wanting to lean on block core, that I lost sight of an important "tenet of DM":

+	 * Does the target need to split IO even further?
+	 * - varied (per target) IO splitting is a tenet of DM; this
+	 *   explains why stacked chunk_sectors based splitting via
+	 *   blk_max_size_offset() isn't possible here.

And it is because of this that DM is forced to lean on human creation of an optimal IO stack.. which is prone to human error when a particular thinp "blocksize" is selected, etc.

Mike
On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > On Wed, Dec 02 2020 at 10:26pm -0500, > Ming Lei <ming.lei@redhat.com> wrote: > > > On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > > > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > > > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > > > reflect the most limited of all devices in the IO stack. > > > > > > Otherwise malformed IO may result. E.g.: prior to this fix, > > > ->chunk_sectors = lcm_not_zero(8, 128) would result in > > > blk_max_size_offset() splitting IO at 128 sectors rather than the > > > required more restrictive 8 sectors. > > > > What is the user-visible result of splitting IO at 128 sectors? > > The VDO dm target fails because it requires IO it receives to be split > as it advertised (8 sectors). OK, looks VDO's chunk_sector limit is one hard constraint, even though it is one DM device, so I guess you are talking about DM over VDO? Another reason should be that VDO doesn't use blk_queue_split(), otherwise it won't be a trouble, right? Frankly speaking, if the stacking driver/device has its own hard queue limit like normal hardware drive, the driver should be responsible for the splitting. > > > I understand it isn't related with correctness, because the underlying > > queue can split by its own chunk_sectors limit further. So is the issue > > too many further-splitting on queue with chunk_sectors 8? then CPU > > utilization is increased? Or other issue? > > No, this is all about correctness. > > Seems you're confining the definition of the possible stacking so that > the top-level device isn't allowed to have its own hard requirements on I just don't know this story, thanks for your clarification. As I mentioned above, if the stacking driver has its own hard queue limit, it should be the driver's responsibility to respect it via blk_queue_split() or whatever. Thanks, Ming
On Thu, Dec 03, 2020 at 08:27:38AM -0800, Keith Busch wrote: > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > On Wed, Dec 02 2020 at 10:26pm -0500, > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > I understand it isn't related with correctness, because the underlying > > > queue can split by its own chunk_sectors limit further. So is the issue > > > too many further-splitting on queue with chunk_sectors 8? then CPU > > > utilization is increased? Or other issue? > > > > No, this is all about correctness. > > > > Seems you're confining the definition of the possible stacking so that > > the top-level device isn't allowed to have its own hard requirements on > > IO sizes it sends to its internal implementation. Just because the > > underlying device can split further doesn't mean that the top-level > > virtual driver can service larger IO sizes (not if the chunk_sectors > > stacking throws away the hint the virtual driver provided because it > > used lcm_not_zero). > > I may be missing something obvious here, but if the lower layers split > to their desired boundary already, why does this limit need to stack? > Won't it also work if each layer sets their desired chunk_sectors > without considering their lower layers? The commit that initially > stacked chunk_sectors doesn't provide any explanation. There could be several reasons: 1) some limits have to be stacking, such as logical block size, because lower layering may not handle un-aligned IO 2) performance reason, if every limits are stacked on topmost layer, in theory IO just needs to be splitted in top layer, and not need to be splitted further from all lower layer at all. But there should be exceptions in unusual case, such as, lowering queue's limit changed after the stacking limits are setup. 3) history reason, bio splitting is much younger than stacking queue limits. Maybe others? Thanks, Ming
On Thu, Dec 03 2020 at 8:12pm -0500, Ming Lei <ming.lei@redhat.com> wrote: > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > On Wed, Dec 02 2020 at 10:26pm -0500, > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > > > > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > > > > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > > > > reflect the most limited of all devices in the IO stack. > > > > > > > > Otherwise malformed IO may result. E.g.: prior to this fix, > > > > ->chunk_sectors = lcm_not_zero(8, 128) would result in > > > > blk_max_size_offset() splitting IO at 128 sectors rather than the > > > > required more restrictive 8 sectors. > > > > > > What is the user-visible result of splitting IO at 128 sectors? > > > > The VDO dm target fails because it requires IO it receives to be split > > as it advertised (8 sectors). > > OK, looks VDO's chunk_sector limit is one hard constraint, even though it > is one DM device, so I guess you are talking about DM over VDO? > > Another reason should be that VDO doesn't use blk_queue_split(), otherwise it > won't be a trouble, right? > > Frankly speaking, if the stacking driver/device has its own hard queue limit > like normal hardware drive, the driver should be responsible for the splitting. DM core does the splitting for VDO (just like any other DM target). In 5.9 I updated DM to use chunk_sectors, use blk_stack_limits() stacking of it, and also use blk_max_size_offset(). But all that block core code has shown itself to be too rigid for DM. I tried to force the issue by stacking DM targets' ti->max_io_len with chunk_sectors. But really I'd need to be able to pass in the per-target max_io_len to blk_max_size_offset() to salvage using it. Stacking chunk_sectors seems ill-conceived. One size-fits-all splitting is too rigid. 
> > > I understand it isn't related with correctness, because the underlying > > > queue can split by its own chunk_sectors limit further. So is the issue > > > too many further-splitting on queue with chunk_sectors 8? then CPU > > > utilization is increased? Or other issue? > > > > No, this is all about correctness. > > > > Seems you're confining the definition of the possible stacking so that > > the top-level device isn't allowed to have its own hard requirements on > > I just don't know this story, thanks for your clarification. > > As I mentioned above, if the stacking driver has its own hard queue > limit, it should be the driver's responsibility to respect it via > blk_queue_split() or whatever. Again, DM does its own splitting... that aspect of it isn't an issue. The problem is the basis for splitting cannot be the stacked up chunk_sectors. Mike
On Thu, Dec 03 2020 at 8:45pm -0500, Ming Lei <ming.lei@redhat.com> wrote: > On Thu, Dec 03, 2020 at 08:27:38AM -0800, Keith Busch wrote: > > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > > On Wed, Dec 02 2020 at 10:26pm -0500, > > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > I understand it isn't related with correctness, because the underlying > > > > queue can split by its own chunk_sectors limit further. So is the issue > > > > too many further-splitting on queue with chunk_sectors 8? then CPU > > > > utilization is increased? Or other issue? > > > > > > No, this is all about correctness. > > > > > > Seems you're confining the definition of the possible stacking so that > > > the top-level device isn't allowed to have its own hard requirements on > > > IO sizes it sends to its internal implementation. Just because the > > > underlying device can split further doesn't mean that the top-level > > > virtual driver can service larger IO sizes (not if the chunk_sectors > > > stacking throws away the hint the virtual driver provided because it > > > used lcm_not_zero). > > > > I may be missing something obvious here, but if the lower layers split > > to their desired boundary already, why does this limit need to stack? > > Won't it also work if each layer sets their desired chunk_sectors > > without considering their lower layers? The commit that initially > > stacked chunk_sectors doesn't provide any explanation. > > There could be several reasons: > > 1) some limits have to be stacking, such as logical block size, because > lower layering may not handle un-aligned IO > > 2) performance reason, if every limits are stacked on topmost layer, in > theory IO just needs to be splitted in top layer, and not need to be > splitted further from all lower layer at all. But there should be exceptions > in unusual case, such as, lowering queue's limit changed after the stacking > limits are setup. 
>
> 3) history reason, bio splitting is much younger than stacking queue
> limits.
>
> Maybe others?

Hannes didn't actually justify why he added chunk_sectors to blk_stack_limits:

commit 987b3b26eb7b19960160505faf9b2f50ae77e14d
Author: Hannes Reinecke <hare@suse.de>
Date:   Tue Oct 18 15:40:31 2016 +0900

    block: update chunk_sectors in blk_stack_limits()

    Signed-off-by: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
    Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

Likely felt it needed for zoned or NVMe devices.. dunno.

But given how we now have a model where block core, or DM core, will split as needed I don't think normalizing chunk_sectors (to the degree full use of blk_stack_limits does) and then using it as basis for splitting makes a lot of sense.

Mike
On Thu, Dec 03, 2020 at 09:03:43PM -0500, Mike Snitzer wrote: > On Thu, Dec 03 2020 at 8:12pm -0500, > Ming Lei <ming.lei@redhat.com> wrote: > > > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > > On Wed, Dec 02 2020 at 10:26pm -0500, > > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > > > > > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > > > > > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > > > > > reflect the most limited of all devices in the IO stack. > > > > > > > > > > Otherwise malformed IO may result. E.g.: prior to this fix, > > > > > ->chunk_sectors = lcm_not_zero(8, 128) would result in > > > > > blk_max_size_offset() splitting IO at 128 sectors rather than the > > > > > required more restrictive 8 sectors. > > > > > > > > What is the user-visible result of splitting IO at 128 sectors? > > > > > > The VDO dm target fails because it requires IO it receives to be split > > > as it advertised (8 sectors). > > > > OK, looks VDO's chunk_sector limit is one hard constraint, even though it > > is one DM device, so I guess you are talking about DM over VDO? > > > > Another reason should be that VDO doesn't use blk_queue_split(), otherwise it > > won't be a trouble, right? > > > > Frankly speaking, if the stacking driver/device has its own hard queue limit > > like normal hardware drive, the driver should be responsible for the splitting. > > DM core does the splitting for VDO (just like any other DM target). > In 5.9 I updated DM to use chunk_sectors, use blk_stack_limits() > stacking of it, and also use blk_max_size_offset(). > > But all that block core code has shown itself to be too rigid for DM. I > tried to force the issue by stacking DM targets' ti->max_io_len with > chunk_sectors. But really I'd need to be able to pass in the per-target > max_io_len to blk_max_size_offset() to salvage using it. 
> > Stacking chunk_sectors seems ill-conceived. One size-fits-all splitting > is too rigid. DM/VDO knows exactly it is one hard chunk_sectors limit, and DM shouldn't play the stacking trick on VDO's chunk_sectors limit, should it? Thanks, Ming
On 2020/12/04 11:11, Mike Snitzer wrote: > On Thu, Dec 03 2020 at 8:45pm -0500, > Ming Lei <ming.lei@redhat.com> wrote: > >> On Thu, Dec 03, 2020 at 08:27:38AM -0800, Keith Busch wrote: >>> On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: >>>> On Wed, Dec 02 2020 at 10:26pm -0500, >>>> Ming Lei <ming.lei@redhat.com> wrote: >>>> >>>>> I understand it isn't related with correctness, because the underlying >>>>> queue can split by its own chunk_sectors limit further. So is the issue >>>>> too many further-splitting on queue with chunk_sectors 8? then CPU >>>>> utilization is increased? Or other issue? >>>> >>>> No, this is all about correctness. >>>> >>>> Seems you're confining the definition of the possible stacking so that >>>> the top-level device isn't allowed to have its own hard requirements on >>>> IO sizes it sends to its internal implementation. Just because the >>>> underlying device can split further doesn't mean that the top-level >>>> virtual driver can service larger IO sizes (not if the chunk_sectors >>>> stacking throws away the hint the virtual driver provided because it >>>> used lcm_not_zero). >>> >>> I may be missing something obvious here, but if the lower layers split >>> to their desired boundary already, why does this limit need to stack? >>> Won't it also work if each layer sets their desired chunk_sectors >>> without considering their lower layers? The commit that initially >>> stacked chunk_sectors doesn't provide any explanation. >> >> There could be several reasons: >> >> 1) some limits have to be stacking, such as logical block size, because >> lower layering may not handle un-aligned IO >> >> 2) performance reason, if every limits are stacked on topmost layer, in >> theory IO just needs to be splitted in top layer, and not need to be >> splitted further from all lower layer at all. But there should be exceptions >> in unusual case, such as, lowering queue's limit changed after the stacking >> limits are setup. 
>>
>> 3) history reason, bio splitting is much younger than stacking queue
>> limits.
>>
>> Maybe others?
>
> Hannes didn't actually justify why he added chunk_sectors to
> blk_stack_limits:
>
> commit 987b3b26eb7b19960160505faf9b2f50ae77e14d
> Author: Hannes Reinecke <hare@suse.de>
> Date:   Tue Oct 18 15:40:31 2016 +0900
>
>     block: update chunk_sectors in blk_stack_limits()
>
>     Signed-off-by: Hannes Reinecke <hare@suse.com>
>     Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
>     Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
>     Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com>
>     Signed-off-by: Jens Axboe <axboe@fb.com>
>
> Likely felt it needed for zoned or NVMe devices.. dunno.

For zoned drives, chunk_sectors indicates the zone size so the stacking propagates that value to the upper layer, if said layer is also zoned. If it is not zoned (e.g. dm-zoned device), chunk_sectors can actually be 0: it would be the responsibility of that layer to not issue BIOs that cross zone boundaries to the lower zoned layer.

Since all of this depends on the upper layer zoned model, removing the stacking of chunk_sectors would be fine, as long as the target initialization code sets it based on the drive model being exposed. E.g.:

* dm-linear on zoned dev will be zoned with the same zone size
* dm-zoned on zoned dev is not zoned, so chunk_sectors can be 0
* dm-linear on RAID volume can have chunk_sectors set to the underlying
  volume chunk_sectors (stripe size), if dm-linear is aligned to stripes.
* etc.

> But given how we now have a model where block core, or DM core, will
> split as needed I don't think normalizing chunk_sectors (to the degree
> full use of blk_stack_limits does) and than using it as basis for
> splitting makes a lot of sense.

For zoned dev, I agree. 
DM-core can set chunk_sectors for the DM device based on its zone model for DM driver that supports zones (dm-linear, dm-flakey and dm-zoned).
On Thu, Dec 03 2020 at 10:59pm -0500, Ming Lei <ming.lei@redhat.com> wrote: > On Thu, Dec 03, 2020 at 09:03:43PM -0500, Mike Snitzer wrote: > > On Thu, Dec 03 2020 at 8:12pm -0500, > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote: > > > > On Wed, Dec 02 2020 at 10:26pm -0500, > > > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > > > On Tue, Dec 01, 2020 at 11:07:09AM -0500, Mike Snitzer wrote: > > > > > > commit 22ada802ede8 ("block: use lcm_not_zero() when stacking > > > > > > chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must > > > > > > reflect the most limited of all devices in the IO stack. > > > > > > > > > > > > Otherwise malformed IO may result. E.g.: prior to this fix, > > > > > > ->chunk_sectors = lcm_not_zero(8, 128) would result in > > > > > > blk_max_size_offset() splitting IO at 128 sectors rather than the > > > > > > required more restrictive 8 sectors. > > > > > > > > > > What is the user-visible result of splitting IO at 128 sectors? > > > > > > > > The VDO dm target fails because it requires IO it receives to be split > > > > as it advertised (8 sectors). > > > > > > OK, looks VDO's chunk_sector limit is one hard constraint, even though it > > > is one DM device, so I guess you are talking about DM over VDO? > > > > > > Another reason should be that VDO doesn't use blk_queue_split(), otherwise it > > > won't be a trouble, right? > > > > > > Frankly speaking, if the stacking driver/device has its own hard queue limit > > > like normal hardware drive, the driver should be responsible for the splitting. > > > > DM core does the splitting for VDO (just like any other DM target). > > In 5.9 I updated DM to use chunk_sectors, use blk_stack_limits() > > stacking of it, and also use blk_max_size_offset(). > > > > But all that block core code has shown itself to be too rigid for DM. 
I > > tried to force the issue by stacking DM targets' ti->max_io_len with > > chunk_sectors. But really I'd need to be able to pass in the per-target > > max_io_len to blk_max_size_offset() to salvage using it. > > > > Stacking chunk_sectors seems ill-conceived. One size-fits-all splitting > > is too rigid. > > DM/VDO knows exactly it is one hard chunk_sectors limit, and DM shouldn't play > the stacking trick on VDO's chunk_sectors limit, should it? Feel like I already answered this in detail but... correct, DM cannot and should not use stacked chunk_sectors as basis for splitting. Up until 5.9, where I changed DM core to set and then use chunk_sectors for splitting via blk_max_size_offset(), DM only used its own per-target ti->max_io_len in drivers/md/dm.c:max_io_len(). But I reverted back to DM's pre-5.9 splitting in this stable@ fix that I'll be sending to Linus today for 5.10-rcX: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.10-rcX&id=6bb38bcc33bf3093c08bd1b71e4f20c82bb60dd1 DM is now back to pre-5.9 behavior where it doesn't even consider chunk_sectors for splitting (NOTE: dm-zoned sets ti->max_io_len though so it effectively achieves the same boundary splits via max_io_len). With that baseline established, what I'm now saying is: if DM, the most common limits stacking consumer, cannot benefit from stacked chunk_sectors then what stacked device does benefit? Could be block core's stacked chunk_sectors based splitting is good enough for others, just not yet seeing how. Feels like it predates blk_queue_split() and the stacking of chunk_sectors could/should be removed now. All said, I'm fine with leaving stacked chunk_sectors for others to care about... think I've raised enough awareness on this topic now ;) Mike
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 9741d1d83e98..659cdb8a07fe 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -547,7 +547,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,

 	t->io_min = max(t->io_min, b->io_min);
 	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
-	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
+
+	/* Set non-power-of-2 compatible chunk_sectors boundary */
+	if (b->chunk_sectors)
+		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);

 	/* Physical block size a multiple of the logical block size? */
 	if (t->physical_block_size & (t->logical_block_size - 1)) {
commit 22ada802ede8 ("block: use lcm_not_zero() when stacking
chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
reflect the most limited of all devices in the IO stack.

Otherwise malformed IO may result. E.g.: prior to this fix,
->chunk_sectors = lcm_not_zero(8, 128) would result in
blk_max_size_offset() splitting IO at 128 sectors rather than the
required more restrictive 8 sectors.

And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be
non-power-of-2") care must be taken to properly stack chunk_sectors to
be compatible with the possibility that a non-power-of-2 chunk_sectors
may be stacked. This is why gcd() is used instead of reverting back
to using min_not_zero().

Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
Fixes: 07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-settings.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

v2: use gcd(), instead of min_not_zero(), as suggested by John Dorminy