| Message ID | 20240418070015.27781-1-hare@kernel.org (mailing list archive) |
|---|---|
| State | Changes Requested |
| Series | scsi_lib: Align max_sectors to kb |
On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
> max_sectors can be modified via sysfs, but only in kb units.

Yes.

> Which leads to a misalignment on stacked devices if the original
> max_sector size is an odd number.

How?

Note that we really should not stack max_sectors anyway, as it's only
used for splitting in the lower device to start with.
On 4/18/24 09:03, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
>> max_sectors can be modified via sysfs, but only in kb units.
>
> Yes.
>
>> Which leads to a misalignment on stacked devices if the original
>> max_sector size is an odd number.
>
> How?

That's an issue we've been seeing during testing:
https://lore.kernel.org/dm-devel/7742003e19b5a49398067dc0c59dfa8ddeffc3d7.camel@suse.com/

While this can be fixed in userspace (Martin Wilck provided a patchset
to multipath-tools), what I find irritating is that we will always
display the max_sectors setting in kb, even if the actual value is not
kb aligned. _And_ we allow modifying that value (again in kb units),
which means that we are _never_ able to reset it to its original value.

I would vastly prefer to have the actual values displayed in sysfs, ie
either have a max_sectors_kb value in the block limits or a max_sectors
attribute in sysfs, but we really should avoid this 'shift by 1' when
displaying the value.

> Note that we really should not stack max_sectors anyway, as it's only
> used for splitting in the lower device to start with.

If that's the case, why don't we inhibit modification of max_sectors
on the lower devices?

Cheers,

Hannes
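[Editor's note] The 'shift by 1' round-trip described above can be modeled as follows. This is a simplified sketch, not the kernel's code: the real show/store helpers live in block/blk-sysfs.c and the function names here are illustrative. The point is only that the queue limit is kept in 512-byte sectors while sysfs exposes and accepts kilobytes, so an odd sector count cannot survive a read-then-write cycle.

```c
#include <assert.h>

/* Illustrative model of the max_sectors_kb sysfs round-trip.
 * The kernel-internal limit is counted in 512-byte sectors;
 * the sysfs attribute exposes it in kilobytes. */
static unsigned int show_max_sectors_kb(unsigned int max_sectors)
{
	return max_sectors >> 1;	/* sectors -> kb; the odd bit is lost */
}

static unsigned int store_max_sectors_kb(unsigned int kb)
{
	return kb << 1;			/* kb -> sectors */
}
```

With an odd limit of 2049 sectors, sysfs displays 1024 kb; writing 1024 back stores 2048 sectors, and the original value can never be restored from user space.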
On Thu, 2024-04-18 at 09:42 +0200, Hannes Reinecke wrote:
> On 4/18/24 09:03, Christoph Hellwig wrote:
> > On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
> > > max_sectors can be modified via sysfs, but only in kb units.
> >
> > Yes.
> >
> > > Which leads to a misalignment on stacked devices if the original
> > > max_sector size is an odd number.
> >
> > How?
>
> That's an issue we've been seeing during testing:
> https://lore.kernel.org/dm-devel/7742003e19b5a49398067dc0c59dfa8ddeffc3d7.camel@suse.com/
>
> While this can be fixed in userspace (Martin Wilck provided a patchset
> to multipath-tools), what I find irritating is that we will always
> display the max_sectors setting in kb, even if the actual value is not
> kb aligned.
> _And_ we allow to modify that value (again in kb units). Which means
> that we _never_ are able to reset it to its original value.

User space has no way to determine whether the actual max_sectors value
in the kernel is even or odd. By reading max_sectors_kb and writing it
back, one may actually change the kernel-internal value by rounding it
down to the next even number. This can cause breakage if the device
being changed is a multipath path device.

Wrt Hannes' patch: it would fix the issue on the kernel side, but user
space would still have no means to determine whether this patch has been
applied, except for checking the kernel version, which is unreliable.
For user space, it would be more helpful to add a "max_sectors" sysfs
attribute that exposes the actual value in blocks.

> > Note that we really should not stack max_sectors anyway, as it's
> > only used for splitting in the lower device to start with.
>
> If that's the case, why don't we inhibit the modification for
> max_sectors on the lower devices?

I vote for allowing changes to max_sectors_kb only for devices that
don't have anything stacked on top, even though my latest testing
indicates that it's only a real problem with request-based dm aka
multipath. After all, max_sectors only needs to be manipulated in rare
situations, and it is generally recommended to do this in a udev rule
early during boot.

Regards,
Martin
On Thu, Apr 18, 2024 at 09:42:06AM +0200, Hannes Reinecke wrote:
> While this can be fixed in userspace (Martin Wilck provided a patchset
> to multipath-tools), what I find irritating is that we will always
> display the max_sectors setting in kb, even if the actual value is not
> kb aligned.

The problem you are running into is exactly a problem of pointlessly
inheriting the value when we should not. Other secondary issues are
that we:

 a) should allow the modification in the granularity that we actually
    use internally, that is sectors
 b) apparently some drivers still use the silly BLK_SAFE_MAX_SECTORS
    limit, which is too low for just about any modern hardware.

>> Note that we really should not stack max_sectors anyway, as it's only
>> used for splitting in the lower device to start with.
>
> If that's the case, why don't we inhibit the modification for
> max_sectors on the lower devices?

Why would we? It makes absolutely no sense to inherit these limits;
the lower device will split anyway, which is very much the point of
the immutable bio_vec work.
On Thu, 2024-04-18 at 16:51 +0200, Christoph Hellwig wrote:
> > > Note that we really should not stack max_sectors anyway, as it's
> > > only used for splitting in the lower device to start with.
> >
> > If that's the case, why don't we inhibit the modification for
> > max_sectors on the lower devices?
>
> Why would we? It makes absolutely no sense to inherit these limits,
> the lower device will split anyway which is very much the point of
> the immutable bio_vec work.

Sorry, I don't follow. With (request-based) dm-multipath on top of
SCSI, we hit the "over max size limit" condition in
blk_insert_cloned_request() [1], which will cause IO to fail at the dm
level. So at least in this configuration, it's crucial that the upper
device inherit the lower device's limits.

Regards,
Martin

[1] https://elixir.bootlin.com/linux/latest/source/block/blk-mq.c#L3048
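[Editor's note] The condition Martin refers to can be reduced to the sketch below. This is an illustration, not the actual blk_insert_cloned_request() code, which also checks segment counts and fails the request with BLK_STS_IOERR rather than splitting it.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the "over max size limit" check hit when
 * request-based dm-multipath resubmits a cloned request to the
 * lower queue: a clone larger than the lower queue's limit is
 * rejected, not split. */
static bool over_max_size_limit(unsigned int rq_sectors,
				unsigned int queue_max_sectors)
{
	return rq_sectors > queue_max_sectors;
}
```

If the upper dm device does not inherit the lower limit, it can build a clone that is legal at its own level but trips this check on the path device, failing the IO at the dm level.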
On Thu, Apr 18, 2024 at 06:46:06PM +0200, Martin Wilck wrote:
> > Why would we? It makes absolutely no sense to inherit these limits,
> > the lower device will split anyway which is very much the point of
> > the immutable bio_vec work.
>
> Sorry, I don't follow. With (request-based) dm-multipath on top of
> SCSI, we hit the "over max size limit" condition in
> blk_insert_cloned_request() [1], which will cause IO to fail at the dm
> level. So at least in this configuration, it's crucial that the upper
> device inherit the lower device's limits.

Oh, indeed. Request-based multipath is different from everyone else.
I really wish we could spend the effort to convert it to bio-based
and remove this special case..
(added Mike, Ben, and dm-devel)

On Fri, 2024-04-19 at 08:20 +0200, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 06:46:06PM +0200, Martin Wilck wrote:
> > > Why would we? It makes absolutely no sense to inherit these
> > > limits, the lower device will split anyway which is very much
> > > the point of the immutable bio_vec work.
> >
> > Sorry, I don't follow. With (request-based) dm-multipath on top of
> > SCSI, we hit the "over max size limit" condition in
> > blk_insert_cloned_request() [1], which will cause IO to fail at the
> > dm level. So at least in this configuration, it's crucial that the
> > upper device inherit the lower device's limits.
>
> Oh, indeed. Request-based multipath is different from everyone else.
> I really wish we could spend the effort to convert it to bio-based
> and remove this special case..

Well, it's just a matter of setting "queue_mode" in the multipath
configuration. But request-based has been the default since kernel v3.0
(f40c67f0f7e2 ("dm mpath: change to be request based")). While we got
bio-based dm-multipath back in v4.8 (76e33fe4e2c4 ("dm mpath: reinstate
bio-based support")), making it the default or even removing
request-based dm (in order to free the block layer from limit-stacking
requirements) will obviously require a broad discussion.

Mike has pointed out in 76e33fe4e2c4 that many of the original reasons
for introducing request-based dm-multipath have been obsoleted by
faster hardware and improvements in the generic block layer.

I am open to discussion on this subject, but with my distribution
maintainer hat on, I don't expect the default to change for end
customers soon. Actually, with NVMe rising, I wonder whether a major
technology switch for dm-multipath (which is effectively SCSI multipath
these days) makes sense at this point in time.

Thanks,
Martin
On Fri, Apr 19, 2024 at 10:27:53AM +0200, Martin Wilck wrote:
> Well, it's just a matter of setting "queue_mode" in the multipath
> configuration.

Yes, and no. We'd probably at the very least want a blk_plug callback
to allow making decisions for entire large I/Os instead of on a per-bio
basis.

> Mike has pointed out in 76e33fe4e2c4 that many of the original reasons
> for introducing request-based dm-multipath have been obsoleted by
> faster hardware and improvements in the generic block layer.

Yes.

> I am open to discussion on this subject, but with my distribution
> maintainer hat on, I don't expect the default to change for end
> customers soon. Actually, with NVMe rising, I wonder whether a major
> technology switch for dm-multipath (which is effectively SCSI
> multipath these days) makes sense at this point in time.

Well, it'll take ages until anything upstream gets into the enterprise
distros, which array users tend to use anyway. But I'd really love to
get the ball rolling upstream sooner rather than later even if there is
no need for immediate action. The reason is that request-based dm
multipath causes a lot of special casing in the block core and dm, and
that has become a significant maintenance burden.

So if you have a chance to test the current bio-based path, or help
with testing a plug callback (I can try to write a patch for that, but
I don't really have a good test setup), we can start to see where we
are in practice in terms of still needing the request-based stacking,
and whether there is a feasible path to move away from it.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2e28e2360c85..aad2ac1353d1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1983,7 +1983,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	blk_queue_max_hw_sectors(q, shost->max_sectors);
+	/* Align to kb to avoid conflicts with Sysfs settings */
+	blk_queue_max_hw_sectors(q, shost->max_sectors & ~0x1);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
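[Editor's note] Taken on its own, the masking the patch adds does the following; this is a standalone illustration of the `& ~0x1` expression, assuming max_sectors is counted in 512-byte sectors as elsewhere in the thread.

```c
#include <assert.h>

/* Clear the low bit so the sector count is even, i.e. a whole
 * number of kilobytes. An even value survives the kb-based
 * sysfs read-then-write round-trip unchanged. */
static unsigned int align_to_kb(unsigned int max_sectors)
{
	return max_sectors & ~0x1u;
}
```

An odd host limit such as 2049 sectors becomes 2048, matching what user space would store back after reading max_sectors_kb; even values are unaffected.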
max_sectors can be modified via sysfs, but only in kb units, which
leads to a misalignment on stacked devices if the original max_sectors
value is an odd number. So align the max_sectors setting to kb to avoid
this issue.

Reported-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/scsi/scsi_lib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)