Message ID | 20200514015452.1055278-1-damien.lemoal@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Series | nvme: Fix io_opt limit setting |
Damien,

> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).

We'll need to get that stacking logic fixed up to take io_opt into
account when scaling pbs/min. Just as a safety measure in case we don't
catch devices reporting crazy values in the LLDs.

> Fix this by not setting the optimal IO size limit if the namespace
> does not report an optimal write size value.

Setting io_opt to the logical block size in the NVMe driver is
equivalent to telling the filesystems that they should not submit I/Os
larger than one sector. That makes no sense. This change is correct.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
> Currently, a namespace io_opt queue limit is set by default to the
> physical sector size of the namespace and to the write optimal
> size (NOWS) when the namespace reports this value. This causes problems
> with block limits stacking in blk_stack_limits() when a namespace block
> device is combined with an HDD which generally do not report any optimal
> transfer size (io_opt limit is 0). The code:
>
> 	/* Optimal I/O a multiple of the physical block size? */
> 	if (t->io_opt & (t->physical_block_size - 1)) {
> 		t->io_opt = 0;
> 		t->misaligned = 1;
> 		ret = -1;
> 	}
>
> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).
>
> Fix this by not setting the optimal IO size limit if the namespace does
> not report an optimal write size value.

Won't this continue to break if a controller does report NOWS that's not
a multiple of the physical block size of the device it's stacking with?
On 2020/05/14 12:40, Keith Busch wrote:
> On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
>> Currently, a namespace io_opt queue limit is set by default to the
>> physical sector size of the namespace and to the write optimal
>> size (NOWS) when the namespace reports this value. This causes problems
>> with block limits stacking in blk_stack_limits() when a namespace block
>> device is combined with an HDD which generally do not report any optimal
>> transfer size (io_opt limit is 0). The code:
>>
>> 	/* Optimal I/O a multiple of the physical block size? */
>> 	if (t->io_opt & (t->physical_block_size - 1)) {
>> 		t->io_opt = 0;
>> 		t->misaligned = 1;
>> 		ret = -1;
>> 	}
>>
>> results in blk_stack_limits() to return an error when the combined
>> devices have different but compatible physical sector sizes (e.g. 512B
>> sector SSD with 4KB sector disks).
>>
>> Fix this by not setting the optimal IO size limit if the namespace does
>> not report an optimal write size value.
>
> Won't this continue to break if a controller does report NOWS that's not
> a multiple of the physical block size of the device it's stacking with?

When io_opt stacking is handled, the physical sector size for the stacked device
is already resolved to a common value. If the NOWS value cannot accommodate this
resolved physical sector size, this is an incompatible stacking, so failing is
OK in that case.
On Thu, May 14, 2020 at 03:47:56AM +0000, Damien Le Moal wrote:
> On 2020/05/14 12:40, Keith Busch wrote:
> > On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
> >> Currently, a namespace io_opt queue limit is set by default to the
> >> physical sector size of the namespace and to the write optimal
> >> size (NOWS) when the namespace reports this value. This causes problems
> >> with block limits stacking in blk_stack_limits() when a namespace block
> >> device is combined with an HDD which generally do not report any optimal
> >> transfer size (io_opt limit is 0). The code:
> >>
> >> 	/* Optimal I/O a multiple of the physical block size? */
> >> 	if (t->io_opt & (t->physical_block_size - 1)) {
> >> 		t->io_opt = 0;
> >> 		t->misaligned = 1;
> >> 		ret = -1;
> >> 	}
> >>
> >> results in blk_stack_limits() to return an error when the combined
> >> devices have different but compatible physical sector sizes (e.g. 512B
> >> sector SSD with 4KB sector disks).
> >>
> >> Fix this by not setting the optimal IO size limit if the namespace does
> >> not report an optimal write size value.
> >
> > Won't this continue to break if a controller does report NOWS that's not
> > a multiple of the physical block size of the device it's stacking with?
>
> When io_opt stacking is handled, the physical sector size for the stacked device
> is already resolved to a common value. If the NOWS value cannot accommodate this
> resolved physical sector size, this is an incompatible stacking, so failing is
> OK in that case.

I see, though it's not strictly incompatible as io_opt is merely a hint
that could continue to work if the stacked limit was recalculated as:

	if (t->io_opt & (t->physical_block_size - 1))
		t->io_opt = lcm(t->io_opt, t->physical_block_size);

Regardless, your patch does make sense, but it does have a merge
conflict with nvme-5.8.
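The lcm() recalculation Keith suggests can be tried out in isolation. The following is only a userspace sketch under stated assumptions: realign_io_opt(), gcd_u() and lcm_u() are hypothetical helper names (the kernel has its own lcm()), the physical block size is assumed to be a power of two so that the mask test is valid, and overflow of the multiplication is not handled.

```c
#include <assert.h>

/* Greatest common divisor, Euclid's algorithm. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Least common multiple; returns 0 if either input is 0. */
static unsigned int lcm_u(unsigned int a, unsigned int b)
{
	if (!a || !b)
		return 0;
	return a / gcd_u(a, b) * b;
}

/*
 * Hypothetical helper: instead of failing the stacking, round a
 * misaligned io_opt hint up to a multiple of the physical block size.
 * pbs must be a non-zero power of two for the mask test to be valid.
 */
static unsigned int realign_io_opt(unsigned int io_opt, unsigned int pbs)
{
	assert(pbs && !(pbs & (pbs - 1)));

	if (io_opt & (pbs - 1))
		io_opt = lcm_u(io_opt, pbs);
	return io_opt;
}
```

With these definitions, a 512-byte hint stacked on a 4096-byte physical block size becomes 4096, an already aligned hint is returned unchanged, and a hint of 0 (nothing reported) stays 0 because it never fails the mask test.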
On 2020/05/14 13:12, Keith Busch wrote:
> On Thu, May 14, 2020 at 03:47:56AM +0000, Damien Le Moal wrote:
>> On 2020/05/14 12:40, Keith Busch wrote:
>>> On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
>>>> Currently, a namespace io_opt queue limit is set by default to the
>>>> physical sector size of the namespace and to the write optimal
>>>> size (NOWS) when the namespace reports this value. This causes problems
>>>> with block limits stacking in blk_stack_limits() when a namespace block
>>>> device is combined with an HDD which generally do not report any optimal
>>>> transfer size (io_opt limit is 0). The code:
>>>>
>>>> 	/* Optimal I/O a multiple of the physical block size? */
>>>> 	if (t->io_opt & (t->physical_block_size - 1)) {
>>>> 		t->io_opt = 0;
>>>> 		t->misaligned = 1;
>>>> 		ret = -1;
>>>> 	}
>>>>
>>>> results in blk_stack_limits() to return an error when the combined
>>>> devices have different but compatible physical sector sizes (e.g. 512B
>>>> sector SSD with 4KB sector disks).
>>>>
>>>> Fix this by not setting the optimal IO size limit if the namespace does
>>>> not report an optimal write size value.
>>>
>>> Won't this continue to break if a controller does report NOWS that's not
>>> a multiple of the physical block size of the device it's stacking with?
>>
>> When io_opt stacking is handled, the physical sector size for the stacked device
>> is already resolved to a common value. If the NOWS value cannot accommodate this
>> resolved physical sector size, this is an incompatible stacking, so failing is
>> OK in that case.
>
> I see, though it's not strictly incompatible as io_opt is merely a hint
> that could continue to work if the stacked limit was recalculated as:
>
> 	if (t->io_opt & (t->physical_block_size - 1))
> 		t->io_opt = lcm(t->io_opt, t->physical_block_size);
>
> Regardless, your patch does make sense, but it does have a merge
> conflict with nvme-5.8.

Oops. I will rebase and resend.
And maybe we should send your suggestion above as a proper patch?
On 2020-05-13 18:54, Damien Le Moal wrote:
> @@ -1848,7 +1847,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
>  	 */
>  	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
>  	blk_queue_io_min(disk->queue, phys_bs);
> -	blk_queue_io_opt(disk->queue, io_opt);
> +	if (io_opt)
> +		blk_queue_io_opt(disk->queue, io_opt);

The above change looks confusing to me. We want the NVMe driver to set
io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
that the io_opt value will be left to any value set by the block layer
core if io_opt == 0 instead of properly being set to zero.

Thanks,

Bart.
On 2020/05/14 13:47, Bart Van Assche wrote:
> On 2020-05-13 18:54, Damien Le Moal wrote:
>> @@ -1848,7 +1847,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
>>  	 */
>>  	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
>>  	blk_queue_io_min(disk->queue, phys_bs);
>> -	blk_queue_io_opt(disk->queue, io_opt);
>> +	if (io_opt)
>> +		blk_queue_io_opt(disk->queue, io_opt);
>
> The above change looks confusing to me. We want the NVMe driver to set
> io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
> that the io_opt value will be left to any value set by the block layer
> core if io_opt == 0 instead of properly being set to zero.

OK. I will remove the "if".

>
> Thanks,
>
> Bart.
>
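Bart's objection to the conditional call can be pictured with a toy model. This is a sketch only: struct queue, set_io_opt(), update_conditional() and update_unconditional() are hypothetical names, and set_io_opt() is assumed to behave like a plain store of the hint, standing in for blk_queue_io_opt().

```c
#include <assert.h>

/* Hypothetical stand-in for the one queue limit that matters here. */
struct queue {
	unsigned int io_opt;
};

/* Assumed behaviour of the setter: store the hint as given, even if 0. */
static void set_io_opt(struct queue *q, unsigned int opt)
{
	assert(q);
	q->io_opt = opt;
}

/* As in the posted diff: skip the call when no optimal size is reported. */
static void update_conditional(struct queue *q, unsigned int io_opt)
{
	if (io_opt)
		set_io_opt(q, io_opt);
}

/* As agreed in the reply: always store the value, including zero. */
static void update_unconditional(struct queue *q, unsigned int io_opt)
{
	set_io_opt(q, io_opt);
}
```

If the queue already holds a non-zero io_opt, update_conditional(q, 0) leaves that value in place, while update_unconditional(q, 0) resets it to 0. When the queue starts out at 0 the two variants give the same result.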
On 5/14/20 3:54 AM, Damien Le Moal wrote:
> Currently, a namespace io_opt queue limit is set by default to the
> physical sector size of the namespace and to the write optimal
> size (NOWS) when the namespace reports this value. This causes problems
> with block limits stacking in blk_stack_limits() when a namespace block
> device is combined with an HDD which generally do not report any optimal
> transfer size (io_opt limit is 0). The code:
>
> 	/* Optimal I/O a multiple of the physical block size? */
> 	if (t->io_opt & (t->physical_block_size - 1)) {
> 		t->io_opt = 0;
> 		t->misaligned = 1;
> 		ret = -1;
> 	}
>
> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).
>
> Fix this by not setting the optimal IO size limit if the namespace does
> not report an optimal write size value.
>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  drivers/nvme/host/core.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
Ah, so you beat me to it :-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Bart,

> The above change looks confusing to me. We want the NVMe driver to set
> io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
> that the io_opt value will be left to any value set by the block layer
> core if io_opt == 0 instead of properly being set to zero.

We do explicitly set it to 0 when allocating a queue. But no biggie.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3c037f5a9ba..0729173053ed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1809,7 +1809,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 {
 	sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
 	unsigned short bs = 1 << ns->lba_shift;
-	u32 atomic_bs, phys_bs, io_opt;
+	u32 atomic_bs, phys_bs, io_opt = 0;
 
 	if (ns->lba_shift > PAGE_SHIFT) {
 		/* unsupported block size, set capacity to 0 later */
@@ -1832,12 +1832,11 @@ static void nvme_update_disk_info(struct gendisk *disk,
 		atomic_bs = bs;
 	}
 	phys_bs = bs;
-	io_opt = bs;
 	if (id->nsfeat & (1 << 4)) {
 		/* NPWG = Namespace Preferred Write Granularity */
 		phys_bs *= 1 + le16_to_cpu(id->npwg);
 		/* NOWS = Namespace Optimal Write Size */
-		io_opt *= 1 + le16_to_cpu(id->nows);
+		io_opt = bs * (1 + le16_to_cpu(id->nows));
 	}
 
 	blk_queue_logical_block_size(disk->queue, bs);
@@ -1848,7 +1847,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	 */
 	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
 	blk_queue_io_min(disk->queue, phys_bs);
-	blk_queue_io_opt(disk->queue, io_opt);
+	if (io_opt)
+		blk_queue_io_opt(disk->queue, io_opt);
 
 	if (ns->ms && !ns->ext &&
 	    (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
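The io_opt computation after this diff can be exercised outside the kernel. The sketch below makes several assumptions: struct ns_id, NSFEAT_BIT4 and ns_io_opt() are hypothetical names, the Identify Namespace fields are taken to be already converted to CPU byte order, and NOWS is treated as a zero-based count of logical blocks, which is what the `1 + le16_to_cpu(id->nows)` expression in the diff implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the Identify Namespace data. */
struct ns_id {
	uint8_t  nsfeat;	/* namespace features bitmap */
	uint16_t nows;		/* optimal write size, zero-based, in blocks */
};

/* Bit tested by the diff before NPWG/NOWS are used (name is made up). */
#define NSFEAT_BIT4	(1u << 4)

/*
 * Mirror of the patched logic: io_opt stays 0 unless the namespace
 * reports an optimal write size, in which case it is converted from
 * logical blocks to bytes. bs is the logical block size, a power of two.
 */
static uint32_t ns_io_opt(const struct ns_id *id, uint32_t bs)
{
	uint32_t io_opt = 0;

	assert(bs && !(bs & (bs - 1)));

	if (id->nsfeat & NSFEAT_BIT4)
		io_opt = bs * (1 + (uint32_t)id->nows);
	return io_opt;
}
```

For a 512-byte logical block size, a namespace with the feature bit set and NOWS = 7 yields an io_opt of 4096 bytes, while a namespace without the bit yields 0 rather than the logical block size used before the patch.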
Currently, a namespace io_opt queue limit is set by default to the
physical sector size of the namespace and to the write optimal
size (NOWS) when the namespace reports this value. This causes problems
with block limits stacking in blk_stack_limits() when a namespace block
device is combined with an HDD, which generally does not report any
optimal transfer size (io_opt limit is 0). The code:

	/* Optimal I/O a multiple of the physical block size? */
	if (t->io_opt & (t->physical_block_size - 1)) {
		t->io_opt = 0;
		t->misaligned = 1;
		ret = -1;
	}

causes blk_stack_limits() to return an error when the combined
devices have different but compatible physical sector sizes (e.g. 512B
sector SSD with 4KB sector disks).

Fix this by not setting the optimal IO size limit if the namespace does
not report an optimal write size value.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/nvme/host/core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
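The failure described in this commit message can be reproduced with a small userspace model. It is a simplified sketch, not the kernel code: struct limits and stack_limits() are hypothetical stand-ins for struct queue_limits and blk_stack_limits(), only the physical block size and io_opt are modelled, and the combining rules (maximum of the physical block sizes, LCM of the io_opt values with 0 meaning "not reported") are assumptions modelled on the block layer's behaviour.

```c
#include <assert.h>

/* Hypothetical, reduced version of the block layer's queue limits. */
struct limits {
	unsigned int physical_block_size;
	unsigned int io_opt;
	int misaligned;
};

static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* LCM that treats 0 as "not reported" and returns the other value. */
static unsigned int lcm_not_zero_u(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a / gcd_u(a, b) * b;
}

/* Stack bottom device b into top device t; returns 0 or -1. */
static int stack_limits(struct limits *t, const struct limits *b)
{
	int ret = 0;

	/* The mask test below needs a power-of-two physical block size. */
	assert(b->physical_block_size &&
	       !(b->physical_block_size & (b->physical_block_size - 1)));

	if (b->physical_block_size > t->physical_block_size)
		t->physical_block_size = b->physical_block_size;
	t->io_opt = lcm_not_zero_u(t->io_opt, b->io_opt);

	/* Optimal I/O a multiple of the physical block size? */
	if (t->io_opt & (t->physical_block_size - 1)) {
		t->io_opt = 0;
		t->misaligned = 1;
		ret = -1;
	}
	return ret;
}
```

Stacking a namespace that reports io_opt = 512 with an HDD that has a 4096-byte physical block size and io_opt = 0 leaves the top device with io_opt = 512 against a 4096-byte physical block size, so the check fails with -1. With the namespace reporting io_opt = 0, as it does after this patch when NOWS is absent, the same stacking succeeds.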