Message ID | be78dc2cfeecaafd171060fbebda2d268d2a94e5.camel@suse.com |
---|---|
State | Not Applicable |
Series | mpt3sas fails to allocate budget_map and detects no devices |
On Wed, Jan 05, 2022 at 06:00:41PM +0000, Martin Wilck wrote:
> Hello Ming, Sreekanth,
>
> I'm observing a problem where mpt3sas fails to allocate the budget_map for any SCSI device, because attempted allocation is larger than the maximum possible. The issue is caused by the logic used in 020b0f0a3192 ("scsi: core: Replace sdev->device_busy with sbitmap") to calculate the bitmap size. This is observed with 5.16-rc8.
>
> The controller at hand has properties can_queue=29865 and cmd_per_lun=7. The way these parameters are used in scsi_alloc_sdev()->

Those two parameters look bad: can_queue is too big, while cmd_per_lun is very small.

> sbitmap_init_node(), this results in an sbitmap with 29865 maps, where only a single bit is used per map. On x86_64, this results in an attempt to allocate 29865 * 192 = 5734080 bytes for the sbitmap, which is larger than PAGE_SIZE * (1 << (MAX_ORDER - 1)), and fails.

Bart has posted one patch for fixing the issue:

https://lore.kernel.org/linux-scsi/20211203231950.193369-2-bvanassche@acm.org/

but it isn't merged yet.

Martin, can we merge the above patch for fixing this issue?

Thanks,
Ming
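To make the failing comparison concrete: with shift 0 the sbitmap needs one struct sbitmap_word per bit, and the numbers quoted above run straight into the buddy allocator's order limit. The sketch below just redoes that arithmetic in userspace; the 192 bytes per word and MAX_ORDER = 11 are the usual x86_64 values for kernels of this era, stated here as assumptions rather than taken from the mail.

```c
/*
 * Sketch of why the budget_map allocation fails, assuming 192 bytes
 * per sbitmap word (three 64-byte cache lines, as discussed in this
 * thread) and the x86_64 defaults PAGE_SIZE = 4096, MAX_ORDER = 11.
 */
#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;
        unsigned long max_order = 11;          /* assumed x86_64 default */
        unsigned long bytes_per_word = 192;    /* assumed sizeof(struct sbitmap_word) */
        unsigned long map_nr = 29865;          /* one word per bit when shift == 0 */

        unsigned long wanted = map_nr * bytes_per_word;
        unsigned long limit  = page_size * (1UL << (max_order - 1));

        /* 5734080 bytes requested vs. 4194304 bytes (4 MiB) maximum */
        printf("wanted %lu, limit %lu -> %s\n", wanted, limit,
               wanted > limit ? "allocation fails" : "ok");
        return 0;
}
```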
Ming,

> Bart has posted one patch for fixing the issue:
>
> https://lore.kernel.org/linux-scsi/20211203231950.193369-2-bvanassche@acm.org/
>
> but it isn't merged yet.
>
> Martin, can we merge the above patch for fixing this issue?

Already merged:

https://git.kernel.org/mkp/scsi/c/4bc3bffc1a88

Since it was part of the UFS series it went into the 5.17 queue, though. It would probably have been more appropriate as a separate patch for 5.16/scsi-fixes. Anyway. It'll go to Linus once the merge window opens.
On Thu, 2022-01-06 at 11:03 +0800, Ming Lei wrote:
> On Wed, Jan 05, 2022 at 06:00:41PM +0000, Martin Wilck wrote:
> > Hello Ming, Sreekanth,
> >
> > I'm observing a problem where mpt3sas fails to allocate the budget_map for any SCSI device, because attempted allocation is larger than the maximum possible. The issue is caused by the logic used in 020b0f0a3192 ("scsi: core: Replace sdev->device_busy with sbitmap") to calculate the bitmap size. This is observed with 5.16-rc8.
> >
> > The controller at hand has properties can_queue=29865 and cmd_per_lun=7. The way these parameters are used in scsi_alloc_sdev()->
>
> Those two parameters look bad: can_queue is too big, while cmd_per_lun is very small.

Right. @Sreekanth, can you comment on that?

> > sbitmap_init_node(), this results in an sbitmap with 29865 maps, where only a single bit is used per map. On x86_64, this results in an attempt to allocate 29865 * 192 = 5734080 bytes for the sbitmap, which is larger than PAGE_SIZE * (1 << (MAX_ORDER - 1)), and fails.
>
> Bart has posted one patch for fixing the issue:
>
> https://lore.kernel.org/linux-scsi/20211203231950.193369-2-bvanassche@acm.org/
>
> but it isn't merged yet.

Thanks a lot for pointing that out. I had a faint recollection of it but failed to locate it yesterday.

This fixes the allocation failure, but not the fact that for cmd_per_lun = 7 (hard-coded in mpt3sas) only a single bit per sbitmap word is used, and the resulting ratio between allocated and used memory is ridiculous. We'd still allocate roughly 200 KiB, or 48 contiguous pages, of which no more than 2048 bits / 256 bytes would be used (*); in other words, at least 99.87% of the allocated memory would be wasted. In the default case (queue depth left unchanged), we'd use only 2 bytes and waste 99.9989%.

When calculating the sbitmap shift, would you agree that it makes sense to start with the desired number of separate cache lines, as my proposed patch did? The core sbitmap code assumes that 4-8 separate cache lines are a reasonable value for moderate (4 <= d <= 512) bitmap depth. I believe we should aim for that in the SCSI code, too.

Admittedly, the downside is that in the default case (queue depth unchanged), only a single cache line would be used in the mpt3sas scenario.

Alternatively, we could inhibit increasing the device queue depth above a certain multiple of cmd_per_lun, and size the sbitmap by that limit. My gut feeling says that if cmd_per_lun == 7, it makes sense to use a limit of 32. That way the bitmap would fit into 2 pages; we'd still waste a lot, but it wouldn't matter much in absolute numbers. Thus we could forbid increasing the queue depth to more than the power of 2 above 4*cmd_per_lun. Does this make sense?

Regards
Martin

(*) this calculation ignores the use of sb->map[i].depth. Taking it into account wouldn't change much.
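For readers following the shift discussion: the heuristic in question is sbitmap_calculate_shift(), which starts from ilog2(BITS_PER_LONG) and shrinks the per-word bit count until the map spans at least four words. The snippet below is a userspace paraphrase of the 5.16-era helper (treat include/linux/sbitmap.h as authoritative); it reproduces the shift values quoted in this thread.

```c
/* Userspace paraphrase of sbitmap_calculate_shift(); a sketch, not the kernel source. */
#include <stdio.h>

#define BITS_PER_LONG 64        /* x86_64 */

static int ilog2_u(unsigned int x)
{
        int log = -1;

        while (x) {
                log++;
                x >>= 1;
        }
        return log;
}

static int sbitmap_calculate_shift(unsigned int depth)
{
        int shift = ilog2_u(BITS_PER_LONG);

        /*
         * For small depths, shrink the number of bits per word so the
         * map spreads over at least a few (4-8) cache lines. Depths
         * below 4 keep the full word size.
         */
        if (depth >= 4) {
                while ((4U << shift) > depth)
                        shift--;
        }
        return shift;
}

int main(void)
{
        unsigned int depths[] = { 3, 7, 64, 254, 1024 };

        /* prints: 3 -> 6, 7 -> 0, 64 -> 4, 254 -> 5, 1024 -> 6 */
        for (unsigned int i = 0; i < sizeof(depths) / sizeof(depths[0]); i++)
                printf("depth %4u -> shift %d (%u bits per word)\n",
                       depths[i], sbitmap_calculate_shift(depths[i]),
                       1U << sbitmap_calculate_shift(depths[i]));
        return 0;
}
```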
On Thu, Jan 06, 2022 at 10:26:01AM +0000, Martin Wilck wrote:
> On Thu, 2022-01-06 at 11:03 +0800, Ming Lei wrote:
> > On Wed, Jan 05, 2022 at 06:00:41PM +0000, Martin Wilck wrote:
> > > Hello Ming, Sreekanth,
> > >
> > > I'm observing a problem where mpt3sas fails to allocate the budget_map for any SCSI device, because attempted allocation is larger than the maximum possible. The issue is caused by the logic used in 020b0f0a3192 ("scsi: core: Replace sdev->device_busy with sbitmap") to calculate the bitmap size. This is observed with 5.16-rc8.
> > >
> > > The controller at hand has properties can_queue=29865 and cmd_per_lun=7. The way these parameters are used in scsi_alloc_sdev()->
> >
> > Those two parameters look bad: can_queue is too big, while cmd_per_lun is very small.
>
> Right. @Sreekanth, can you comment on that?
>
> > > sbitmap_init_node(), this results in an sbitmap with 29865 maps, where only a single bit is used per map. On x86_64, this results in an attempt to allocate 29865 * 192 = 5734080 bytes for the sbitmap, which is larger than PAGE_SIZE * (1 << (MAX_ORDER - 1)), and fails.
> >
> > Bart has posted one patch for fixing the issue:
> >
> > https://lore.kernel.org/linux-scsi/20211203231950.193369-2-bvanassche@acm.org/
> >
> > but it isn't merged yet.
>
> Thanks a lot for pointing that out. I had a faint recollection of it but failed to locate it yesterday.
>
> This fixes the allocation failure, but not the fact that for cmd_per_lun = 7 (hard-coded in mpt3sas) only a single bit per sbitmap word is used, and the resulting ratio between allocated and used memory is ridiculous. We'd still allocate roughly 200 KiB, or 48 contiguous pages, of which no more than 2048 bits / 256 bytes would be used (*); in other words, at least 99.87% of the allocated memory would be wasted. In the default case (queue depth left unchanged), we'd use only 2 bytes and waste 99.9989%.

The problem is that mpt3sas uses very weird .can_queue and .cmd_per_lun values, which shouldn't be common. If .cmd_per_lun is too small, we have to spread the bits over as many cache lines as possible to avoid cache line ping-pong, and that sbitmap principle isn't wrong.

Also, we may reduce the max queue depth of 1024 too; usually we don't need such a big queue depth to saturate one SSD, and the number should be based on NVMe's setting.

> When calculating the sbitmap shift, would you agree that it makes sense to start with the desired number of separate cache lines, as my proposed patch did? The core sbitmap code assumes that 4-8 separate cache lines are a reasonable value for moderate (4 <= d <= 512) bitmap depth. I believe we should aim for that in the SCSI code, too.

Here the actual depth is .cmd_per_lun (7), which is too small, so each bit may have to consume one standalone cache line.

> Admittedly, the downside is that in the default case (queue depth unchanged), only a single cache line would be used in the mpt3sas scenario.
>
> Alternatively, we could inhibit increasing the device queue depth above a certain multiple of cmd_per_lun, and size the sbitmap by that limit. My gut feeling says that if cmd_per_lun == 7, it makes sense to use a limit of 32. That way the bitmap would fit into 2 pages; we'd still waste a lot, but it wouldn't matter much in absolute numbers. Thus we could forbid increasing the queue depth to more than the power of 2 above 4*cmd_per_lun. Does this make sense?

I'd suggest fixing mpt3sas to avoid this memory waste.

> Regards
> Martin
>
> (*) this calculation ignores the use of sb->map[i].depth. Taking it into account wouldn't change much.

Yeah, I actually have a patch to remove sb->map[].depth, which can reduce each map's size by 1/3.

Thanks,
Ming
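For context on the 192-byte figure used earlier and the one-third saving mentioned here: in the 5.16-era include/linux/sbitmap.h each word of the bitmap carries depth, word, and cleared fields, each pushed onto its own cache line. The layout below is an approximation compiled in userspace (the alignment macro is a stand-in), not a verbatim copy of the kernel header.

```c
#include <stdio.h>

/* userspace stand-in for the kernel annotation; 64-byte cache lines assumed */
#define ____cacheline_aligned_in_smp __attribute__((__aligned__(64)))

/* approximate 5.16-era layout; see include/linux/sbitmap.h for the real one */
struct sbitmap_word {
        unsigned long depth;                                    /* bits used in @word */
        unsigned long word ____cacheline_aligned_in_smp;        /* free bits */
        unsigned long cleared ____cacheline_aligned_in_smp;     /* lazily cleared bits */
} ____cacheline_aligned_in_smp;

int main(void)
{
        /* prints 192 on x86_64: three cache lines per word; dropping
         * @depth would leave two, i.e. the ~1/3 reduction mentioned above */
        printf("sizeof(struct sbitmap_word) = %zu\n", sizeof(struct sbitmap_word));
        return 0;
}
```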
On Thu, 2022-01-06 at 23:00 +0800, Ming Lei wrote:
> On Thu, Jan 06, 2022 at 10:26:01AM +0000, Martin Wilck wrote:
> >
> > Alternatively, we could inhibit increasing the device queue depth above a certain multiple of cmd_per_lun, and size the sbitmap by that limit. My gut feeling says that if cmd_per_lun == 7, it makes sense to use a limit of 32. That way the bitmap would fit into 2 pages; we'd still waste a lot, but it wouldn't matter much in absolute numbers. Thus we could forbid increasing the queue depth to more than the power of 2 above 4*cmd_per_lun. Does this make sense?
>
> I'd suggest fixing mpt3sas to avoid this memory waste.

Let's wait for Sreekanth's comment on that.

mpt3sas is not the only driver using a low value. Qlogic drivers set cmd_per_lun=3, for example (with 3, our logic would use shift=6, so the issue I observed wouldn't occur - but it would be prone to cache line bouncing).

> > (*) this calculation ignores the use of sb->map[i].depth. Taking it into account wouldn't change much.
>
> Yeah, I actually have a patch to remove sb->map[].depth, which can reduce each map's size by 1/3.

That sounds like a great idea to me. I've also been wondering whether it wouldn't be possible to use more than a single word in a cache line (given a high enough number of cache lines).

Martin
On Thu, Jan 06, 2022 at 03:22:53PM +0000, Martin Wilck wrote:
> On Thu, 2022-01-06 at 23:00 +0800, Ming Lei wrote:
> > On Thu, Jan 06, 2022 at 10:26:01AM +0000, Martin Wilck wrote:
> > >
> > > Alternatively, we could inhibit increasing the device queue depth above a certain multiple of cmd_per_lun, and size the sbitmap by that limit. My gut feeling says that if cmd_per_lun == 7, it makes sense to use a limit of 32. That way the bitmap would fit into 2 pages; we'd still waste a lot, but it wouldn't matter much in absolute numbers. Thus we could forbid increasing the queue depth to more than the power of 2 above 4*cmd_per_lun. Does this make sense?
> >
> > I'd suggest fixing mpt3sas to avoid this memory waste.
>
> Let's wait for Sreekanth's comment on that.
>
> mpt3sas is not the only driver using a low value. Qlogic drivers set cmd_per_lun=3, for example (with 3, our logic would use shift=6, so the issue I observed wouldn't occur - but it would be prone to cache line bouncing).

But qlogic has a smaller .can_queue, which looks to be at most 512, and .can_queue is the depth used for allocating the sbitmap, since each sdev->queue_depth is <= .can_queue.

> > > (*) this calculation ignores the use of sb->map[i].depth. Taking it into account wouldn't change much.
> >
> > Yeah, I actually have a patch to remove sb->map[].depth, which can reduce each map's size by 1/3.
>
> That sounds like a great idea to me. I've also been wondering whether

I will post the patch after running some benchmarks to make sure there is no performance regression.

> it wouldn't be possible to use more than a single word in a cache line (given a high enough number of cache lines).

It seems possible to use more than one word in one map if the depth is high enough, but that still has nothing to do with this case.

Thanks,
Ming
On Thu, 2022-01-06 at 23:41 +0800, Ming Lei wrote:
> On Thu, Jan 06, 2022 at 03:22:53PM +0000, Martin Wilck wrote:
> >
> > > I'd suggest fixing mpt3sas to avoid this memory waste.
> >
> > Let's wait for Sreekanth's comment on that.
> >
> > mpt3sas is not the only driver using a low value. Qlogic drivers set cmd_per_lun=3, for example (with 3, our logic would use shift=6, so the issue I observed wouldn't occur - but it would be prone to cache line bouncing).
>
> But qlogic has a smaller .can_queue, which looks to be at most 512, and .can_queue is the depth used for allocating the sbitmap, since each sdev->queue_depth is <= .can_queue.

I'm seeing here (on an old kernel, admittedly) cmd_per_lun=3 and can_queue=2038 for qla2xxx, and cmd_per_lun=3 and can_queue=5884 for lpfc. Both drivers change the queue depth for devices to 64 in their slave_configure() methods.

Many drivers do this, as it's recommended in scsi_host.h. That's quite bad in view of the current bitmap allocation logic - we lay out the bitmap assuming the depth used will be cmd_per_lun, but that doesn't match the actual depth when the device comes online. For qla2xxx, it means that we'd allocate the sbitmap with shift=6 (64 bits per word), thus using just a single cache line for 64 requests. Shift=4 (16 bits per word) would be the default shift for depth 64.

Am I misreading the code? Perhaps we should only allocate a preliminary sbitmap in scsi_alloc_sdev(), and reallocate it after slave_configure() has been called, to get the shift right for the driver's default settings?

Thanks,
Martin
On Thu, Jan 06, 2022 at 04:19:03PM +0000, Martin Wilck wrote:
> On Thu, 2022-01-06 at 23:41 +0800, Ming Lei wrote:
> > On Thu, Jan 06, 2022 at 03:22:53PM +0000, Martin Wilck wrote:
> > >
> > > > I'd suggest fixing mpt3sas to avoid this memory waste.
> > >
> > > Let's wait for Sreekanth's comment on that.
> > >
> > > mpt3sas is not the only driver using a low value. Qlogic drivers set cmd_per_lun=3, for example (with 3, our logic would use shift=6, so the issue I observed wouldn't occur - but it would be prone to cache line bouncing).
> >
> > But qlogic has a smaller .can_queue, which looks to be at most 512, and .can_queue is the depth used for allocating the sbitmap, since each sdev->queue_depth is <= .can_queue.
>
> I'm seeing here (on an old kernel, admittedly) cmd_per_lun=3 and can_queue=2038 for qla2xxx, and cmd_per_lun=3 and can_queue=5884 for lpfc. Both drivers change the queue depth for devices to 64 in their slave_configure() methods.
>
> Many drivers do this, as it's recommended in scsi_host.h. That's quite bad in view of the current bitmap allocation logic - we lay out the bitmap assuming the depth used will be cmd_per_lun, but that doesn't match the actual depth when the device comes online. For qla2xxx, it means that we'd allocate the sbitmap with shift=6 (64 bits per word), thus using just a single cache line for 64 requests. Shift=4 (16 bits per word) would be the default shift for depth 64.
>
> Am I misreading the code? Perhaps we should only allocate a preliminary sbitmap in scsi_alloc_sdev(), and reallocate it after slave_configure() has been called, to get the shift right for the driver's default settings?

Reallocating it after ->slave_configure() returns looks fine, but we need to freeze the request queue to avoid any in-flight SCSI commands. At that point, freezing should be quick enough.

Thanks,
Ming
On Fri, Jan 07, 2022 at 12:33:05AM +0800, Ming Lei wrote:
> On Thu, Jan 06, 2022 at 04:19:03PM +0000, Martin Wilck wrote:
> > On Thu, 2022-01-06 at 23:41 +0800, Ming Lei wrote:
> > > On Thu, Jan 06, 2022 at 03:22:53PM +0000, Martin Wilck wrote:
> > > >
> > > > > I'd suggest fixing mpt3sas to avoid this memory waste.
> > > >
> > > > Let's wait for Sreekanth's comment on that.
> > > >
> > > > mpt3sas is not the only driver using a low value. Qlogic drivers set cmd_per_lun=3, for example (with 3, our logic would use shift=6, so the issue I observed wouldn't occur - but it would be prone to cache line bouncing).
> > >
> > > But qlogic has a smaller .can_queue, which looks to be at most 512, and .can_queue is the depth used for allocating the sbitmap, since each sdev->queue_depth is <= .can_queue.
> >
> > I'm seeing here (on an old kernel, admittedly) cmd_per_lun=3 and can_queue=2038 for qla2xxx, and cmd_per_lun=3 and can_queue=5884 for lpfc. Both drivers change the queue depth for devices to 64 in their slave_configure() methods.
> >
> > Many drivers do this, as it's recommended in scsi_host.h. That's quite bad in view of the current bitmap allocation logic - we lay out the bitmap assuming the depth used will be cmd_per_lun, but that doesn't match the actual depth when the device comes online. For qla2xxx, it means that we'd allocate the sbitmap with shift=6 (64 bits per word), thus using just a single cache line for 64 requests. Shift=4 (16 bits per word) would be the default shift for depth 64.
> >
> > Am I misreading the code? Perhaps we should only allocate a preliminary sbitmap in scsi_alloc_sdev(), and reallocate it after slave_configure() has been called, to get the shift right for the driver's default settings?
>
> Reallocating it after ->slave_configure() returns looks fine, but we need to freeze the request queue to avoid any in-flight SCSI commands. At that point, freezing should be quick enough.

Hello Martin Wilck,

Can you test the following change and report back the result?

From 480a61a85e9669d3487ebee8db3d387df79279fc Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Mon, 10 Jan 2022 10:26:59 +0800
Subject: [PATCH] scsi: core: reallocate scsi device's budget map if default queue depth is changed

Martin reported that sdev->queue_depth is often changed in
->slave_configure(), while we currently use ->cmd_per_lun as the initial
queue depth when setting up sdev->budget_map. Some extreme ->cmd_per_lun
or ->can_queue values are never actually used by default, but if they
are used to allocate sdev->budget_map, huge memory may be consumed just
because of a bad ->cmd_per_lun.

Fix the issue by reallocating sdev->budget_map after ->slave_configure()
returns; at that time, queue_depth should be much more reasonable.

Reported-by: Martin Wilck <martin.wilck@suse.com>
Suggested-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_scan.c | 56 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 23e1c0acdeae..9593c9111611 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -214,6 +214,48 @@ static void scsi_unlock_floptical(struct scsi_device *sdev,
                          SCSI_TIMEOUT, 3, NULL);
 }
 
+static int scsi_realloc_sdev_budget_map(struct scsi_device *sdev,
+                                        unsigned int depth)
+{
+        int new_shift = sbitmap_calculate_shift(depth);
+        bool need_alloc = !sdev->budget_map.map;
+        bool need_free = false;
+        int ret;
+        struct sbitmap sb_back;
+
+        /*
+         * realloc if new shift is calculated, which is caused by setting
+         * up one new default queue depth after calling ->slave_configure
+         */
+        if (!need_alloc && new_shift != sdev->budget_map.shift)
+                need_alloc = need_free = true;
+
+        if (!need_alloc)
+                return 0;
+
+        /*
+         * Request queue has to be frozen for reallocating the budget map,
+         * and here the disk isn't added yet, so freezing is pretty fast
+         */
+        if (need_free) {
+                blk_mq_freeze_queue(sdev->request_queue);
+                sb_back = sdev->budget_map;
+        }
+        ret = sbitmap_init_node(&sdev->budget_map,
+                                scsi_device_max_queue_depth(sdev),
+                                new_shift, GFP_KERNEL,
+                                sdev->request_queue->node, false, true);
+        if (need_free) {
+                if (ret)
+                        sdev->budget_map = sb_back;
+                else
+                        sbitmap_free(&sb_back);
+                ret = 0;
+                blk_mq_unfreeze_queue(sdev->request_queue);
+        }
+        return ret;
+}
+
 /**
  * scsi_alloc_sdev - allocate and setup a scsi_Device
  * @starget: which target to allocate a &scsi_device for
@@ -306,11 +348,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
          * default device queue depth to figure out sbitmap shift
          * since we use this queue depth most of times.
          */
-        if (sbitmap_init_node(&sdev->budget_map,
-                              scsi_device_max_queue_depth(sdev),
-                              sbitmap_calculate_shift(depth),
-                              GFP_KERNEL, sdev->request_queue->node,
-                              false, true)) {
+        if (scsi_realloc_sdev_budget_map(sdev, depth)) {
                 put_device(&starget->dev);
                 kfree(sdev);
                 goto out;
@@ -1017,6 +1055,14 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
                         }
                         return SCSI_SCAN_NO_RESPONSE;
                 }
+
+                /*
+                 * queue_depth is often changed in ->slave_configure, so
+                 * set up the budget map again to get better memory use,
+                 * since memory consumption of the map depends heavily on
+                 * queue depth
+                 */
+                scsi_realloc_sdev_budget_map(sdev, sdev->queue_depth);
         }
 
         if (sdev->scsi_level >= SCSI_3)
On Mon, 2022-01-10 at 10:59 +0800, Ming Lei wrote:
>
> Hello Martin Wilck,
>
> Can you test the following change and report back the result?
>
> From 480a61a85e9669d3487ebee8db3d387df79279fc Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Mon, 10 Jan 2022 10:26:59 +0800
> Subject: [PATCH] scsi: core: reallocate scsi device's budget map if default queue depth is changed
>
> Martin reported that sdev->queue_depth is often changed in ->slave_configure(), while we currently use ->cmd_per_lun as the initial queue depth when setting up sdev->budget_map. Some extreme ->cmd_per_lun or ->can_queue values are never actually used by default, but if they are used to allocate sdev->budget_map, huge memory may be consumed just because of a bad ->cmd_per_lun.
>
> Fix the issue by reallocating sdev->budget_map after ->slave_configure() returns; at that time, queue_depth should be much more reasonable.
>
> Reported-by: Martin Wilck <martin.wilck@suse.com>
> Suggested-by: Martin Wilck <martin.wilck@suse.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

This looks good. I added a few pr_infos, and for the strange mpt3sas devices I reported, I get this:

# first allocation with depth=7 (cmd_per_lun)
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 7 0->0
(these numbers are: depth old_shift->new_shift)
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 1024

# after slave_configure() with depth 254
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 254 0->5
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 32

So the depth changed from 7 to 254, the shift from 0 to 5, and the memory size of the sbitmap was reduced by a factor of 32. Nice!

Tested-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
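A quick sanity check of the reported factor of 32, assuming the budget map depth stays at 1024 (which is what the map_nr = 1024 line above implies) and using sbitmap's map_nr = DIV_ROUND_UP(depth, 1 << shift); the 192 bytes per word is the same assumption as earlier in the thread.

```c
#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        unsigned int depth = 1024;            /* budget map depth, per the log above */
        unsigned int bytes_per_word = 192;    /* assumed, see earlier in the thread */

        unsigned int before = DIV_ROUND_UP(depth, 1U << 0);  /* shift 0 -> 1024 words */
        unsigned int after  = DIV_ROUND_UP(depth, 1U << 5);  /* shift 5 ->   32 words */

        printf("words: %u -> %u, bytes: %u -> %u (factor %u)\n",
               before, after, before * bytes_per_word, after * bytes_per_word,
               before / after);
        return 0;
}
```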
On Wed, 2022-01-12 at 17:59 +0100, Martin Wilck wrote:
> On Mon, 2022-01-10 at 10:59 +0800, Ming Lei wrote:
> >
> > Hello Martin Wilck,
> >
> > Can you test the following change and report back the result?
> >
> > From 480a61a85e9669d3487ebee8db3d387df79279fc Mon Sep 17 00:00:00 2001
> > From: Ming Lei <ming.lei@redhat.com>
> > Date: Mon, 10 Jan 2022 10:26:59 +0800
> > Subject: [PATCH] scsi: core: reallocate scsi device's budget map if default queue depth is changed
> >
> > Martin reported that sdev->queue_depth is often changed in ->slave_configure(), while we currently use ->cmd_per_lun as the initial queue depth when setting up sdev->budget_map. Some extreme ->cmd_per_lun or ->can_queue values are never actually used by default, but if they are used to allocate sdev->budget_map, huge memory may be consumed just because of a bad ->cmd_per_lun.
> >
> > Fix the issue by reallocating sdev->budget_map after ->slave_configure() returns; at that time, queue_depth should be much more reasonable.
> >
> > Reported-by: Martin Wilck <martin.wilck@suse.com>
> > Suggested-by: Martin Wilck <martin.wilck@suse.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> This looks good. I added a few pr_infos, and for the strange mpt3sas devices I reported, I get this:
>
> # first allocation with depth=7 (cmd_per_lun)
> Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 7 0->0
> (these numbers are: depth old_shift->new_shift)
> Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 1024
>
> # after slave_configure() with depth 254
> Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 254 0->5
> Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 32
>
> So the depth changed from 7 to 254, the shift from 0 to 5, and the memory size of the sbitmap was reduced by a factor of 32. Nice!
>
> Tested-by: Martin Wilck <mwilck@suse.com>
> Reviewed-by: Martin Wilck <mwilck@suse.com>

So, how do we proceed with this patch?

Regards
Martin
On Tue, Jan 25, 2022 at 04:29:37PM +0000, Martin Wilck wrote:
> On Wed, 2022-01-12 at 17:59 +0100, Martin Wilck wrote:
> > On Mon, 2022-01-10 at 10:59 +0800, Ming Lei wrote:
> > >
> > > Hello Martin Wilck,
> > >
> > > Can you test the following change and report back the result?
> > >
> > > From 480a61a85e9669d3487ebee8db3d387df79279fc Mon Sep 17 00:00:00 2001
> > > From: Ming Lei <ming.lei@redhat.com>
> > > Date: Mon, 10 Jan 2022 10:26:59 +0800
> > > Subject: [PATCH] scsi: core: reallocate scsi device's budget map if default queue depth is changed
> > >
> > > Martin reported that sdev->queue_depth is often changed in ->slave_configure(), while we currently use ->cmd_per_lun as the initial queue depth when setting up sdev->budget_map. Some extreme ->cmd_per_lun or ->can_queue values are never actually used by default, but if they are used to allocate sdev->budget_map, huge memory may be consumed just because of a bad ->cmd_per_lun.
> > >
> > > Fix the issue by reallocating sdev->budget_map after ->slave_configure() returns; at that time, queue_depth should be much more reasonable.
> > >
> > > Reported-by: Martin Wilck <martin.wilck@suse.com>
> > > Suggested-by: Martin Wilck <martin.wilck@suse.com>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >
> > This looks good. I added a few pr_infos, and for the strange mpt3sas devices I reported, I get this:
> >
> > # first allocation with depth=7 (cmd_per_lun)
> > Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 7 0->0
> > (these numbers are: depth old_shift->new_shift)
> > Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 1024
> >
> > # after slave_configure() with depth 254
> > Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 254 0->5
> > Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 32
> >
> > So the depth changed from 7 to 254, the shift from 0 to 5, and the memory size of the sbitmap was reduced by a factor of 32. Nice!
> >
> > Tested-by: Martin Wilck <mwilck@suse.com>
> > Reviewed-by: Martin Wilck <mwilck@suse.com>
>
> So, how do we proceed with this patch?

It looks like 5.18/scsi is open now, so I will submit a formal patch on linux-scsi soon.

Thanks,
Ming
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 23e1c0acdeae..3e51f8183b42 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -231,7 +231,8 @@ static void scsi_unlock_floptical(struct scsi_device *sdev,
 static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
                                            u64 lun, void *hostdata)
 {
-        unsigned int depth;
+        static const unsigned int BITMAP_CHUNKS = 16;
+        unsigned int depth, sb_map_chunks, sb_depth;
         struct scsi_device *sdev;
         struct request_queue *q;
         int display_failure_msg = 1, ret;
@@ -298,17 +299,26 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
         __scsi_init_queue(sdev->host, q);
         WARN_ON_ONCE(!blk_get_queue(q));
 
-        depth = sdev->host->cmd_per_lun ?: 1;
-
         /*
          * Use .can_queue as budget map's depth because we have to
          * support adjusting queue depth from sysfs. Meantime use
          * default device queue depth to figure out sbitmap shift
          * since we use this queue depth most of times.
+         * Avoid extremely small shift if cmd_per_lun is small but
+         * greater than 3 (see sbitmap_calculate_shift()).
+         * Assume that usually no more than BITMAP_CHUNKS
+         * CPUs will access this bitmap simultaneously.
          */
+        depth = sdev->host->cmd_per_lun ?: 1;
+        sb_map_chunks = max_t(unsigned int, 1U,
+                              min_t(unsigned int, nr_cpu_ids, BITMAP_CHUNKS));
+        sb_depth = max_t(unsigned int,
+                         scsi_device_max_queue_depth(sdev) / sb_map_chunks,
+                         depth);
+
         if (sbitmap_init_node(&sdev->budget_map,
                               scsi_device_max_queue_depth(sdev),
-                              sbitmap_calculate_shift(depth),
+                              sbitmap_calculate_shift(sb_depth),
                               GFP_KERNEL, sdev->request_queue->node,
                               false, true)) {
                 put_device(&starget->dev);
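For comparison with the reallocation approach that was eventually tested, here is roughly what this BITMAP_CHUNKS proposal would have picked for the reported mpt3sas host. The budget-map depth of 1024 and nr_cpu_ids >= 16 are assumptions for illustration, not values taken from the patch.

```c
/*
 * Rough numbers for the BITMAP_CHUNKS proposal above -- a sketch,
 * assuming scsi_device_max_queue_depth() yields 1024 here and
 * nr_cpu_ids >= 16 (both assumptions).
 */
#include <stdio.h>

int main(void)
{
        unsigned int cmd_per_lun = 7;
        unsigned int budget_depth = 1024;              /* assumed max queue depth */
        unsigned int chunks = 16;                      /* min(nr_cpu_ids, BITMAP_CHUNKS) */
        unsigned int sb_depth = budget_depth / chunks; /* 64 */

        if (sb_depth < cmd_per_lun)
                sb_depth = cmd_per_lun;

        /* sbitmap_calculate_shift(64) == 4 per the heuristic sketched earlier */
        unsigned int shift = 4;
        unsigned int words = (budget_depth + (1U << shift) - 1) / (1U << shift);

        /* 64 words * 192 bytes = 12 KiB instead of 1024 words * 192 bytes = 192 KiB */
        printf("sb_depth=%u shift=%u words=%u bytes=%u\n",
               sb_depth, shift, words, words * 192);
        return 0;
}
```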