Message ID | 20190620034812.3254-1-damien.lemoal@wdc.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | sd_zbc: Fix report zones buffer allocation | expand |
This looks like what we discussed last week, so:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On 6/19/19 8:48 PM, Damien Le Moal wrote: > + /* > + * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE > + * size (1MB), allowing up to 16383 zone descriptors being reported with > + * a single command. And make sure that this size does not exceed the > + * hardware capabilities. To avoid disk revalidation failures due to > + * memory allocation errors, retry the allocation with a smaller buffer > + * size if the allocation fails. > + */ > + bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE); > + bufsize = min_t(size_t, bufsize, > + queue_max_hw_sectors(disk->queue) << 9); > + for (order = get_order(bufsize); order >= 0; order--) { > + page = alloc_pages(gfp_mask, order); > + if (page) { > + *buflen = PAGE_SIZE << order; > + return page_address(page); > + } > + } Hi Damien, As you know Linux memory fragmentation tends to increase over time. The above code has the very unfortunate property that the more memory is fragmented the smaller the allocated buffer will become. I don't think that's how kernel code should work. Have you considered to use vmalloc() + blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an example of how to build a scatterlist for memory allocated by vmalloc(). Thanks, Bart.
Bart, On 2019/06/21 0:25, Bart Van Assche wrote: > On 6/19/19 8:48 PM, Damien Le Moal wrote: >> + /* >> + * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE >> + * size (1MB), allowing up to 16383 zone descriptors being reported with >> + * a single command. And make sure that this size does not exceed the >> + * hardware capabilities. To avoid disk revalidation failures due to >> + * memory allocation errors, retry the allocation with a smaller buffer >> + * size if the allocation fails. >> + */ >> + bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE); >> + bufsize = min_t(size_t, bufsize, >> + queue_max_hw_sectors(disk->queue) << 9); >> + for (order = get_order(bufsize); order >= 0; order--) { >> + page = alloc_pages(gfp_mask, order); >> + if (page) { >> + *buflen = PAGE_SIZE << order; >> + return page_address(page); >> + } >> + } > > Hi Damien, > > As you know Linux memory fragmentation tends to increase over time. The > above code has the very unfortunate property that the more memory is > fragmented the smaller the allocated buffer will become. I don't think > that's how kernel code should work. Have you considered to use vmalloc() > + blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an > example of how to build a scatterlist for memory allocated by vmalloc(). The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the list of pages of the buffer. Just would like to confirm I understand this correctly. My concern with using vmalloc is that the worst case scenario will result in all pages of the buffer being non contiguous. In this case, since the report zones command cannot be split, we would need to limit the allocation to max_segments * page size, and that can be pretty small for some HBAs. Another reason I did not pursue the vmalloc route is that the processing of the report zones reply to transform zone information into struct blkzone is really painful to do with a vmalloced buffer as every page of the buffer needs to be kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a BIO/request, but it was a lot of code for not much to be done. Or is there a more elegant solution for in-kernel mapping of a vmalloc buffer ?
On 6/20/19 5:58 PM, Damien Le Moal wrote: > The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a > vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq > calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the > list of pages of the buffer. Just would like to confirm I understand this correctly. > > My concern with using vmalloc is that the worst case scenario will result in all > pages of the buffer being non contiguous. In this case, since the report zones > command cannot be split, we would need to limit the allocation to max_segments * > page size, and that can be pretty small for some HBAs. > > Another reason I did not pursue the vmalloc route is that the processing of the > report zones reply to transform zone information into struct blkzone is really > painful to do with a vmalloced buffer as every page of the buffer needs to be > kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a > BIO/request, but it was a lot of code for not much to be done. Or is there a > more elegant solution for in-kernel mapping of a vmalloc buffer ? Hi Damien, I don't think that bio_rq_map_kern() works with vmalloc-ed buffers. How about using is_vmalloc_addr() inside scsi_execute_req() to determine whether or not the buffer passed to that function has been allocated with vmalloc()? There may be other scsi_execute_req() callers that can benefit from passing a vmalloc-ed buffer to that function. Regarding the maximum segment size: is mpt3sas still the most popular HBA? Is the maximum segment size 128 for that driver? Is 128 * 4 KB = 512 KB big enough for the report zones buffer? Regarding the loop that calls sd_zbc_parse_report(): are you sure that that kmap()/kunmap() calls would be necessary in that loop? Thanks, Bart.
Bart, On 2019/06/21 12:46, Bart Van Assche wrote: > On 6/20/19 5:58 PM, Damien Le Moal wrote: >> The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a >> vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq >> calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the >> list of pages of the buffer. Just would like to confirm I understand this correctly. >> >> My concern with using vmalloc is that the worst case scenario will result in all >> pages of the buffer being non contiguous. In this case, since the report zones >> command cannot be split, we would need to limit the allocation to max_segments * >> page size, and that can be pretty small for some HBAs. >> >> Another reason I did not pursue the vmalloc route is that the processing of the >> report zones reply to transform zone information into struct blkzone is really >> painful to do with a vmalloced buffer as every page of the buffer needs to be >> kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a >> BIO/request, but it was a lot of code for not much to be done. Or is there a >> more elegant solution for in-kernel mapping of a vmalloc buffer ? > > Hi Damien, > > I don't think that bio_rq_map_kern() works with vmalloc-ed buffers. How > about using is_vmalloc_addr() inside scsi_execute_req() to determine > whether or not the buffer passed to that function has been allocated > with vmalloc()? There may be other scsi_execute_req() callers that can > benefit from passing a vmalloc-ed buffer to that function. Sure, we could do that. But since most (if not all) users of scsi_execute_req() need only a very small buffer, I am not sure if this would be very useful. > Regarding the maximum segment size: is mpt3sas still the most popular > HBA? Is the maximum segment size 128 for that driver? Is 128 * 4 KB = > 512 KB big enough for the report zones buffer? Sure, it works, but compared to the target 1MB allocation that my patch has, that doubles the number of requests needed for reporting the zones of an entire drive. Since this is however not used a lot, I guess it is okay. But still, I do not like very much slowing down revalidate() nor regular blkdev_report_zones() calls... > Regarding the loop that calls sd_zbc_parse_report(): are you sure that > that kmap()/kunmap() calls would be necessary in that loop? No I am not sure. I will check again. Best regards.
On 2019/06/21 0:25, Bart Van Assche wrote: > On 6/19/19 8:48 PM, Damien Le Moal wrote: >> + /* >> + * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE >> + * size (1MB), allowing up to 16383 zone descriptors being reported with >> + * a single command. And make sure that this size does not exceed the >> + * hardware capabilities. To avoid disk revalidation failures due to >> + * memory allocation errors, retry the allocation with a smaller buffer >> + * size if the allocation fails. >> + */ >> + bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE); >> + bufsize = min_t(size_t, bufsize, >> + queue_max_hw_sectors(disk->queue) << 9); >> + for (order = get_order(bufsize); order >= 0; order--) { >> + page = alloc_pages(gfp_mask, order); >> + if (page) { >> + *buflen = PAGE_SIZE << order; >> + return page_address(page); >> + } >> + } > > Hi Damien, > > As you know Linux memory fragmentation tends to increase over time. The > above code has the very unfortunate property that the more memory is > fragmented the smaller the allocated buffer will become. I don't think > that's how kernel code should work. Have you considered to use vmalloc() > + blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an > example of how to build a scatterlist for memory allocated by vmalloc(). Bart, I have a fix along the lines you suggested, but since it modifies bio_rq_map_kern(), I would rather not push that as a bug fix this late in RC. Would you agree to accept the fix patch as is for now and I will send the more complete fix for 5.3 ? Note that this more complete fix also reworks the similar memory allocation for the struct blk_zone array used for zone revalidation. Put all together, the use of report zones uses only vmalloc-ed buffers and data structures, reduces pressure on the memory system and reducing chances of failures. Best regards.
On 6/22/19 12:44 AM, Damien Le Moal wrote: > I have a fix along the lines you suggested, but since it modifies > bio_rq_map_kern(), I would rather not push that as a bug fix this late in RC. > Would you agree to accept the fix patch as is for now and I will send the more > complete fix for 5.3 ? Note that this more complete fix also reworks the similar > memory allocation for the struct blk_zone array used for zone revalidation. Put > all together, the use of report zones uses only vmalloc-ed buffers and data > structures, reduces pressure on the memory system and reducing chances of failures. Hi Damien, I think it's up to Martin to decide how to proceed. Bart.
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 7334024b64f1..37469d77264e 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -103,6 +103,44 @@ static int sd_zbc_do_report_zones(struct scsi_disk *sdkp, unsigned char *buf, return 0; } +/** + * Arbitrary maximum report zones buffer size of 1MB, fitting 16383 x 64B zone + * descriptors plus the 64B report header. + */ +#define SD_ZBC_REPORT_SIZE (16384U * 64U) + +/** + * Allocate a buffer for report zones. + */ +static void *sd_zbc_alloc_report_buffer(struct gendisk *disk, size_t *buflen, + gfp_t gfp_mask) +{ + struct page *page; + size_t bufsize; + int order; + + /* + * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE + * size (1MB), allowing up to 16383 zone descriptors being reported with + * a single command. And make sure that this size does not exceed the + * hardware capabilities. To avoid disk revalidation failures due to + * memory allocation errors, retry the allocation with a smaller buffer + * size if the allocation fails. + */ + bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE); + bufsize = min_t(size_t, bufsize, + queue_max_hw_sectors(disk->queue) << 9); + for (order = get_order(bufsize); order >= 0; order--) { + page = alloc_pages(gfp_mask, order); + if (page) { + *buflen = PAGE_SIZE << order; + return page_address(page); + } + } + + return NULL; +} + /** * sd_zbc_report_zones - Disk report zones operation. * @disk: The target disk @@ -118,9 +156,9 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, gfp_t gfp_mask) { struct scsi_disk *sdkp = scsi_disk(disk); - unsigned int i, buflen, nrz = *nr_zones; + unsigned int i, nrz = *nr_zones; unsigned char *buf; - size_t offset = 0; + size_t buflen, offset = 0; int ret = 0; if (!sd_is_zoned(sdkp)) @@ -128,13 +166,11 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, return -EOPNOTSUPP; /* - * Get a reply buffer for the number of requested zones plus a header, - * without exceeding the device maximum command size. For ATA disks, - * buffers must be aligned to 512B. + * Try to get a buffer that can fits the requested number of zones plus + * the command reply header, all 64B in size. */ - buflen = min(queue_max_hw_sectors(disk->queue) << 9, - roundup((nrz + 1) * 64, 512)); - buf = kmalloc(buflen, gfp_mask); + buflen = (nrz + 1) * 64; + buf = sd_zbc_alloc_report_buffer(disk, &buflen, gfp_mask); if (!buf) return -ENOMEM; @@ -153,7 +189,7 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, *nr_zones = nrz; out_free_buf: - kfree(buf); + free_pages((unsigned long)buf, get_order(buflen)); return ret; }
During disk revalidation done with sd_revalidate(), the zones of a zoned disk zones are checked using the helper function blk_revalidate_disk_zones() if a zone configuration change is detected (change in the number of zones or zone size). The function blk_revalidate_disk_zones() issues report_zones calls that are very large, that is, to obtain zone information for all zones of the disk with a single command. The size of the report zones command buffer necessary for such large request generally is lower than the disk max_hw_sectors and KMALLOC_MAX_SIZE (4MB) but still very large (e.g. aboiut 3.5MB for a 15TB disk with 256MB zones). This large report zones reply buffer allocation with kmalloc succeeds on boot, but frequently fails at run time, especially for a system under memory pressure. This causes the disk revalidation to fail and the disk capacity to be changed to 0. This problem can be avoided with a more intelligent report zones buffer allocation. This patch introduces the arbitrary SD_ZBC_REPORT_SIZE allocation limit of 1MB allowing to fit 16383 zone descriptor for every report zone command execution, thus allowing a full zone report with 4 or 5 commands for most ZBC/ZAC disks today. This limit may be lowered to satisfy the disk max_hw_sectors limit. Furthermore, further reduce the likelyhood of a buffer allocation failure while guaranteeing progress in the zone report by retrying the buffer allocation with a smaller size in case kmalloc() fails. Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> --- drivers/scsi/sd_zbc.c | 54 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 45 insertions(+), 9 deletions(-)