Message ID | 20230306120127.21375-4-hare@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | brd: Allow to change block sizes | expand |
On Mon, Mar 06, 2023 at 01:01:25PM +0100, Hannes Reinecke wrote: > Add a module option 'rd_blksize' to allow the user to change > the sector size of the RAM disks. > > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/block/brd.c | 30 +++++++++++++++++++----------- > 1 file changed, 19 insertions(+), 11 deletions(-) > > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index fc41ea641dfb..11bac3c3f1b6 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -46,6 +46,8 @@ struct brd_device { > spinlock_t brd_lock; > struct radix_tree_root brd_folios; > u64 brd_nr_folios; > + unsigned int brd_sector_shift; > + unsigned int brd_sector_size; > }; Why not just do this first and initialize this to the defaults set without the module parameter. Then you don't need to declare over and over unsigned int rd_sectors, etc in tons of routines and just pass the brd which most routines get. Then most of this and the prior patch can be squeezed into one. The functional changes would be the addition of the module parameter. > @@ -410,7 +418,7 @@ static int brd_alloc(int i) > * otherwise fdisk will align on 1M. Regardless this call > * is harmless) > */ > - blk_queue_physical_block_size(disk->queue, PAGE_SIZE); > + blk_queue_physical_block_size(disk->queue, rd_blksize); And this was added for DAX support, but DAX support was long removed from brd see commit 7a862fbbdec6 ("brd: remove dax support"). This nugget was added by Boaz via commit c8fa31730fc72a8 ("brd: Request from fdisk 4k alignment"), but if we don't support DAX, can't we just kill that line? Current doc for blk_queue_physical_block_size() says: * This should be set to the lowest possible sector size that the * hardware can operate on without reverting to read-modify-write * operations. But since we're working directly with RAM, do we care? The comment above that line referring to direct_access should be killed at the very least. While you're at it, can you then also update Documentation/filesystems/dax.rst to remove the brd as an example driver with DAX support. That leaves us with only a few block drivers with DAX: - dcssblk: s390 dcss block device driver - pmem: NVDIMM persistent memory driver - some dm targets - fuse virtio_fs Wonder which is the right example these days for DAX, now that persistant memory has End of Life'd with Optane dead, curious also of the value of the above and DAX in general other than support for whatever made it out. Should we EOL DAX too? Luis
On Wed, Mar 08, 2023 at 07:12:40PM -0800, Luis Chamberlain wrote: > On Mon, Mar 06, 2023 at 01:01:25PM +0100, Hannes Reinecke wrote: > > Add a module option 'rd_blksize' to allow the user to change > > the sector size of the RAM disks. > > > > Signed-off-by: Hannes Reinecke <hare@suse.de> > > --- > > drivers/block/brd.c | 30 +++++++++++++++++++----------- > > 1 file changed, 19 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > > index fc41ea641dfb..11bac3c3f1b6 100644 > > --- a/drivers/block/brd.c > > +++ b/drivers/block/brd.c > > @@ -46,6 +46,8 @@ struct brd_device { > > spinlock_t brd_lock; > > struct radix_tree_root brd_folios; > > u64 brd_nr_folios; > > + unsigned int brd_sector_shift; > > + unsigned int brd_sector_size; > > }; > > Why not just do this first and initialize this to the defaults set > without the module parameter. Then you don't need to declare over > and over unsigned int rd_sectors, etc in tons of routines and just > pass the brd which most routines get. > > Then most of this and the prior patch can be squeezed into one. > > The functional changes would be the addition of the module parameter. > > > @@ -410,7 +418,7 @@ static int brd_alloc(int i) > > * otherwise fdisk will align on 1M. Regardless this call > > * is harmless) > > */ > > - blk_queue_physical_block_size(disk->queue, PAGE_SIZE); > > + blk_queue_physical_block_size(disk->queue, rd_blksize); > > And this was added for DAX support, but DAX support was long removed > from brd see commit 7a862fbbdec6 ("brd: remove dax support"). This > nugget was added by Boaz via commit c8fa31730fc72a8 ("brd: Request from > fdisk 4k alignment"), but if we don't support DAX, can't we just kill > that line? > > Current doc for blk_queue_physical_block_size() says: > > * This should be set to the lowest possible sector size that the > * hardware can operate on without reverting to read-modify-write > * operations. > > But since we're working directly with RAM, do we care? > > The comment above that line referring to direct_access should be killed > at the very least. I was curious whether or not Martin's comments about "mkfs already considers the reported queue limits (for the filesystems most people use, anyway)" applies to say XFS [0] and whether or not the above helps guide mkfs that way, so I took a look now. [0] https://lore.kernel.org/all/yq1ttytbox9.fsf@ca-mkp.ca.oracle.com/ There's a default struct mkfs_default_params which sets the blocksize to 1 << XFS_DFL_BLOCKSIZE_LOG and so at least validate_blocksize() will use that if no parameter was passed on the cli. We also set the default sector size to XFS_MIN_SECTORSIZE. There are two aspects to what we use for the final mkfs, first we use validate_blocksize(), then validate_sectorsize(). The first will set the default block size to 1 << XFS_DFL_BLOCKSIZE_LOG (4096 bytes) if the user did not specify anything. Then validate_sectorsize() uses get_topology() to get the physical / logical block sizes and both: * blkid_topology_get_minimum_io_size() * blkid_topology_get_optimal_io_size() These come from utilil-linux: https://github.com/util-linux/util-linux.git https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block So that's just /sys/block/<disk>/queue/physical_block_size /sys/block/<disk>/queue/logical_block_size Then: /sys/block/<disk>/queue/minimum_io_size The documentation suggests " For disk drives this is often the physical block size." /sys/block/<disk>/queue/optimal_io_size The documentation suggests this is "This is rarely reported for disk drives." From my review of xfs's mkfs is we essentially use the physical block size as a default sector size if set, otherwise we use the device's logical block size if set otherwise xfsprog's default and so 4096. And so the above comment referring to DAX could be killed and replaced with allowing userspace mkfs utils to use a more appropriate block size. Or just kill the comment all together and we update the docs for blk_queue_physical_block_size() instead. > While you're at it, can you then also update Documentation/filesystems/dax.rst > to remove the brd as an example driver with DAX support. > > That leaves us with only a few block drivers with DAX: > > - dcssblk: s390 dcss block device driver > - pmem: NVDIMM persistent memory driver > - some dm targets > - fuse virtio_fs > > Wonder which is the right example these days for DAX, now that > persistant memory has End of Life'd with Optane dead, curious also of > the value of the above and DAX in general other than support for > whatever made it out. Looking at this: https://lore.kernel.org/all/62ef05515b085_1b3c29434@dwillia2-xfh.jf.intel.com.notmuch/ > Should we EOL DAX too? I think the answer is no as a generic kernel thing, however, it probably means we should really update the DAX documentation to reflect what our real expecations are for its current and future uses. And that is to highlight the non-PMEM use cases as *those* probably are EOL'd now. Luis
Luis, > /sys/block/<disk>/queue/minimum_io_size > > The documentation suggests " For disk drives this is often the > physical block size." > > /sys/block/<disk>/queue/optimal_io_size > > The documentation suggests this is "This is rarely reported for disk > drives." min_io and opt_io are used to key mkfs.xfs' sunit/swidth. So if you're using a hardware RAID, MD, or DM, we'll attempt to align allocations on stripe boundaries. Back when that "rarely reported" blurb was written (2009), we did not have any individual disk drives which reported min_io/opt_io. Reporting those parameters was mostly a storage array thing. These days it's fairly common for both disk drives and SSDs to fill out these fields. > From my review of xfs's mkfs is we essentially use the physical block > size as a default sector size if set, otherwise we use the device's > logical block size if set otherwise xfsprog's default and so 4096. Yep.
On Mon, Mar 20, 2023 at 07:40:44PM -0400, Martin K. Petersen wrote: > > Luis, > > > /sys/block/<disk>/queue/minimum_io_size > > > > The documentation suggests " For disk drives this is often the > > physical block size." > > > > /sys/block/<disk>/queue/optimal_io_size > > > > The documentation suggests this is "This is rarely reported for disk > > drives." > > min_io and opt_io are used to key mkfs.xfs' sunit/swidth. So if you're > using a hardware RAID, MD, or DM, we'll attempt to align allocations on > stripe boundaries. > > Back when that "rarely reported" blurb was written (2009), we did not > have any individual disk drives which reported min_io/opt_io. Reporting > those parameters was mostly a storage array thing. These days it's > fairly common for both disk drives and SSDs to fill out these fields. Glad you mentioned this, I followed up in my review of these and I see even the names, swidth, sunit, are all "stripe" indicative. Based on what you are saying, it seems we may need to update docs to reflect actual / new uses. > > From my review of xfs's mkfs is we essentially use the physical block > > size as a default sector size if set, otherwise we use the device's > > logical block size if set otherwise xfsprog's default and so 4096. > > Yep. The use case for swidth / sunit on mkfs.xfs seemed pretty tied to striping, and it was not obvious or clear / if using it for hints could be used today as perhaps intended. At least all the naming and validation stuff seems to make it very "stripy" still. Luis
diff --git a/drivers/block/brd.c b/drivers/block/brd.c index fc41ea641dfb..11bac3c3f1b6 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -30,7 +30,7 @@ /* * Each block ramdisk device has a radix_tree brd_folios of folios that stores * the folios containing the block device's contents. A brd folio's ->index is - * its offset in PAGE_SIZE units. This is similar to, but in no way connected + * its offset in brd_blksize units. This is similar to, but in no way connected * with, the kernel's pagecache or buffer cache (which sit above our block * device). */ @@ -46,6 +46,8 @@ struct brd_device { spinlock_t brd_lock; struct radix_tree_root brd_folios; u64 brd_nr_folios; + unsigned int brd_sector_shift; + unsigned int brd_sector_size; }; /* @@ -55,7 +57,7 @@ static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector) { pgoff_t idx; struct folio *folio; - unsigned int rd_sector_shift = PAGE_SHIFT - SECTOR_SHIFT; + unsigned int rd_sector_shift = brd->brd_sector_shift - SECTOR_SHIFT; /* * The folio lifetime is protected by the fact that we have opened the @@ -85,8 +87,8 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp) { pgoff_t idx; struct folio *folio; - unsigned int rd_sector_shift = PAGE_SHIFT - SECTOR_SHIFT; - unsigned int rd_sector_order = get_order(PAGE_SIZE); + unsigned int rd_sector_shift = brd->brd_sector_shift - SECTOR_SHIFT; + unsigned int rd_sector_order = get_order(brd->brd_sector_size); int ret = 0; folio = brd_lookup_folio(brd, sector); @@ -170,9 +172,9 @@ static void brd_free_folios(struct brd_device *brd) static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n, gfp_t gfp) { - unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT; + unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT; unsigned int rd_sectors = 1 << rd_sectors_shift; - unsigned int rd_sector_size = PAGE_SIZE; + unsigned int rd_sector_size = brd->brd_sector_size; unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT; size_t copy; int ret; @@ -196,9 +198,9 @@ static void copy_to_brd(struct brd_device *brd, const void *src, { struct folio *folio; void *dst; - unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT; + unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT; unsigned int rd_sectors = 1 << rd_sectors_shift; - unsigned int rd_sector_size = PAGE_SIZE; + unsigned int rd_sector_size = brd->brd_sector_size; unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT; size_t copy; @@ -231,9 +233,9 @@ static void copy_from_brd(void *dst, struct brd_device *brd, { struct folio *folio; void *src; - unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT; + unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT; unsigned int rd_sectors = 1 << rd_sectors_shift; - unsigned int rd_sector_size = PAGE_SIZE; + unsigned int rd_sector_size = brd->brd_sector_size; unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT; size_t copy; @@ -346,6 +348,10 @@ static int max_part = 1; module_param(max_part, int, 0444); MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices"); +static unsigned int rd_blksize = PAGE_SIZE; +module_param(rd_blksize, uint, 0444); +MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes."); + MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); MODULE_ALIAS("rd"); @@ -382,6 +388,8 @@ static int brd_alloc(int i) return -ENOMEM; brd->brd_number = i; list_add_tail(&brd->brd_list, &brd_devices); + brd->brd_sector_shift = ilog2(rd_blksize); + brd->brd_sector_size = rd_blksize; spin_lock_init(&brd->brd_lock); INIT_RADIX_TREE(&brd->brd_folios, GFP_ATOMIC); @@ -410,7 +418,7 @@ static int brd_alloc(int i) * otherwise fdisk will align on 1M. Regardless this call * is harmless) */ - blk_queue_physical_block_size(disk->queue, PAGE_SIZE); + blk_queue_physical_block_size(disk->queue, rd_blksize); /* Tell the block layer that this is not a rotational device */ blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
Add a module option 'rd_blksize' to allow the user to change the sector size of the RAM disks. Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/block/brd.c | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-)