Message ID | 20240607145902.1137853-12-kernel@pankajraghav.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | enable bs > ps in XFS |
On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
>  		igeo->ialloc_align = mp->m_dalign;
>  	else
>  		igeo->ialloc_align = 0;
> +
> +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> +	else
> +		igeo->min_folio_order = 0;
>  }

The minimum folio order isn't really part of the inode (allocation)
geometry, is it?
On Thu, Jun 13, 2024 at 10:47:25AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
> >  		igeo->ialloc_align = mp->m_dalign;
> >  	else
> >  		igeo->ialloc_align = 0;
> > +
> > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > +	else
> > +		igeo->min_folio_order = 0;
> >  }
>
> The minimum folio order isn't really part of the inode (allocation)
> geometry, is it?

I suggested it last time around instead of calculating the same
constant on every inode allocation. We're already storing in-memory
struct xfs_inode allocation init values in this structure, e.g. in
xfs_inode_alloc() we see things like this:

	ip->i_diflags2 = mp->m_ino_geo.new_diflags2;

That way we only calculate the default values for the filesystem once
instead of on every inode allocation. This isn't unreasonable
because xfs_inode_alloc() is a fairly hot path.

The only other place we might store it is the struct xfs_mount, but
given all the inode allocation constants are already in the embedded
mp->m_ino_geo structure, it just seems like a much better idea to
put it with all the other inode allocation constants than dump it
randomly into the struct xfs_mount....

-Dave.
On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > +	else
> > > +		igeo->min_folio_order = 0;
> > >  }
> >
> > The minimum folio order isn't really part of the inode (allocation)
> > geometry, is it?
>
> I suggested it last time around instead of calculating the same
> constant on every inode allocation. We're already storing in-memory
> struct xfs_inode allocation init values in this structure, e.g. in
> xfs_inode_alloc() we see things like this:

While new_diflags2 isn't exactly inode geometry, it at least is part
of the inode allocation. Folio min order for file data has nothing
to do with this at all.

> The only other place we might store it is the struct xfs_mount, but
> given all the inode allocation constants are already in the embedded
> mp->m_ino_geo structure, it just seems like a much better idea to
> put it with all the other inode allocation constants than dump it
> randomly into the struct xfs_mount....

Well, it is very closely related to, say, the m_blockmask field in
struct xfs_mount. Then again, modern CPUs tend to give you a simple
subtraction for free in most pipelines doing other things, so I'm
not really sure it's worth caching for use in inode allocation to
start with, but I don't care strongly about that.
On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > >
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> >
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > struct xfs_inode allocation init values in this structure, e.g. in
> > xfs_inode_alloc() we see things like this:
>
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation. Folio min order for file data has nothing
> to do with this at all.
>
> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it with all the other inode allocation constants than dump it
> > randomly into the struct xfs_mount....
>
> Well, it is very closely related to, say, the m_blockmask field in
> struct xfs_mount. Then again, modern CPUs tend to give you a simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

But there will also be an extra conditional apart from the subtraction,
right? Initially it was something like this:

@@ -73,6 +73,7 @@ xfs_inode_alloc(
 	xfs_ino_t		ino)
 {
 	struct xfs_inode	*ip;
+	int			min_order = 0;
 
 	/*
 	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
@@ -88,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -313,6 +315,7 @@ xfs_reinit_inode(
 	dev_t			dev = inode->i_rdev;
 	kuid_t			uid = inode->i_uid;
 	kgid_t			gid = inode->i_gid;
+	int			min_order = 0;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -323,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 	return error;
 }
 
It does introduce a conditional in the inode allocation hot path, so I
went with what Chinner proposed, as it is something we use when we
initialize an inode.
On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > >
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> >
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > struct xfs_inode allocation init values in this structure, e.g. in
> > xfs_inode_alloc() we see things like this:
>
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation. Folio min order for file data has nothing
> to do with this at all.

Yet ip->i_diflags2 is *not* initialised in xfs_init_new_inode() when
we physically allocate and initialise a new inode. It is set for all
inodes when they are allocated in memory, regardless of their use.
Whether that is the right thing to do or not is a separate issue -
xfs_inode_from_disk() will overwrite it in every inode read case
that isn't a create.

Indeed, we could do the folio order initialisation in
xfs_setup_inode() where we set up the mapping gfp mask, but that
doesn't change the fact we set it up for every inode that is
instantiated in memory or that we want it pre-calculated...

> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it with all the other inode allocation constants than dump it
> > randomly into the struct xfs_mount....
>
> Well, it is very closely related to, say, the m_blockmask field in
> struct xfs_mount.

Not really. The block mask is a property of the filesystem and is used
primarily for manipulating lengths in units of FSB to/from byte
counts and vice versa. It is used all over the place, and the only
guaranteed common structure that all those callers have access to is
the xfs_mount.

OTOH, the folio order is only used for regular files to tell the
page cache how to behave. The scope of the folio order setup is the
same as mapping_set_gfp_mask() - it is only used in one place and
used for inode configuration.

I may have called the structure "inode geometry" because that
described what it contained when I first implemented it, but that
doesn't mean that is all that it can contain. It contains static,
precalculated inode configuration values, and that's what we are
adding here...

> Then again, modern CPUs tend to give you a simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

It's not the cost of a subtraction that is the problem -
precalculation is about avoiding a potential branch misprediction in
a hot path that would stall the CPU. If there were no branches, it
wouldn't be an issue, but this value cannot be calculated without at
least one branch in the logic.

-Dave.
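To make the "subtraction plus a conditional" point concrete, here is a minimal sketch of the clamped computation. The `ex_*` names and `EX_PAGE_SHIFT` are hypothetical and assume 4 KiB pages. The clamp is needed because block sizes smaller than the page size would otherwise give a negative order. Whether a compiler emits a real branch or a conditional move for it depends on the target and optimisation level.

```c
#include <assert.h>

#define EX_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Minimum folio order for a given log2 block size.
 * The subtraction alone is not enough: for blocklog < EX_PAGE_SHIFT
 * the difference is negative, so it has to be clamped to zero.
 */
static int ex_min_folio_order(int blocklog)
{
	int order = blocklog - EX_PAGE_SHIFT;

	return order > 0 ? order : 0;
}

/* Smallest folio size in bytes that the page cache would have to use. */
static unsigned long ex_min_folio_bytes(int blocklog)
{
	assert(blocklog >= 0);
	return 1UL << (EX_PAGE_SHIFT + ex_min_folio_order(blocklog));
}
```

Under these assumptions, 512-byte blocks (`blocklog = 9`) clamp to order 0 and a 4096-byte minimum folio. 64 KiB blocks (`blocklog = 16`) give order 4, i.e. folios of at least 16 pages.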
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..1e76431d75a4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 34f104ed372c..e67a1c7cc0b0 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -231,6 +231,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..5ed3dc9e7d90 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -324,7 +325,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 46cb0384143b..a99454208807 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -135,7 +135,6 @@ xfs_sb_validate_fsb_count(
 	uint64_t		max_index;
 	uint64_t		max_bytes;
 
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..b8a93a8f35ca 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,18 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		if (!xfs_has_crc(mp)) {
+			xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */