Message ID | 20231109210608.2252323-8-willy@infradead.org (mailing list archive)
---|---
State | New, archived
Series | More buffer_head cleanups
On Thu, Nov 09, 2023 at 09:06:08PM +0000, Matthew Wilcox (Oracle) wrote:
> Replace the shift with a divide, which is probably cheaper than first
> calculating the shift.

No.  The divs are almost certainly more expensive on most CPUs, especially
when doing two of them.  On x86_64 for example, it will be two div
instructions instead of one bsr and two shr instructions.  The two divs are
much more costly.  The block size is always a power of 2; we should take
advantage of that.

- Eric
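A minimal userspace sketch of the two calculations being compared; the helper
names and test values are made up for illustration and this is not the kernel
code:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t block_by_div(uint64_t pos, uint32_t blocksize)
{
        return pos / blocksize;         /* divide: what the patch proposes */
}

static uint64_t block_by_shift(uint64_t pos, uint32_t blocksize)
{
        unsigned int bbits = __builtin_ctz(blocksize);  /* log2 of a power of two */

        return pos >> bbits;            /* shift: what the existing code does */
}

int main(void)
{
        uint64_t pos = 123456789;       /* arbitrary byte offset */

        for (uint32_t bs = 512; bs <= 65536; bs <<= 1)
                assert(block_by_div(pos, bs) == block_by_shift(pos, bs));
        printf("divide and shift agree for power-of-two block sizes\n");
        return 0;
}

Compiled at -O2 on x86_64, the first typically becomes a div instruction and
the second a bsf/tzcnt plus a shift, which is the instruction mix Eric
describes.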
On Thu, Nov 09, 2023 at 08:50:19PM -0800, Eric Biggers wrote:
> On Thu, Nov 09, 2023 at 09:06:08PM +0000, Matthew Wilcox (Oracle) wrote:
> > Replace the shift with a divide, which is probably cheaper than first
> > calculating the shift.
>
> No.  The divs are almost certainly more expensive on most CPUs, especially
> when doing two of them.  On x86_64 for example, it will be two div
> instructions instead of one bsr and two shr instructions.  The two divs are
> much more costly.  The block size is always a power of 2; we should take
> advantage of that.

I just looked it up and it's more expensive than I thought, but I don't see a
huge difference here.  DIV has a 23-cycle latency on Skylake (just to pick a
relatively recent CPU), but it has a reciprocal throughput of 6, so two DIVs
have a latency of 29, not 46, unless the second DIV depends on the result of
the first (it doesn't).

BSF (the ffs instruction) has latency 3.  SHR has latency 4, but the SHR is
going to depend on the output of the BSF, so they necessarily add to 7 cycles
of latency.  SHR has a reciprocal throughput of 2, so BSF,SHR,SHR will be
9 cycles.

So I've cost 20 cycles, which is about the cost of missing in the L1 cache
and hitting in the L2 cache.  That doesn't feel too bad to me.  Would you
want to invest more engineering time in changing it?
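Working those Skylake figures through, assuming the two divides are
independent of each other:

        two DIVs:      23 (first DIV latency) + 6 (reciprocal throughput) = 29 cycles
        BSF,SHR,SHR:    3 (BSF) + 4 (dependent SHR) + 2 (second SHR)      =  9 cycles
        difference:    29 - 9                                             = 20 cycles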
On Fri, Nov 10, 2023 at 02:26:43PM +0000, Matthew Wilcox wrote:
> Would you want to invest more engineering time in changing it?
It seems logical to just keep the existing approach instead of spending time
trying to justify changing it to a less efficient one (divides).
- Eric
On Sat, Nov 11, 2023 at 10:06:13AM -0800, Eric Biggers wrote:
> On Fri, Nov 10, 2023 at 02:26:43PM +0000, Matthew Wilcox wrote:
> > Would you want to invest more engineering time in changing it?
>
> It seems logical to just keep the existing approach instead of spending
> time trying to justify changing it to a less efficient one (divides).

Except the existing approach doesn't work for block size > PAGE_SIZE.
On Sun, Nov 12, 2023 at 04:52:56AM +0000, Matthew Wilcox wrote:
> On Sat, Nov 11, 2023 at 10:06:13AM -0800, Eric Biggers wrote:
> > On Fri, Nov 10, 2023 at 02:26:43PM +0000, Matthew Wilcox wrote:
> > > Would you want to invest more engineering time in changing it?
> >
> > It seems logical to just keep the existing approach instead of spending
> > time trying to justify changing it to a less efficient one (divides).
>
> Except the existing approach doesn't work for block size > PAGE_SIZE

A shift does still work; the block size is still a power of 2, after all.
'(sector_t)folio->index << (PAGE_SHIFT - bbits)' just needs to be changed to
'folio_pos(folio) >> bbits'.

- Eric
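For illustration, a small userspace sketch of the point above: the old
expression's shift count (PAGE_SHIFT - bbits) goes negative once the block
size exceeds the page size, while shifting the byte position right by bbits
stays valid for any power-of-two block size.  PAGE_SHIFT and the sample index
below are assumptions for the demo, not the kernel's values:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12      /* assume 4 KiB pages for this demo */

int main(void)
{
        uint64_t index = 7;                     /* stand-in for folio->index */
        uint64_t pos = index << PAGE_SHIFT;     /* byte offset, like folio_pos() */

        for (unsigned int bbits = 9; bbits <= 14; bbits++) {
                /*
                 * Old form: index << (PAGE_SHIFT - bbits); the shift count is
                 * negative once bbits > PAGE_SHIFT (block size > page size).
                 * New form: shift the byte position right by bbits; it only
                 * needs the block size to be a power of two.
                 */
                uint64_t block = pos >> bbits;

                printf("bbits=%u block=%llu\n", bbits,
                       (unsigned long long)block);
        }
        return 0;
}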
diff --git a/fs/buffer.c b/fs/buffer.c
index ef444ab53a9b..4eb44ccdc6be 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1742,19 +1742,6 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 }
 EXPORT_SYMBOL(clean_bdev_aliases);
 
-/*
- * Size is a power-of-two in the range 512..PAGE_SIZE,
- * and the case we care about most is PAGE_SIZE.
- *
- * So this *could* possibly be written with those
- * constraints in mind (relevant mostly if some
- * architecture has a slow bit-scan instruction)
- */
-static inline int block_size_bits(unsigned int blocksize)
-{
-        return ilog2(blocksize);
-}
-
 static struct buffer_head *folio_create_buffers(struct folio *folio,
                                                 struct inode *inode,
                                                 unsigned int b_state)
@@ -1807,7 +1794,7 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
         sector_t block;
         sector_t last_block;
         struct buffer_head *bh, *head;
-        unsigned int blocksize, bbits;
+        size_t blocksize;
         int nr_underway = 0;
         blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
@@ -1826,10 +1813,9 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
 
         bh = head;
         blocksize = bh->b_size;
-        bbits = block_size_bits(blocksize);
 
-        block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-        last_block = (i_size_read(inode) - 1) >> bbits;
+        block = div_u64(folio_pos(folio), blocksize);
+        last_block = div_u64(i_size_read(inode) - 1, blocksize);
 
         /*
          * Get all the dirty buffers mapped to disk addresses and
@@ -2355,7 +2341,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
         struct inode *inode = folio->mapping->host;
         sector_t iblock, lblock;
         struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
-        unsigned int blocksize, bbits;
+        size_t blocksize;
         int nr, i;
         int fully_mapped = 1;
         bool page_error = false;
@@ -2369,10 +2355,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 
         head = folio_create_buffers(folio, inode, 0);
         blocksize = head->b_size;
-        bbits = block_size_bits(blocksize);
 
-        iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-        lblock = (limit+blocksize-1) >> bbits;
+        iblock = div_u64(folio_pos(folio), blocksize);
+        lblock = div_u64(limit + blocksize - 1, blocksize);
         bh = head;
         nr = 0;
         i = 0;
Both __block_write_full_folio() and block_read_full_folio() assumed that
block size <= PAGE_SIZE.  Replace the shift with a divide, which is probably
cheaper than first calculating the shift.  That lets us remove
block_size_bits() as these were the last callers.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/buffer.c | 27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)