Message ID | 20241127054737.33351-1-bharata@amd.com (mailing list archive)
Series     | Large folios in block buffered IO path
On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Recently we discussed the scalability issues while running large
> instances of FIO with the buffered IO option on NVMe block devices here:
>
> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>
> One of the suggestions Chris Mason gave (during private discussions) was
> to enable large folios in the block buffered IO path, as that could
> alleviate the scalability problems and improve the lock contention
> scenarios.

I have no basis to comment on the idea.

However, it is pretty apparent that whatever the situation is, it is being
heavily disfigured by lock contention in blkdev_llseek:

> perf-lock contention output
> ---------------------------
> The lock contention data doesn't look all that conclusive, but for a 30%
> rwmixwrite mix it looks like this:
>
> perf-lock contention default
>  contended   total wait   max wait   avg wait   type       caller
>
> 1337359017      64.69 h  769.04 us  174.14 us   spinlock   rwsem_wake.isra.0+0x42
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>                 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>                 0xffffffff8f39e88f up_write+0x4f
>                 0xffffffff8f9d598e blkdev_llseek+0x4e
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>    2665573      64.38 h     1.98 s   86.95 ms   rwsem:W    blkdev_llseek+0x31
>                 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>                 0xffffffff903f18fb down_write+0x5b
>                 0xffffffff8f9d5971 blkdev_llseek+0x31
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>                 0xffffffff903dce5e do_syscall_64+0x7e
>                 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76

Admittedly I'm not familiar with this code, but at a quick glance the
lock can be just straight up removed here?

static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *bd_inode = bdev_file_inode(file);
	loff_t retval;

	inode_lock(bd_inode);
	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
	inode_unlock(bd_inode);
	return retval;
}

At best the lock stabilizes the size for the duration of the call. That
sounds like it helps nothing: if the size can change, the file offset
will still be altered as if there was no locking.

Suppose grabbing the size cannot be avoided here for whatever reason.

While the above fio invocation did not work for me, I ran some test script
I had in my shell history, and according to strace:

[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271830] lseek(7, 0, SEEK_SET)      = 0

... the lseeks just rewind to the beginning, *definitely* not needing to
know the size. One would have to check, but this is most likely the case
in your test as well.

And for that there is zero need to grab the size, and consequently the
inode lock.
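For context, what fixed_size_llseek() boils down to for the common whence
values is roughly the sketch below. This is a condensed paraphrase of the
fs/read_write.c helpers, not the verbatim source (SEEK_CUR handling and
the vfs_setpos() split are glossed over); the point is that the size is
only an input, and the final f_pos store is a plain store that the inode
lock in blkdev_llseek() does not serialize in any useful way.

static loff_t fixed_size_llseek_sketch(struct file *file, loff_t offset,
				       int whence, loff_t size)
{
	switch (whence) {
	case SEEK_SET:
		break;			/* size used only for bounds check */
	case SEEK_CUR:
		offset += file->f_pos;	/* still no real use of the size */
		break;
	case SEEK_END:
		offset += size;		/* the only true consumer of size */
		break;
	default:
		return -EINVAL;
	}
	if (offset < 0 || offset > size)
		return -EINVAL;
	file->f_pos = offset;		/* plain store, locked or not */
	return offset;
}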
> 134057198      14.27 h   35.93 ms  383.14 us   spinlock   clear_shadow_entries+0x57
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f5e7967 clear_shadow_entries+0x57
>                 0xffffffff8f5e90e3 mapping_try_invalidate+0x163
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>                 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
>                 0xffffffff8f9faea1 blkdev_ioctl+0xc1
>  33351524       1.76 h   35.86 ms  190.43 us   spinlock   __remove_mapping+0x5d
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f5ec71d __remove_mapping+0x5d
>                 0xffffffff8f5f9be6 remove_mapping+0x16
>                 0xffffffff8f5e8f5b mapping_evict_folio+0x7b
>                 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>   9448820      14.96 m    1.54 ms   95.01 us   spinlock   folio_lruvec_lock_irqsave+0x64
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>                 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
>                 0xffffffff8f5e587c folio_batch_move_lru+0x5c
>                 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
>                 0xffffffff8f5e7593 deactivate_file_folio+0x43
>                 0xffffffff8f5e90b7 mapping_try_invalidate+0x137
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>   1488531      11.07 m    1.07 ms  446.39 us   spinlock   try_to_free_buffers+0x56
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f768c76 try_to_free_buffers+0x56
>                 0xffffffff8f5cf647 filemap_release_folio+0x87
>                 0xffffffff8f5e8f4c mapping_evict_folio+0x6c
>                 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>   2556868       6.78 m  474.72 us  159.07 us   spinlock   blkdev_llseek+0x31
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
>                 0xffffffff903f14c4 rwsem_down_write_slowpath+0x274
>                 0xffffffff903f18fb down_write+0x5b
>                 0xffffffff8f9d5971 blkdev_llseek+0x31
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>   2512627       3.75 m  450.96 us   89.55 us   spinlock   blkdev_llseek+0x31
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
>                 0xffffffff903f12f0 rwsem_down_write_slowpath+0xa0
>                 0xffffffff903f18fb down_write+0x5b
>                 0xffffffff8f9d5971 blkdev_llseek+0x31
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>    908184       1.52 m  439.58 us  100.58 us   spinlock   blkdev_llseek+0x31
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
>                 0xffffffff903f1367 rwsem_down_write_slowpath+0x117
>                 0xffffffff903f18fb down_write+0x5b
>                 0xffffffff8f9d5971 blkdev_llseek+0x31
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>       134       1.48 m     1.22 s  663.88 ms   mutex      bdev_release+0x69
>                 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
>                 0xffffffff903ef863 __mutex_lock_slowpath+0x13
>                 0xffffffff903ef8bb mutex_lock+0x3b
>                 0xffffffff8f9d5249 bdev_release+0x69
>                 0xffffffff8f9d5921 blkdev_release+0x11
>                 0xffffffff8f7089f3 __fput+0xe3
>                 0xffffffff8f708c9b __fput_sync+0x1b
>                 0xffffffff8f6fe8ed __x64_sys_close+0x3d
>
>
> perf-lock contention patched
>  contended   total wait   max wait   avg wait   type       caller
>
>   1153627      40.15 h    48.67 s  125.30 ms   rwsem:W    blkdev_llseek+0x31
>                 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>                 0xffffffff903f18fb down_write+0x5b
>                 0xffffffff8f9d5971 blkdev_llseek+0x31
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>                 0xffffffff903dce5e do_syscall_64+0x7e
>                 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
> 276512439      39.19 h   46.90 ms  510.22 us   spinlock   clear_shadow_entries+0x57
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f5e7967 clear_shadow_entries+0x57
>                 0xffffffff8f5e90e3 mapping_try_invalidate+0x163
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>                 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
>                 0xffffffff8f9faea1 blkdev_ioctl+0xc1
> 763119320      26.37 h  887.44 us  124.38 us   spinlock   rwsem_wake.isra.0+0x42
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>                 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>                 0xffffffff8f39e88f up_write+0x4f
>                 0xffffffff8f9d598e blkdev_llseek+0x4e
>                 0xffffffff8f703322 ksys_lseek+0x72
>                 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>                 0xffffffff8f20b983 x64_sys_call+0x1fb3
>  33263910       2.87 h   29.43 ms  310.56 us   spinlock   __remove_mapping+0x5d
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f5ec71d __remove_mapping+0x5d
>                 0xffffffff8f5f9be6 remove_mapping+0x16
>                 0xffffffff8f5e8f5b mapping_evict_folio+0x7b
>                 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>  58671816       2.50 h  519.68 us  153.45 us   spinlock   folio_lruvec_lock_irqsave+0x64
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>                 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
>                 0xffffffff8f5e587c folio_batch_move_lru+0x5c
>                 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
>                 0xffffffff8f5e7593 deactivate_file_folio+0x43
>                 0xffffffff8f5e90b7 mapping_try_invalidate+0x137
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>       284      22.33 m     5.35 s     4.72 s   mutex      bdev_release+0x69
>                 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
>                 0xffffffff903ef863 __mutex_lock_slowpath+0x13
>                 0xffffffff903ef8bb mutex_lock+0x3b
>                 0xffffffff8f9d5249 bdev_release+0x69
>                 0xffffffff8f9d5921 blkdev_release+0x11
>                 0xffffffff8f7089f3 __fput+0xe3
>                 0xffffffff8f708c9b __fput_sync+0x1b
>                 0xffffffff8f6fe8ed __x64_sys_close+0x3d
>   2181469      21.38 m    1.15 ms  587.98 us   spinlock   try_to_free_buffers+0x56
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f768c76 try_to_free_buffers+0x56
>                 0xffffffff8f5cf647 filemap_release_folio+0x87
>                 0xffffffff8f5e8f4c mapping_evict_folio+0x6c
>                 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
>                 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
>                 0xffffffff8f9d3872 invalidate_bdev+0x42
>    454398       4.22 m   37.54 ms  557.13 us   spinlock   __remove_mapping+0x5d
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f5c7f _raw_spin_lock+0x3f
>                 0xffffffff8f5ec71d __remove_mapping+0x5d
>                 0xffffffff8f5f4f04 shrink_folio_list+0xbc4
>                 0xffffffff8f5f5a6b evict_folios+0x34b
>                 0xffffffff8f5f772f try_to_shrink_lruvec+0x20f
>                 0xffffffff8f5f79ef shrink_one+0x10f
>                 0xffffffff8f5fb975 shrink_node+0xb45
>       773       3.53 m     2.60 s  273.76 ms   mutex      __lru_add_drain_all+0x3a
>                 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
>                 0xffffffff903ef863 __mutex_lock_slowpath+0x13
>                 0xffffffff903ef8bb mutex_lock+0x3b
>                 0xffffffff8f5e3d7a __lru_add_drain_all+0x3a
>                 0xffffffff8f5e77a0 lru_add_drain_all+0x10
>                 0xffffffff8f9d3861 invalidate_bdev+0x31
>                 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
>                 0xffffffff8f9faea1 blkdev_ioctl+0xc1
>   1997851       3.09 m  651.65 us   92.83 us   spinlock   folio_lruvec_lock_irqsave+0x64
>                 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>                 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
>                 0xffffffff8f5e587c folio_batch_move_lru+0x5c
>                 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
>                 0xffffffff8f5e5ae4 folio_add_lru+0x54
>                 0xffffffff8f5d075d filemap_add_folio+0xcd
>                 0xffffffff8f5e30c0 page_cache_ra_order+0x220
>
> Observations from perf-lock contention
> --------------------------------------
> - Significant reduction of contention for inode_lock (inode->i_rwsem)
>   from the blkdev_llseek() path.
> - Significant increase in contention for inode->i_lock from the
>   invalidate and remove_mapping paths.
> - Significant increase in contention for the lruvec spinlock from the
>   deactivate_file_folio path.
>
> Request comments on the above. I am specifically looking for inputs
> on these:
>
> - Lock contention results and usefulness of large folios in bringing
>   down the contention in this specific case.
> - If enabling large folios in the block buffered IO path is a feasible
>   approach, inputs on doing this cleanly and correctly.
>
> Bharata B Rao (1):
>   block/ioctl: Add an ioctl to enable large folios for block buffered IO
>     path
>
>  block/ioctl.c           | 8 ++++++++
>  include/uapi/linux/fs.h | 2 ++
>  2 files changed, 10 insertions(+)
>
> --
> 2.34.1
On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> > [snip]
>
> [snip]
>
> ... the lseeks just rewind to the beginning, *definitely* not needing to
> know the size. One would have to check, but this is most likely the case
> in your test as well.
>
> And for that there is zero need to grab the size, and consequently the
> inode lock.
That is to say, bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct inode *bd_inode = bdev_file_inode(file);
-	loff_t retval;
 
-	inode_lock(bd_inode);
-	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-	inode_unlock(bd_inode);
-	return retval;
+	return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }
 
 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).
On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > [snip]
> >
> > And for that there is zero need to grab the size, and consequently the
> > inode lock.
>
> That is to say, bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:

Yeah, I've noticed this in the locking profiles as well and I agree the
bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
get away without using the inode lock in their llseek handler...

								Honza

> [snip diff]
>
> To be aborted if it blows up (but I don't see why it would).
>
> --
> Mateusz Guzik <mjguzik gmail.com>
On Wed, Nov 27, 2024 at 01:02:35PM +0100, Jan Kara wrote:
> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> > On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > [snip]
> > > And for that there is zero need to grab the size, and consequently
> > > the inode lock.
> >
> > That is to say, bare minimum this needs to be benchmarked before/after
> > with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree the
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using the inode lock in their llseek handler...

nod. This should be removed.
On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> [snip]
>>
>> ... the lseeks just rewind to the beginning, *definitely* not needing to
>> know the size. One would have to check, but this is most likely the case
>> in your test as well.
>>
>> And for that there is zero need to grab the size, and consequently the
>> inode lock.
Here is the complete FIO cmdline I am using:

fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=64k
-numjobs=1 -runtime=3600 --time_based -group_reporting -name=mytest

And that results in lseek patterns like these:

lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 131072, SEEK_SET)              = 131072
lseek(6, 65536, SEEK_SET)               = 65536
lseek(6, 196608, SEEK_SET)              = 196608
lseek(6, 131072, SEEK_SET)              = 131072
lseek(6, 393216, SEEK_SET)              = 393216
lseek(6, 196608, SEEK_SET)              = 196608
lseek(6, 458752, SEEK_SET)              = 458752
lseek(6, 262144, SEEK_SET)              = 262144
lseek(6, 1114112, SEEK_SET)             = 1114112

The lseeks are interspersed with read and write calls.

>
> That is to say, bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:
>
> [snip diff]
>
> To be aborted if it blows up (but I don't see why it would).

Thanks for this fix, will try and get back with results.

Regards,
Bharata.
On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> > That is to say, bare minimum this needs to be benchmarked before/after
> > with the lock removed from the picture, like so:
> >
> > [snip diff]
> >
> > To be aborted if it blows up (but I don't see why it would).
>
> Thanks for this fix, will try and get back with results.

Please make sure to have results just with this change, no messing with
folio sizes, so that I have something for the patch submission.
On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@amd.com> wrote:
>> [snip]
>>
>> Thanks for this fix, will try and get back with results.
>
> Please make sure to have results just with this change, no messing with
> folio sizes, so that I have something for the patch submission.

The contention with inode_lock is gone after your above changes. The new
top 10 contention data looks like this now:

 contended   total wait   max wait   avg wait   type       caller

2441494015     172.15 h    1.72 ms  253.83 us   spinlock   folio_wait_bit_common+0xd5
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
                0xffffffffacdd1905 folio_wait_bit_common+0xd5
                0xffffffffacdd2d0a filemap_get_pages+0x68a
                0xffffffffacdd2e73 filemap_read+0x103
                0xffffffffad1d67ba blkdev_read_iter+0x6a
                0xffffffffacf06937 vfs_read+0x297
                0xffffffffacf07653 ksys_read+0x73
  25269947       1.58 h    1.72 ms  225.44 us   spinlock   folio_wake_bit+0x62
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
                0xffffffffacdcf322 folio_wake_bit+0x62
                0xffffffffacdd2ca7 filemap_get_pages+0x627
                0xffffffffacdd2e73 filemap_read+0x103
                0xffffffffad1d67ba blkdev_read_iter+0x6a
                0xffffffffacf06937 vfs_read+0x297
                0xffffffffacf07653 ksys_read+0x73
  44757761       1.05 h    1.55 ms   84.41 us   spinlock   folio_wake_bit+0x62
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
                0xffffffffacdcf322 folio_wake_bit+0x62
                0xffffffffacdcf7bc folio_end_read+0x2c
                0xffffffffacf6d4cf mpage_read_end_io+0x6f
                0xffffffffad1d8abb bio_endio+0x12b
                0xffffffffad1f07bd blk_mq_end_request_batch+0x12d
                0xffffffffc05e4e9b nvme_pci_complete_batch+0xbb
  66419830      53.00 m  685.29 us   47.88 us   spinlock   __filemap_add_folio+0x14c
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
                0xffffffffacdd037c __filemap_add_folio+0x14c
                0xffffffffacdd072c filemap_add_folio+0x9c
                0xffffffffacde2a4c page_cache_ra_unbounded+0x12c
                0xffffffffacde31e0 page_cache_ra_order+0x340
                0xffffffffacde33b8 page_cache_sync_ra+0x148
                0xffffffffacdd27c5 filemap_get_pages+0x145
       532      45.51 m     5.49 s     5.13 s   mutex      bdev_release+0x69
                0xffffffffadbef1de __mutex_lock.constprop.0+0x17e
                0xffffffffadbef863 __mutex_lock_slowpath+0x13
                0xffffffffadbef8bb mutex_lock+0x3b
                0xffffffffad1d5249 bdev_release+0x69
                0xffffffffad1d5921 blkdev_release+0x11
                0xffffffffacf089f3 __fput+0xe3
                0xffffffffacf08c9b __fput_sync+0x1b
                0xffffffffacefe8ed __x64_sys_close+0x3d
 264989621      45.26 m  486.17 us   10.25 us   spinlock   raw_spin_rq_lock_nested+0x1d
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf5c7f _raw_spin_lock+0x3f
                0xffffffffacb4f22d raw_spin_rq_lock_nested+0x1d
                0xffffffffacb5b601 _raw_spin_rq_lock_irqsave+0x21
                0xffffffffacb6f81b sched_balance_rq+0x60b
                0xffffffffacb70784 sched_balance_newidle+0x1f4
                0xffffffffacb70ae4 pick_next_task_fair+0x54
                0xffffffffacb4e583 __pick_next_task+0x43
  30026842      26.05 m  652.85 us   52.05 us   spinlock   __folio_mark_dirty+0x2b
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
                0xffffffffacde189b __folio_mark_dirty+0x2b
                0xffffffffacf67503 mark_buffer_dirty+0x53
                0xffffffffacf676e2 __block_commit_write+0x82
                0xffffffffacf685fc block_write_end+0x3c
                0xffffffffacfb577e iomap_write_end+0x13e
                0xffffffffacfb674c iomap_file_buffered_write+0x29c
   5085820       7.07 m    1.24 ms   83.42 us   spinlock   folio_wake_bit+0x62
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
                0xffffffffacdcf322 folio_wake_bit+0x62
                0xffffffffacdcf7bc folio_end_read+0x2c
                0xffffffffacf6d4cf mpage_read_end_io+0x6f
                0xffffffffad1d8abb bio_endio+0x12b
                0xffffffffad1ee8ca blk_update_request+0x1aa
                0xffffffffad1ef7a4 blk_mq_end_request+0x24
   8027370       1.90 m   76.21 us   14.23 us   spinlock   tick_do_update_jiffies64+0x29
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf5c7f _raw_spin_lock+0x3f
                0xffffffffacc21c69 tick_do_update_jiffies64+0x29
                0xffffffffacc21e8c tick_nohz_handler+0x13c
                0xffffffffacc0a0cf __hrtimer_run_queues+0x10f
                0xffffffffacc0afe8 hrtimer_interrupt+0xf8
                0xffffffffacaa8f36 __sysvec_apic_timer_interrupt+0x56
                0xffffffffadbe3953 sysvec_apic_timer_interrupt+0x93
   4344269       1.08 m  600.67 us   14.90 us   spinlock   __filemap_add_folio+0x14c
                0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
                0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
                0xffffffffacdd037c __filemap_add_folio+0x14c
                0xffffffffacdd072c filemap_add_folio+0x9c
                0xffffffffacde2a4c page_cache_ra_unbounded+0x12c
                0xffffffffacde31e0 page_cache_ra_order+0x340
                0xffffffffacde3615 page_cache_async_ra+0x165
                0xffffffffacdd2b9c filemap_get_pages+0x51c

However, a point of concern is that FIO bandwidth comes down drastically
after the change.

                 default                    inode_lock-fix
rw=30%
Instance 1       r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
Instance 2       r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
Instance 3       r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
Instance 4       r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s

Regards,
Bharata.
On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
> However, a point of concern is that FIO bandwidth comes down drastically
> after the change.
>
>                  default                    inode_lock-fix
> rw=30%
> Instance 1       r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
> Instance 2       r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
> Instance 3       r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
> Instance 4       r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s

Something this dramatic usually only happens when you enable a debugging
option. Can you recheck that you're running both A and B with the same
debugging options both compiled in, and enabled?
On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@amd.com> wrote:
>
> The contention with inode_lock is gone after your above changes. The new
> top 10 contention data looks like this now:
>
>  contended   total wait   max wait   avg wait   type       caller
>
> 2441494015     172.15 h    1.72 ms  253.83 us   spinlock   folio_wait_bit_common+0xd5
>                 0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
>                 0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
>                 0xffffffffacdd1905 folio_wait_bit_common+0xd5
>                 0xffffffffacdd2d0a filemap_get_pages+0x68a
>                 0xffffffffacdd2e73 filemap_read+0x103
>                 0xffffffffad1d67ba blkdev_read_iter+0x6a
>                 0xffffffffacf06937 vfs_read+0x297
>                 0xffffffffacf07653 ksys_read+0x73
[snip]
> However, a point of concern is that FIO bandwidth comes down drastically
> after the change.

Nicely put :)

>                  default                    inode_lock-fix
> rw=30%
> Instance 1       r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
> Instance 2       r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
> Instance 3       r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
> Instance 4       r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s

This means that the folio waiting machinery has poor scalability, but
without digging into it I have no idea what can be done. The easy way out
would be to speculatively spin before buggering off to sleep, but one
would have to check what happens in real workloads -- presumably the lock
owner can be off CPU for a long time (and I presume there is no way to
store the owner).

The now-removed lock uses rwsems, which behave better when contested, and
it was pulling contention away from the folios, artificially *helping*
performance by having the folio bottleneck be exercised less.

The right thing to do in the long run is still to whack the llseek lock
acquire, but in light of the above it can probably wait for better times.
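The "speculatively spin" idea could look roughly like the sketch below.
Illustration only: folio_spin_before_wait() is a made-up name, nothing
like it exists in mm/filemap.c, and as noted above there is no owner
field in the folio, so the spinner has no way to bail out early when the
lock holder has been preempted off CPU.

static bool folio_spin_before_wait(struct folio *folio)
{
	unsigned int spins = 128;	/* arbitrary budget */

	while (spins--) {
		if (!folio_test_locked(folio))
			return true;	/* released while we spun */
		cpu_relax();
	}
	return false;			/* give up, queue up and sleep */
}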
On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@amd.com> wrote:
> > [snip contention data]
>
> This means that the folio waiting machinery has poor scalability, but
> without digging into it I have no idea what can be done.

Willy mentioned the folio wait queue hash table could be grown; you can
find it in mm/filemap.c:

#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

Can you collect off-CPU time?

offcputime-bpfcc -K > /tmp/out

On Debian this ships with the bpfcc-tools package.
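The experiment being asked for amounts to a one-line edit of the snippet
above; 16 here is just one of the values (10, 14, 16 and 20) that get
benchmarked later in the thread, nothing settles on a final number:

/* mm/filemap.c: grow the folio wait table from 2^8 = 256 buckets */
#define PAGE_WAIT_TABLE_BITS 16
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)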
On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
>> However, a point of concern is that FIO bandwidth comes down drastically
>> after the change.
>>
>> [snip bandwidth table]
>
> Something this dramatic usually only happens when you enable a debugging
> option. Can you recheck that you're running both A and B with the same
> debugging options both compiled in, and enabled?

It is the same kernel tree with and w/o Mateusz's inode_lock changes to
block/fops.c. I see the config remains the same for both builds.

Let me get a run for both the base and patched cases w/o running perf
lock contention, to check if that makes a difference.

Regards,
Bharata.
On Thu, Nov 28, 2024 at 05:22:41AM +0100, Mateusz Guzik wrote:
> This means that the folio waiting machinery has poor scalability, but
> without digging into it I have no idea what can be done. The easy way

Actually the easy way is to change:

#define PAGE_WAIT_TABLE_BITS 8

to a larger number.

> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off CPU for a long time (and I presume there is no way to
> store the owner).

So ...

 - There's no space in struct folio to put a rwsem.
 - But we want to be able to sleep waiting for a folio to (eg) do I/O.

This is the solution we have. For the read case, there are three
important bits in folio->flags to pay attention to:

 - PG_locked.   This is held during the read.
 - PG_uptodate. This is set if the read succeeded.
 - PG_waiters.  This is set if anyone is waiting for PG_locked [*]

The first thread comes along, allocates a folio, locks it, inserts it
into the mapping. The second thread comes along, finds the folio, sees
it's !uptodate, sets the waiter bit, adds itself to the waitqueue. The
third thread, ditto.

The read completes. In interrupt or maybe softirq context, the BIO
completion sets the uptodate bit, clears the locked bit and tests the
waiter bit. Since the waiter bit is set, it walks the waitqueue looking
for waiters which match the locked bit and folio (see folio_wake_bit()).

So there's not _much_ of a thundering herd problem here. Most likely the
waitqueue is just too damn long with a lot of threads waiting for I/O.

[*] oversimplification; don't worry about it
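In code, the protocol described above looks roughly like the sketch
below. It is heavily simplified from mm/filemap.c's folio_wait_bit_common()
and folio_end_read(): the real code also handles exclusive waiters,
killable waits and custom wake functions that match on folio and bit, and
the _sketch names are not real kernel functions.

/* Waiter side: a reader that found the folio locked and !uptodate. */
static void folio_wait_read_sketch(struct folio *folio)
{
	wait_queue_head_t *q = folio_waitqueue(folio);	/* hashed, shared */
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(q, &wait, TASK_UNINTERRUPTIBLE);
		/* PG_waiters: tell the completion path a wakeup is needed. */
		folio_set_waiters(folio);
		if (!folio_test_locked(folio))
			break;		/* PG_locked cleared: read is done */
		io_schedule();		/* sleep until folio_wake_bit() */
	}
	finish_wait(q, &wait);
}

/* Completion side: roughly what the BIO completion ends up doing. */
static void folio_end_read_sketch(struct folio *folio, bool success)
{
	if (success)
		folio_mark_uptodate(folio);	/* PG_uptodate */
	folio_unlock(folio);	/* clears PG_locked, wakes PG_waiters */
}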
Jan Kara <jack@suse.cz> writes:

> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > [snip]
>> >
>> > And for that there is zero need to grab the size, and consequently
>> > the inode lock.
>>
>> That is to say, bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree the
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using the inode lock in their llseek handler...

Right, we don't need an inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for the 32-bit SMP case,
due to the use of seqcounts. I guess it would be good to add this to
Documentation too, rather than having the info just hang on top of
i_size_write()?

References
==========
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/

-ritesh
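For reference, the logic being referred to in [2] is essentially the
following paraphrase of i_size_read() from include/linux/fs.h: only the
32-bit SMP variant needs the seqcount (a loff_t load is not atomic
there), which is why i_size_write() carries the serialization requirement
while lockless readers are fine.

static inline loff_t i_size_read_sketch(const struct inode *inode)
{
#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
	loff_t i_size;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&inode->i_size_seqcount);
		i_size = inode->i_size;
	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
	return i_size;
#else
	/* 64-bit (and 32-bit UP, modulo preemption details): plain load. */
	return inode->i_size;
#endif
}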
On 28-Nov-24 10:07 AM, Bharata B Rao wrote:
> On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
>> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
>>> However, a point of concern is that FIO bandwidth comes down drastically
>>> after the change.
>>>
>>> [snip bandwidth table]
>>
>> Something this dramatic usually only happens when you enable a debugging
>> option. Can you recheck that you're running both A and B with the same
>> debugging options both compiled in, and enabled?
>
> It is the same kernel tree with and w/o Mateusz's inode_lock changes to
> block/fops.c. I see the config remains the same for both builds.
>
> Let me get a run for both the base and patched cases w/o running perf
> lock contention, to check if that makes a difference.

Without perf lock contention:

                 default                    inode_lock-fix
rw=30%
Instance 1       r=54.6GiB/s,w=23.4GiB/s    r=11.4GiB/s,w=4992MiB/s
Instance 2       r=52.7GiB/s,w=22.6GiB/s    r=11.4GiB/s,w=4981MiB/s
Instance 3       r=53.3GiB/s,w=22.8GiB/s    r=12.7GiB/s,w=5575MiB/s
Instance 4       r=37.7GiB/s,w=16.2GiB/s    r=10.4GiB/s,w=4581MiB/s

Regards,
Bharata.
On Thu, Nov 28, 2024 at 12:24 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 28-Nov-24 10:07 AM, Bharata B Rao wrote:
> > [snip]
> >
> > Let me get a run for both the base and patched cases w/o running perf
> > lock contention, to check if that makes a difference.
>
> Without perf lock contention:
>
>                  default                    inode_lock-fix
> rw=30%
> Instance 1       r=54.6GiB/s,w=23.4GiB/s    r=11.4GiB/s,w=4992MiB/s
> Instance 2       r=52.7GiB/s,w=22.6GiB/s    r=11.4GiB/s,w=4981MiB/s
> Instance 3       r=53.3GiB/s,w=22.8GiB/s    r=12.7GiB/s,w=5575MiB/s
> Instance 4       r=37.7GiB/s,w=16.2GiB/s    r=10.4GiB/s,w=4581MiB/s

Per my other e-mail, can you follow Willy's suggestion and increase the
hash? Best case scenario, this takes care of it, and then a heuristic can
be added for how to autosize the thing.

If someone feels like microoptimizing, I also note there is magic infra
to have the size hotpatched into the generated asm instead of it being
read at runtime (see the dentry cache for an example user).
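One shape the autosizing heuristic could take is boot-time scaling via
alloc_large_system_hash(), the helper used for the dentry and inode hash
tables. This is a hypothetical sketch: today folio_wait_table is a
fixed-size static array, folio_wait_table_init() does not exist, and the
scale and limit parameters below are arbitrary.

static wait_queue_head_t *folio_wait_table __ro_after_init;
static unsigned int folio_wait_table_bits __ro_after_init;

static void __init folio_wait_table_init(void)
{
	unsigned int i;

	folio_wait_table = alloc_large_system_hash("folio-wait",
					sizeof(wait_queue_head_t),
					0,	/* scale on total memory */
					18,	/* rough scale factor */
					HASH_EARLY,
					&folio_wait_table_bits, NULL,
					1 << 8,	/* today's size: 256 */
					0);
	for (i = 0; i < (1U << folio_wait_table_bits); i++)
		init_waitqueue_head(&folio_wait_table[i]);
}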
On 29-Nov-24 5:01 AM, Mateusz Guzik wrote:
> On Thu, Nov 28, 2024 at 12:24 PM Bharata B Rao <bharata@amd.com> wrote:
>> [snip]
>>
>> Without perf lock contention:
>>
>>                  default                    inode_lock-fix
>> rw=30%
>> Instance 1       r=54.6GiB/s,w=23.4GiB/s    r=11.4GiB/s,w=4992MiB/s
>> Instance 2       r=52.7GiB/s,w=22.6GiB/s    r=11.4GiB/s,w=4981MiB/s
>> Instance 3       r=53.3GiB/s,w=22.8GiB/s    r=12.7GiB/s,w=5575MiB/s
>> Instance 4       r=37.7GiB/s,w=16.2GiB/s    r=10.4GiB/s,w=4581MiB/s
>
> Per my other e-mail, can you follow Willy's suggestion and increase the
> hash?

With Mateusz's inode_lock fix and PAGE_WAIT_TABLE_BITS values of 10, 14,
16 and 20 (the two values given for each instance below are FIO READ bw
and WRITE bw):

                 10          14          16          20
rw=30%
Instance 1       11.3GiB/s   14.2GiB/s   14.8GiB/s   14.9GiB/s
                 4965MiB/s   6225MiB/s   6487MiB/s   6552MiB/s
Instance 2       12.3GiB/s   10.4GiB/s   10.9GiB/s   11.0GiB/s
                 5389MiB/s   4548MiB/s   4770MiB/s   4815MiB/s
Instance 3       11.1GiB/s   12.3GiB/s   11.2GiB/s   13.5GiB/s
                 4864MiB/s   5410MiB/s   4923MiB/s   5927MiB/s
Instance 4       12.3GiB/s   13.7GiB/s   13.0GiB/s   11.4GiB/s
                 5404MiB/s   6004MiB/s   5689MiB/s   5007MiB/s

The number of hash buckets doesn't seem to matter all that much in this
case.

Regards,
Bharata.