Message ID | 20221020032024.1804535-1-yury.norov@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Switch ARM to generic find_bit() API | expand |
On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote: > Hi Russell, all, > > I'd like to respin a patch that switches ARM to generic find_bit() > functions. > > Generic code works on par with arch or better, according to my > testing [1], and with recent improvements merged in v6.1, it should > be even faster. > > ARM already uses many generic find_bit() functions - those that it > doesn't implement. So we are talking about migrating a subset of the > API; most of find_bit() family has only generic implementation on ARM. > > The only concern about this migration is that ARM code supports > byte-aligned bitmap addresses, while generic code is optimized for > word-aligned bitmaps. > > In my practice, I've never seen unaligned bitmaps. But to check that on > ARM, I added a run-time check for bitmap alignment. I gave it run on > several architectures and found nothing. > > Can you please check that on your hardware and compare performance of > generic vs arch code for you? If everything is OK, I suggest switching > ARM to generic find_bit() completely. > > Thanks, > Yury > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/ I _really_ don't want to play around with this stuff right now... 6.0 appears to have a regression on arm32 early on during boot: [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0 Booting 5.19 with the same filesystem works without issue and without even a fsck, but booting 6.0 always results in some problem that prevents it booting. Debugging this is not easy, because there also seems to be something up with the bloody serial console - sometimes I get nothing, other times I get nothing more than: [ 2.929502] EXT4-fs error (de and then the output stops. Is the console no longer synchronous? If it isn't, that's a huge mistake which can be seen right here with the partial message output... so I also need to work out how to make the console output synchronous again.
On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote: > On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote: > > Hi Russell, all, > > > > I'd like to respin a patch that switches ARM to generic find_bit() > > functions. > > > > Generic code works on par with arch or better, according to my > > testing [1], and with recent improvements merged in v6.1, it should > > be even faster. > > > > ARM already uses many generic find_bit() functions - those that it > > doesn't implement. So we are talking about migrating a subset of the > > API; most of find_bit() family has only generic implementation on ARM. > > > > The only concern about this migration is that ARM code supports > > byte-aligned bitmap addresses, while generic code is optimized for > > word-aligned bitmaps. > > > > In my practice, I've never seen unaligned bitmaps. But to check that on > > ARM, I added a run-time check for bitmap alignment. I gave it run on > > several architectures and found nothing. > > > > Can you please check that on your hardware and compare performance of > > generic vs arch code for you? If everything is OK, I suggest switching > > ARM to generic find_bit() completely. > > > > Thanks, > > Yury > > > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/ > > I _really_ don't want to play around with this stuff right now... 6.0 > appears to have a regression on arm32 early on during boot: > > [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0 > > Booting 5.19 with the same filesystem works without issue and without > even a fsck, but booting 6.0 always results in some problem that > prevents it booting. > > Debugging this is not easy, because there also seems to be something > up with the bloody serial console - sometimes I get nothing, other > times I get nothing more than: > > [ 2.929502] EXT4-fs error (de > > and then the output stops. Is the console no longer synchronous? If it > isn't, that's a huge mistake which can be seen right here with the > partial message output... so I also need to work out how to make the > console output synchronous again. Got it. I you think that EXT4 problems are due to unaligned bitmaps, you can take 1st patch from this series to check. Thanks, Yury
On Thu, Oct 20, 2022 at 01:02:07PM -0700, Yury Norov wrote: > On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote: > > On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote: > > > Hi Russell, all, > > > > > > I'd like to respin a patch that switches ARM to generic find_bit() > > > functions. > > > > > > Generic code works on par with arch or better, according to my > > > testing [1], and with recent improvements merged in v6.1, it should > > > be even faster. > > > > > > ARM already uses many generic find_bit() functions - those that it > > > doesn't implement. So we are talking about migrating a subset of the > > > API; most of find_bit() family has only generic implementation on ARM. > > > > > > The only concern about this migration is that ARM code supports > > > byte-aligned bitmap addresses, while generic code is optimized for > > > word-aligned bitmaps. > > > > > > In my practice, I've never seen unaligned bitmaps. But to check that on > > > ARM, I added a run-time check for bitmap alignment. I gave it run on > > > several architectures and found nothing. > > > > > > Can you please check that on your hardware and compare performance of > > > generic vs arch code for you? If everything is OK, I suggest switching > > > ARM to generic find_bit() completely. > > > > > > Thanks, > > > Yury > > > > > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/ > > > > I _really_ don't want to play around with this stuff right now... 6.0 > > appears to have a regression on arm32 early on during boot: > > > > [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0 > > > > Booting 5.19 with the same filesystem works without issue and without > > even a fsck, but booting 6.0 always results in some problem that > > prevents it booting. > > > > Debugging this is not easy, because there also seems to be something > > up with the bloody serial console - sometimes I get nothing, other > > times I get nothing more than: > > > > [ 2.929502] EXT4-fs error (de > > > > and then the output stops. Is the console no longer synchronous? If it > > isn't, that's a huge mistake which can be seen right here with the > > partial message output... so I also need to work out how to make the > > console output synchronous again. > > Got it. > > I you think that EXT4 problems are due to unaligned bitmaps, you can take > 1st patch from this series to check. Got to the bottom of it, it wasn't the bit array functions, it was DMA API issues. Okay, I've now tested the generic ops vs my updated optimised ops, and my version still comes out faster (based on three runs). The random-filled show less difference, but the sparse bitmaps show a much better win for my optimised code over the generic code where they exist: arm: [ 694.614773] find_next_bit: 40078 ns, 656 iterations gen: [ 88.611973] find_next_bit: 69943 ns, 655 iterations arm: [ 694.625436] find_next_zero_bit: 3939309 ns, 327025 iterations gen: [ 88.624227] find_next_zero_bit: 5529553 ns, 327026 iterations arm: [ 694.646236] find_first_bit: 7301363 ns, 656 iterations gen: [ 88.645313] find_first_bit: 7589120 ns, 655 iterations These figures appear to be pretty consistent across the three runs. For completness, here's the random-filled results: arm: [ 694.109190] find_next_bit: 2242618 ns, 163949 iterations gen: [ 88.167340] find_next_bit: 2632859 ns, 163743 iterations arm: [ 694.117968] find_next_zero_bit: 2049129 ns, 163732 iterations gen: [ 88.176912] find_next_zero_bit: 2844221 ns, 163938 iterations arm: [ 694.151421] find_first_bit: 17778911 ns, 16448 iterations gen: [ 88.211167] find_first_bit: 18596622 ns, 16401 iterations So, I don't see much reason to switch to the generic ops for these, not when we have such a significant speedup on the find_next_* functions for sparse-filled results..