Message ID | 20211128035704.270739-1-yury.norov@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | lib/bitmap: optimize bitmap_weight() usage | expand |
Excerpts from Yury Norov's message of November 28, 2021 1:56 pm: > In many cases people use bitmap_weight()-based functions like this: > > if (num_present_cpus() > 1) > do_something(); > > This may take considerable amount of time on many-cpus machines because > num_present_cpus() will traverse every word of underlying cpumask > unconditionally. > > We can significantly improve on it for many real cases if stop traversing > the mask as soon as we count present cpus to any number greater than 1: > > if (num_present_cpus_gt(1)) > do_something(); > > To implement this idea, the series adds bitmap_weight_{eq,gt,le} > functions together with corresponding wrappers in cpumask and nodemask. There would be no change to callers if you maintain counters like what is done for num_online_cpus() today. Maybe some fixes to arch code that does not use set_cpu_possible() etc APIs required, but AFAIKS it would be better to fix such cases anyway. Thanks, Nick
On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote: > In many cases people use bitmap_weight()-based functions like this: > > if (num_present_cpus() > 1) > do_something(); > > This may take considerable amount of time on many-cpus machines because > num_present_cpus() will traverse every word of underlying cpumask > unconditionally. > > We can significantly improve on it for many real cases if stop traversing > the mask as soon as we count present cpus to any number greater than 1: > > if (num_present_cpus_gt(1)) > do_something(); > > To implement this idea, the series adds bitmap_weight_{eq,gt,le} > functions together with corresponding wrappers in cpumask and nodemask. Having slept on it I have more structured thoughts: First, I like substituting bitmap_empty/full where possible - I think the change stands on its own, so could be split and sent as is. I don't like the proposed API very much. One problem is that it hides the comparison operator and makes call sites less readable: bitmap_weight(...) > N becomes: bitmap_weight_gt(..., N) and: bitmap_weight(...) <= N becomes: bitmap_weight_lt(..., N+1) or: !bitmap_weight_gt(..., N) I'd rather see something resembling memcmp() API that's known enough to be easier to grasp. For above examples: bitmap_weight_cmp(..., N) > 0 bitmap_weight_cmp(..., N) <= 0 ... This would also make the implementation easier in not having to copy and paste the code three times. Could also use a simple optimization reducing code size: #include <linux/overflow.h> int bitmap_weight_cmp(long *bits, size_t nbits, size_t cmp) { for (size_t i = 0; i < nbits / BITS_PER_LONG; ++i, ++bits) if (check_sub_overflow(cmp, popcount(*bits), &cmp)) return 1; nbits %= BITS_PER_LONG; if (nbits && check_sub_overflow(cmp, popcount(*bits & GENMASK(nbits)), &cmp)) return 1; return cmp ? -1 : 0; } Best Regards Michał Mirosław
On Sun, Nov 28, 2021 at 09:08:41PM +1000, Nicholas Piggin wrote: > Excerpts from Yury Norov's message of November 28, 2021 1:56 pm: > > In many cases people use bitmap_weight()-based functions like this: > > > > if (num_present_cpus() > 1) > > do_something(); > > > > This may take considerable amount of time on many-cpus machines because > > num_present_cpus() will traverse every word of underlying cpumask > > unconditionally. > > > > We can significantly improve on it for many real cases if stop traversing > > the mask as soon as we count present cpus to any number greater than 1: > > > > if (num_present_cpus_gt(1)) > > do_something(); > > > > To implement this idea, the series adds bitmap_weight_{eq,gt,le} > > functions together with corresponding wrappers in cpumask and nodemask. > > There would be no change to callers if you maintain counters like what > is done for num_online_cpus() today. Maybe some fixes to arch code that > does not use set_cpu_possible() etc APIs required, but AFAIKS it would > be better to fix such cases anyway. Thanks, Nick. I'll try to do this.
On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote: > On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote: > > In many cases people use bitmap_weight()-based functions like this: > > > > if (num_present_cpus() > 1) > > do_something(); > > > > This may take considerable amount of time on many-cpus machines because > > num_present_cpus() will traverse every word of underlying cpumask > > unconditionally. > > > > We can significantly improve on it for many real cases if stop traversing > > the mask as soon as we count present cpus to any number greater than 1: > > > > if (num_present_cpus_gt(1)) > > do_something(); > > > > To implement this idea, the series adds bitmap_weight_{eq,gt,le} > > functions together with corresponding wrappers in cpumask and nodemask. > > Having slept on it I have more structured thoughts: > > First, I like substituting bitmap_empty/full where possible - I think > the change stands on its own, so could be split and sent as is. Ok, I can do it. > I don't like the proposed API very much. One problem is that it hides > the comparison operator and makes call sites less readable: > > bitmap_weight(...) > N > > becomes: > > bitmap_weight_gt(..., N) > > and: > bitmap_weight(...) <= N > > becomes: > > bitmap_weight_lt(..., N+1) > or: > !bitmap_weight_gt(..., N) > > I'd rather see something resembling memcmp() API that's known enough > to be easier to grasp. For above examples: > > bitmap_weight_cmp(..., N) > 0 > bitmap_weight_cmp(..., N) <= 0 > ... bitmap_weight_cmp() cannot be efficient. Consider this example: bitmap_weight_lt(1000 0000 0000 0000, 1) == false ^ stop here bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0 ^ stop here I agree that '_gt' is less verbose than '>', but the advantage of '_gt' over '>' is proportional to length of bitmap, and it means that this API should exist. > This would also make the implementation easier in not having to > copy and paste the code three times. Could also use a simple > optimization reducing code size: In the next version I'll reduce code duplication like this: bool bitmap_eq(..., N); bool bitmap_ge(..., N); #define bitmap_weight_gt(..., N) bitmap_weight_ge(..., N + 1) #define bitmap_weight_lt(..., N) !bitmap_weight_ge(..., N) #define bitmap_weight_le(..., N) !bitmap_weight_gt(..., N) Thanks, Yury
Dnia 29 listopada 2021 06:38:39 UTC, Yury Norov <yury.norov@gmail.com> napisał/a: >On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote: >> On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote: >> > In many cases people use bitmap_weight()-based functions like this: >> > >> > if (num_present_cpus() > 1) >> > do_something(); >> > >> > This may take considerable amount of time on many-cpus machines because >> > num_present_cpus() will traverse every word of underlying cpumask >> > unconditionally. >> > >> > We can significantly improve on it for many real cases if stop traversing >> > the mask as soon as we count present cpus to any number greater than 1: >> > >> > if (num_present_cpus_gt(1)) >> > do_something(); >> > >> > To implement this idea, the series adds bitmap_weight_{eq,gt,le} >> > functions together with corresponding wrappers in cpumask and nodemask. >> >> Having slept on it I have more structured thoughts: >> >> First, I like substituting bitmap_empty/full where possible - I think >> the change stands on its own, so could be split and sent as is. > >Ok, I can do it. > >> I don't like the proposed API very much. One problem is that it hides >> the comparison operator and makes call sites less readable: >> >> bitmap_weight(...) > N >> >> becomes: >> >> bitmap_weight_gt(..., N) >> >> and: >> bitmap_weight(...) <= N >> >> becomes: >> >> bitmap_weight_lt(..., N+1) >> or: >> !bitmap_weight_gt(..., N) >> >> I'd rather see something resembling memcmp() API that's known enough >> to be easier to grasp. For above examples: >> >> bitmap_weight_cmp(..., N) > 0 >> bitmap_weight_cmp(..., N) <= 0 >> ... > >bitmap_weight_cmp() cannot be efficient. Consider this example: > >bitmap_weight_lt(1000 0000 0000 0000, 1) == false > ^ > stop here > >bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0 > ^ > stop here > >I agree that '_gt' is less verbose than '>', but the advantage of >'_gt' over '>' is proportional to length of bitmap, and it means >that this API should exist. Thank you for the example. Indeed, for less-than to be efficient here you would need to replace bitmap_weight_cmp(..., N) < 0 with bitmap_weight_cmp(..., N-1) <= 0 It would still be more readable, I think. Best Regards Michał Mirosław
On Mon, Nov 29, 2021 at 04:34:07PM +0000, Michał Mirosław wrote: > Dnia 29 listopada 2021 06:38:39 UTC, Yury Norov <yury.norov@gmail.com> napisał/a: > >On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote: > >> On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote: > >> > In many cases people use bitmap_weight()-based functions like this: > >> > > >> > if (num_present_cpus() > 1) > >> > do_something(); > >> > > >> > This may take considerable amount of time on many-cpus machines because > >> > num_present_cpus() will traverse every word of underlying cpumask > >> > unconditionally. > >> > > >> > We can significantly improve on it for many real cases if stop traversing > >> > the mask as soon as we count present cpus to any number greater than 1: > >> > > >> > if (num_present_cpus_gt(1)) > >> > do_something(); > >> > > >> > To implement this idea, the series adds bitmap_weight_{eq,gt,le} > >> > functions together with corresponding wrappers in cpumask and nodemask. > >> > >> Having slept on it I have more structured thoughts: > >> > >> First, I like substituting bitmap_empty/full where possible - I think > >> the change stands on its own, so could be split and sent as is. > > > >Ok, I can do it. > > > >> I don't like the proposed API very much. One problem is that it hides > >> the comparison operator and makes call sites less readable: > >> > >> bitmap_weight(...) > N > >> > >> becomes: > >> > >> bitmap_weight_gt(..., N) > >> > >> and: > >> bitmap_weight(...) <= N > >> > >> becomes: > >> > >> bitmap_weight_lt(..., N+1) > >> or: > >> !bitmap_weight_gt(..., N) > >> > >> I'd rather see something resembling memcmp() API that's known enough > >> to be easier to grasp. For above examples: > >> > >> bitmap_weight_cmp(..., N) > 0 > >> bitmap_weight_cmp(..., N) <= 0 > >> ... > > > >bitmap_weight_cmp() cannot be efficient. Consider this example: > > > >bitmap_weight_lt(1000 0000 0000 0000, 1) == false > > ^ > > stop here > > > >bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0 > > ^ > > stop here > > > >I agree that '_gt' is less verbose than '>', but the advantage of > >'_gt' over '>' is proportional to length of bitmap, and it means > >that this API should exist. > > Thank you for the example. Indeed, for less-than to be efficient here you would need to replace > bitmap_weight_cmp(..., N) < 0 > with > bitmap_weight_cmp(..., N-1) <= 0 Indeed, thanks for pointing to it. > It would still be more readable, I think. To be honest, I'm not sure that bitmap_weight_cmp(..., N-1) <= 0 would be an obvious replacement for the original bitmap_weight(...) < N comparing to bitmap_weight_lt(..., N) I think the best thing I can do is to add bitmap_weight_cmp() as you suggested, and turn lt and others to be wrappers on it. This will let people choose a better function in each case. I also think that for v2 it would be better to drop the conversion for short bitmaps, except for switching to bitmap_empty(), because in that case readability wins over performance; if no objections. Thanks, Yury