mbox series

[0/9] lib/bitmap: optimize bitmap_weight() usage

Message ID 20211128035704.270739-1-yury.norov@gmail.com (mailing list archive)
Headers show
Series lib/bitmap: optimize bitmap_weight() usage | expand

Message

Yury Norov Nov. 28, 2021, 3:56 a.m. UTC
In many cases people use bitmap_weight()-based functions like this:

	if (num_present_cpus() > 1)
		do_something();

This may take considerable amount of time on many-cpus machines because
num_present_cpus() will traverse every word of underlying cpumask
unconditionally.

We can significantly improve on it for many real cases if stop traversing
the mask as soon as we count present cpus to any number greater than 1:

	if (num_present_cpus_gt(1))
		do_something();

To implement this idea, the series adds bitmap_weight_{eq,gt,le}
functions together with corresponding wrappers in cpumask and nodemask.

Yury Norov (9):
  lib/bitmap: add bitmap_weight_{eq,gt,le}
  lib/bitmap: implement bitmap_{empty,full} with bitmap_weight_eq()
  all: replace bitmap_weigth() with bitmap_{empty,full,eq,gt,le}
  tools: sync bitmap_weight() usage with the kernel
  lib/cpumask: add cpumask_weight_{eq,gt,le}
  lib/nodemask: add nodemask_weight_{eq,gt,le}
  lib/cpumask: add num_{possible,present,active}_cpus_{eq,gt,le}
  lib/nodemask: add num_node_state_eq()
  MAINTAINERS: add cpumask and nodemask files to BITMAP_API

 MAINTAINERS                                   |  4 ++
 arch/alpha/kernel/process.c                   |  2 +-
 arch/arc/kernel/smp.c                         |  2 +-
 arch/arm/kernel/machine_kexec.c               |  2 +-
 arch/arm/mach-exynos/exynos.c                 |  2 +-
 arch/arm/mm/cache-b15-rac.c                   |  2 +-
 arch/arm64/kernel/smp.c                       |  2 +-
 arch/arm64/mm/context.c                       |  2 +-
 arch/csky/mm/asid.c                           |  2 +-
 arch/csky/mm/context.c                        |  2 +-
 arch/ia64/kernel/setup.c                      |  2 +-
 arch/ia64/mm/tlb.c                            |  8 +--
 arch/mips/cavium-octeon/octeon-irq.c          |  4 +-
 arch/mips/kernel/crash.c                      |  2 +-
 arch/mips/kernel/i8253.c                      |  2 +-
 arch/mips/kernel/perf_event_mipsxx.c          |  4 +-
 arch/mips/kernel/rtlx-cmp.c                   |  2 +-
 arch/mips/kernel/smp.c                        |  4 +-
 arch/mips/kernel/vpe-cmp.c                    |  2 +-
 .../loongson2ef/common/cs5536/cs5536_mfgpt.c  |  2 +-
 arch/mips/mm/context.c                        |  2 +-
 arch/mips/mm/tlbex.c                          |  2 +-
 arch/nds32/kernel/perf_event_cpu.c            |  4 +-
 arch/nios2/kernel/cpuinfo.c                   |  2 +-
 arch/powerpc/kernel/smp.c                     |  2 +-
 arch/powerpc/kernel/watchdog.c                |  4 +-
 arch/powerpc/platforms/85xx/smp.c             |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |  4 +-
 arch/powerpc/sysdev/mpic.c                    |  2 +-
 arch/powerpc/xmon/xmon.c                      | 10 +--
 arch/riscv/kvm/vmid.c                         |  2 +-
 arch/s390/kernel/perf_cpum_cf.c               |  2 +-
 arch/sparc/kernel/mdesc.c                     |  6 +-
 arch/x86/events/amd/core.c                    |  2 +-
 arch/x86/kernel/alternative.c                 |  8 +--
 arch/x86/kernel/apic/apic.c                   |  4 +-
 arch/x86/kernel/apic/apic_flat_64.c           |  2 +-
 arch/x86/kernel/apic/probe_32.c               |  2 +-
 arch/x86/kernel/cpu/mce/dev-mcelog.c          |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 18 +++---
 arch/x86/kernel/hpet.c                        |  2 +-
 arch/x86/kernel/i8253.c                       |  2 +-
 arch/x86/kernel/kvm.c                         |  2 +-
 arch/x86/kernel/kvmclock.c                    |  2 +-
 arch/x86/kernel/smpboot.c                     |  4 +-
 arch/x86/kernel/tsc.c                         |  2 +-
 arch/x86/kvm/hyperv.c                         |  8 +--
 arch/x86/mm/amdtopology.c                     |  2 +-
 arch/x86/mm/mmio-mod.c                        |  2 +-
 arch/x86/mm/numa_emulation.c                  |  4 +-
 arch/x86/platform/uv/uv_nmi.c                 |  2 +-
 arch/x86/xen/smp_pv.c                         |  2 +-
 arch/x86/xen/spinlock.c                       |  2 +-
 drivers/acpi/numa/srat.c                      |  2 +-
 drivers/clk/samsung/clk-exynos4.c             |  2 +-
 drivers/clocksource/ingenic-timer.c           |  3 +-
 drivers/cpufreq/pcc-cpufreq.c                 |  2 +-
 drivers/cpufreq/qcom-cpufreq-hw.c             |  2 +-
 drivers/cpufreq/scmi-cpufreq.c                |  2 +-
 drivers/crypto/ccp/ccp-dev-v5.c               |  5 +-
 drivers/dma/mv_xor.c                          |  5 +-
 drivers/firmware/psci/psci_checker.c          |  2 +-
 drivers/gpu/drm/i810/i810_drv.c               |  2 +-
 drivers/gpu/drm/i915/i915_pmu.c               |  2 +-
 drivers/gpu/drm/msm/disp/mdp5/mdp5_smp.c      |  2 +-
 drivers/hv/channel_mgmt.c                     |  4 +-
 drivers/iio/adc/mxs-lradc-adc.c               |  3 +-
 drivers/iio/dummy/iio_simple_dummy_buffer.c   |  4 +-
 drivers/iio/industrialio-buffer.c             |  2 +-
 drivers/iio/industrialio-trigger.c            |  2 +-
 drivers/infiniband/hw/hfi1/affinity.c         | 13 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c      |  2 +-
 drivers/infiniband/hw/qib/qib_iba7322.c       |  2 +-
 drivers/infiniband/sw/siw/siw_main.c          |  3 +-
 drivers/irqchip/irq-bcm6345-l1.c              |  2 +-
 drivers/irqchip/irq-gic.c                     |  2 +-
 drivers/memstick/core/ms_block.c              |  4 +-
 drivers/net/caif/caif_virtio.c                |  2 +-
 drivers/net/dsa/b53/b53_common.c              |  2 +-
 drivers/net/ethernet/broadcom/bcmsysport.c    |  6 +-
 .../cavium/liquidio/cn23xx_vf_device.c        |  2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |  2 +-
 .../net/ethernet/intel/ice/ice_virtchnl_pf.c  |  4 +-
 .../net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  2 +-
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  2 +-
 .../marvell/octeontx2/nic/otx2_ethtool.c      |  2 +-
 .../marvell/octeontx2/nic/otx2_flows.c        |  8 +--
 .../ethernet/marvell/octeontx2/nic/otx2_pf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c      | 10 +--
 drivers/net/ethernet/mellanox/mlx4/eq.c       |  4 +-
 drivers/net/ethernet/mellanox/mlx4/main.c     |  2 +-
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c     |  3 +-
 drivers/net/ethernet/qlogic/qed/qed_rdma.c    |  4 +-
 drivers/net/ethernet/qlogic/qed/qed_roce.c    |  2 +-
 drivers/net/wireless/ath/ath9k/hw.c           |  2 +-
 drivers/net/wireless/marvell/mwifiex/main.c   |  4 +-
 drivers/net/wireless/st/cw1200/queue.c        |  3 +-
 drivers/nvdimm/region.c                       |  2 +-
 drivers/nvme/host/pci.c                       |  2 +-
 drivers/perf/arm-cci.c                        |  2 +-
 drivers/perf/arm_pmu.c                        |  6 +-
 drivers/perf/hisilicon/hisi_uncore_pmu.c      |  2 +-
 drivers/perf/thunderx2_pmu.c                  |  3 +-
 drivers/perf/xgene_pmu.c                      |  2 +-
 .../intel/speed_select_if/isst_if_common.c    |  6 +-
 drivers/pwm/pwm-pca9685.c                     |  2 +-
 drivers/scsi/lpfc/lpfc_init.c                 |  2 +-
 drivers/soc/bcm/brcmstb/biuctrl.c             |  2 +-
 drivers/soc/fsl/dpio/dpio-service.c           |  4 +-
 drivers/soc/fsl/qbman/qman_test_stash.c       |  2 +-
 drivers/spi/spi-dw-bt1.c                      |  2 +-
 drivers/staging/media/tegra-video/vi.c        |  2 +-
 drivers/thermal/intel/intel_powerclamp.c      | 10 ++-
 drivers/virt/acrn/hsm.c                       |  2 +-
 fs/ocfs2/cluster/heartbeat.c                  | 14 ++---
 fs/xfs/xfs_sysfs.c                            |  2 +-
 include/linux/bitmap.h                        | 45 ++++++++++---
 include/linux/cpumask.h                       | 55 ++++++++++++++++
 include/linux/kdb.h                           |  2 +-
 include/linux/nodemask.h                      | 29 +++++++++
 kernel/debug/kdb/kdb_bt.c                     |  2 +-
 kernel/irq/affinity.c                         |  2 +-
 kernel/padata.c                               |  2 +-
 kernel/printk/printk.c                        |  2 +-
 kernel/rcu/tree_nocb.h                        |  4 +-
 kernel/rcu/tree_plugin.h                      |  2 +-
 kernel/reboot.c                               |  4 +-
 kernel/sched/core.c                           | 10 +--
 kernel/sched/topology.c                       |  4 +-
 kernel/time/clockevents.c                     |  4 +-
 kernel/time/clocksource.c                     |  2 +-
 lib/bitmap.c                                  | 63 +++++++++++++++++++
 mm/mempolicy.c                                |  2 +-
 mm/page_alloc.c                               |  2 +-
 mm/percpu.c                                   |  6 +-
 mm/slab.c                                     |  2 +-
 mm/vmstat.c                                   |  4 +-
 tools/include/linux/bitmap.h                  | 42 ++++++++++---
 tools/lib/bitmap.c                            | 60 ++++++++++++++++++
 tools/perf/builtin-c2c.c                      |  4 +-
 tools/perf/util/pmu.c                         |  2 +-
 142 files changed, 490 insertions(+), 251 deletions(-)

Comments

Nicholas Piggin Nov. 28, 2021, 11:08 a.m. UTC | #1
Excerpts from Yury Norov's message of November 28, 2021 1:56 pm:
> In many cases people use bitmap_weight()-based functions like this:
> 
> 	if (num_present_cpus() > 1)
> 		do_something();
> 
> This may take considerable amount of time on many-cpus machines because
> num_present_cpus() will traverse every word of underlying cpumask
> unconditionally.
> 
> We can significantly improve on it for many real cases if stop traversing
> the mask as soon as we count present cpus to any number greater than 1:
> 
> 	if (num_present_cpus_gt(1))
> 		do_something();
> 
> To implement this idea, the series adds bitmap_weight_{eq,gt,le}
> functions together with corresponding wrappers in cpumask and nodemask.

There would be no change to callers if you maintain counters like what
is done for num_online_cpus() today. Maybe some fixes to arch code that
does not use set_cpu_possible() etc APIs required, but AFAIKS it would
be better to fix such cases anyway.

Thanks,
Nick
mirq-test@rere.qmqm.pl Nov. 28, 2021, 6:03 p.m. UTC | #2
On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote:
> In many cases people use bitmap_weight()-based functions like this:
> 
> 	if (num_present_cpus() > 1)
> 		do_something();
> 
> This may take considerable amount of time on many-cpus machines because
> num_present_cpus() will traverse every word of underlying cpumask
> unconditionally.
> 
> We can significantly improve on it for many real cases if stop traversing
> the mask as soon as we count present cpus to any number greater than 1:
> 
> 	if (num_present_cpus_gt(1))
> 		do_something();
> 
> To implement this idea, the series adds bitmap_weight_{eq,gt,le}
> functions together with corresponding wrappers in cpumask and nodemask.

Having slept on it I have more structured thoughts:

First, I like substituting bitmap_empty/full where possible - I think
the change stands on its own, so could be split and sent as is.

I don't like the proposed API very much. One problem is that it hides
the comparison operator and makes call sites less readable:

	bitmap_weight(...) > N

becomes:

	bitmap_weight_gt(..., N)

and:
	bitmap_weight(...) <= N

becomes:

	bitmap_weight_lt(..., N+1)
or:
	!bitmap_weight_gt(..., N)

I'd rather see something resembling memcmp() API that's known enough
to be easier to grasp. For above examples:

	bitmap_weight_cmp(..., N) > 0
	bitmap_weight_cmp(..., N) <= 0
	...

This would also make the implementation easier in not having to
copy and paste the code three times. Could also use a simple
optimization reducing code size:

#include <linux/overflow.h>

int bitmap_weight_cmp(long *bits, size_t nbits, size_t cmp)
{
	for (size_t i = 0; i < nbits / BITS_PER_LONG; ++i, ++bits)
		if (check_sub_overflow(cmp, popcount(*bits), &cmp))
			return 1;

	nbits %= BITS_PER_LONG;
	if (nbits && check_sub_overflow(cmp,
			popcount(*bits & GENMASK(nbits)), &cmp))
		return 1;

	return cmp ? -1 : 0;
}

Best Regards
Michał Mirosław
Yury Norov Nov. 28, 2021, 11:36 p.m. UTC | #3
On Sun, Nov 28, 2021 at 09:08:41PM +1000, Nicholas Piggin wrote:
> Excerpts from Yury Norov's message of November 28, 2021 1:56 pm:
> > In many cases people use bitmap_weight()-based functions like this:
> > 
> > 	if (num_present_cpus() > 1)
> > 		do_something();
> > 
> > This may take considerable amount of time on many-cpus machines because
> > num_present_cpus() will traverse every word of underlying cpumask
> > unconditionally.
> > 
> > We can significantly improve on it for many real cases if stop traversing
> > the mask as soon as we count present cpus to any number greater than 1:
> > 
> > 	if (num_present_cpus_gt(1))
> > 		do_something();
> > 
> > To implement this idea, the series adds bitmap_weight_{eq,gt,le}
> > functions together with corresponding wrappers in cpumask and nodemask.
> 
> There would be no change to callers if you maintain counters like what
> is done for num_online_cpus() today. Maybe some fixes to arch code that
> does not use set_cpu_possible() etc APIs required, but AFAIKS it would
> be better to fix such cases anyway.

Thanks, Nick. I'll try to do this.
Yury Norov Nov. 29, 2021, 6:38 a.m. UTC | #4
On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote:
> On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote:
> > In many cases people use bitmap_weight()-based functions like this:
> > 
> > 	if (num_present_cpus() > 1)
> > 		do_something();
> > 
> > This may take considerable amount of time on many-cpus machines because
> > num_present_cpus() will traverse every word of underlying cpumask
> > unconditionally.
> > 
> > We can significantly improve on it for many real cases if stop traversing
> > the mask as soon as we count present cpus to any number greater than 1:
> > 
> > 	if (num_present_cpus_gt(1))
> > 		do_something();
> > 
> > To implement this idea, the series adds bitmap_weight_{eq,gt,le}
> > functions together with corresponding wrappers in cpumask and nodemask.
> 
> Having slept on it I have more structured thoughts:
> 
> First, I like substituting bitmap_empty/full where possible - I think
> the change stands on its own, so could be split and sent as is.

Ok, I can do it.

> I don't like the proposed API very much. One problem is that it hides
> the comparison operator and makes call sites less readable:
> 
> 	bitmap_weight(...) > N
> 
> becomes:
> 
> 	bitmap_weight_gt(..., N)
> 
> and:
> 	bitmap_weight(...) <= N
> 
> becomes:
> 
> 	bitmap_weight_lt(..., N+1)
> or:
> 	!bitmap_weight_gt(..., N)
> 
> I'd rather see something resembling memcmp() API that's known enough
> to be easier to grasp. For above examples:
> 
> 	bitmap_weight_cmp(..., N) > 0
> 	bitmap_weight_cmp(..., N) <= 0
> 	...

bitmap_weight_cmp() cannot be efficient. Consider this example:

bitmap_weight_lt(1000 0000 0000 0000, 1) == false
                 ^
                 stop here

bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0
                                 ^
                                 stop here

I agree that '_gt' is less verbose than '>', but the advantage of 
'_gt' over '>' is proportional to length of bitmap, and it means
that this API should exist.

> This would also make the implementation easier in not having to
> copy and paste the code three times. Could also use a simple
> optimization reducing code size:

In the next version I'll reduce code duplication like this:

bool bitmap_eq(..., N);
bool bitmap_ge(..., N);

#define bitmap_weight_gt(..., N)  bitmap_weight_ge(..., N + 1)
#define bitmap_weight_lt(..., N) !bitmap_weight_ge(..., N)
#define bitmap_weight_le(..., N) !bitmap_weight_gt(..., N)

Thanks,
Yury
Michał Mirosław Nov. 29, 2021, 4:34 p.m. UTC | #5
Dnia 29 listopada 2021 06:38:39 UTC, Yury Norov <yury.norov@gmail.com> napisał/a:
>On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote:
>> On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote:
>> > In many cases people use bitmap_weight()-based functions like this:
>> > 
>> > 	if (num_present_cpus() > 1)
>> > 		do_something();
>> > 
>> > This may take considerable amount of time on many-cpus machines because
>> > num_present_cpus() will traverse every word of underlying cpumask
>> > unconditionally.
>> > 
>> > We can significantly improve on it for many real cases if stop traversing
>> > the mask as soon as we count present cpus to any number greater than 1:
>> > 
>> > 	if (num_present_cpus_gt(1))
>> > 		do_something();
>> > 
>> > To implement this idea, the series adds bitmap_weight_{eq,gt,le}
>> > functions together with corresponding wrappers in cpumask and nodemask.
>> 
>> Having slept on it I have more structured thoughts:
>> 
>> First, I like substituting bitmap_empty/full where possible - I think
>> the change stands on its own, so could be split and sent as is.
>
>Ok, I can do it.
>
>> I don't like the proposed API very much. One problem is that it hides
>> the comparison operator and makes call sites less readable:
>> 
>> 	bitmap_weight(...) > N
>> 
>> becomes:
>> 
>> 	bitmap_weight_gt(..., N)
>> 
>> and:
>> 	bitmap_weight(...) <= N
>> 
>> becomes:
>> 
>> 	bitmap_weight_lt(..., N+1)
>> or:
>> 	!bitmap_weight_gt(..., N)
>> 
>> I'd rather see something resembling memcmp() API that's known enough
>> to be easier to grasp. For above examples:
>> 
>> 	bitmap_weight_cmp(..., N) > 0
>> 	bitmap_weight_cmp(..., N) <= 0
>> 	...
>
>bitmap_weight_cmp() cannot be efficient. Consider this example:
>
>bitmap_weight_lt(1000 0000 0000 0000, 1) == false
>                 ^
>                 stop here
>
>bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0
>                                 ^
>                                 stop here
>
>I agree that '_gt' is less verbose than '>', but the advantage of 
>'_gt' over '>' is proportional to length of bitmap, and it means
>that this API should exist.

Thank you for the example. Indeed, for less-than to be efficient here you would need to replace
 bitmap_weight_cmp(..., N) < 0
with
 bitmap_weight_cmp(..., N-1) <= 0

It would still be more readable, I think.

Best Regards
Michał Mirosław
Yury Norov Dec. 2, 2021, 12:31 a.m. UTC | #6
On Mon, Nov 29, 2021 at 04:34:07PM +0000, Michał Mirosław wrote:
> Dnia 29 listopada 2021 06:38:39 UTC, Yury Norov <yury.norov@gmail.com> napisał/a:
> >On Sun, Nov 28, 2021 at 07:03:41PM +0100, mirq-test@rere.qmqm.pl wrote:
> >> On Sat, Nov 27, 2021 at 07:56:55PM -0800, Yury Norov wrote:
> >> > In many cases people use bitmap_weight()-based functions like this:
> >> > 
> >> > 	if (num_present_cpus() > 1)
> >> > 		do_something();
> >> > 
> >> > This may take considerable amount of time on many-cpus machines because
> >> > num_present_cpus() will traverse every word of underlying cpumask
> >> > unconditionally.
> >> > 
> >> > We can significantly improve on it for many real cases if stop traversing
> >> > the mask as soon as we count present cpus to any number greater than 1:
> >> > 
> >> > 	if (num_present_cpus_gt(1))
> >> > 		do_something();
> >> > 
> >> > To implement this idea, the series adds bitmap_weight_{eq,gt,le}
> >> > functions together with corresponding wrappers in cpumask and nodemask.
> >> 
> >> Having slept on it I have more structured thoughts:
> >> 
> >> First, I like substituting bitmap_empty/full where possible - I think
> >> the change stands on its own, so could be split and sent as is.
> >
> >Ok, I can do it.
> >
> >> I don't like the proposed API very much. One problem is that it hides
> >> the comparison operator and makes call sites less readable:
> >> 
> >> 	bitmap_weight(...) > N
> >> 
> >> becomes:
> >> 
> >> 	bitmap_weight_gt(..., N)
> >> 
> >> and:
> >> 	bitmap_weight(...) <= N
> >> 
> >> becomes:
> >> 
> >> 	bitmap_weight_lt(..., N+1)
> >> or:
> >> 	!bitmap_weight_gt(..., N)
> >> 
> >> I'd rather see something resembling memcmp() API that's known enough
> >> to be easier to grasp. For above examples:
> >> 
> >> 	bitmap_weight_cmp(..., N) > 0
> >> 	bitmap_weight_cmp(..., N) <= 0
> >> 	...
> >
> >bitmap_weight_cmp() cannot be efficient. Consider this example:
> >
> >bitmap_weight_lt(1000 0000 0000 0000, 1) == false
> >                 ^
> >                 stop here
> >
> >bitmap_weight_cmp(1000 0000 0000 0000, 1) == 0
> >                                 ^
> >                                 stop here
> >
> >I agree that '_gt' is less verbose than '>', but the advantage of 
> >'_gt' over '>' is proportional to length of bitmap, and it means
> >that this API should exist.
> 
> Thank you for the example. Indeed, for less-than to be efficient here you would need to replace
>  bitmap_weight_cmp(..., N) < 0
> with
>  bitmap_weight_cmp(..., N-1) <= 0

Indeed, thanks for pointing to it.
 
> It would still be more readable, I think.

To be honest, I'm not sure that
        bitmap_weight_cmp(..., N-1) <= 0
would be an obvious replacement for the original
        bitmap_weight(...) < N
comparing to 
        bitmap_weight_lt(..., N)

I think the best thing I can do is to add bitmap_weight_cmp() as
you suggested, and turn lt and others to be wrappers on it. This
will let people choose a better function in each case.

I also think that for v2 it would be better to drop the conversion
for short bitmaps, except for switching to bitmap_empty(), because
in that case readability wins over performance; if no objections. 

Thanks,
Yury