[V17,0/9] arm64/perf: Enable branch stack sampling

Message ID 20240405024639.1179064-1-anshuman.khandual@arm.com (mailing list archive)

Anshuman Khandual April 5, 2024, 2:46 a.m. UTC
This series enables perf branch stack sampling support on the arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All the
relevant register definitions can be accessed here.

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on 6.9-rc2.

Also, this series is being hosted below for quick access, review and testing.

https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)

There are still some open questions regarding handling multiple perf events
with different privilege branch filters getting scheduled on the same PMU,
supporting guest branch stack tracing from the host etc. We are also looking
for suggestions regarding supporting BRBE inside the guest. The series has
been completely re-organized as suggested earlier.

- Anshuman

========== Perf Branch Stack Sampling Support (arm64 platforms) ===========

Currently the arm64 platform does not support perf branch stack sampling. Hence
any event requesting branch stack records, i.e PERF_SAMPLE_BRANCH_STACK marked
in event->attr.sample_type, will be rejected in armpmu_event_init().

static int armpmu_event_init(struct perf_event *event)
{
	........
        /* does not support taken branch sampling */
        if (has_branch_stack(event))
                return -EOPNOTSUPP;
	........
}

$perf record -j any,u,k ls
Error:
cycles:P: PMU Hardware or event type doesn't support branch stack sampling.

-------------------- CONFIG_ARM64_BRBE and FEAT_BRBE ----------------------

After this series, the perf branch stack sampling feature gets enabled on arm64
platforms where the FEAT_BRBE HW feature is supported and CONFIG_ARM64_BRBE is
also selected during the build. Let's observe all possible scenarios here.

1. Feature not built (!CONFIG_ARM64_BRBE):

Falls back to the current behaviour i.e the event gets rejected.

2. Feature built but HW not supported (CONFIG_ARM64_BRBE && !FEAT_BRBE):

Falls back to the current behaviour i.e the event gets rejected.

3. Feature built and HW supported (CONFIG_ARM64_BRBE && FEAT_BRBE):

The platform supports branch stack sampling requests. Let's observe this
through a simple example here.

$perf record -j any_call,u,k,save_type ls

[Please refer to the perf-record man page for all possible branch filter options]

$perf report
-------------------------- Snip ----------------------
# Overhead  Command  Source Shared Object  Source Symbol                                 Target Symbol                                 Basic Block Cycles
# ........  .......  ....................  ............................................  ............................................  ..................
#
     3.52%  ls       [kernel.kallsyms]     [k] sched_clock_noinstr                       [k] arch_counter_get_cntpct                   16
     3.52%  ls       [kernel.kallsyms]     [k] sched_clock                               [k] sched_clock_noinstr                       9
     1.85%  ls       [kernel.kallsyms]     [k] sched_clock_cpu                           [k] sched_clock                               5
     1.80%  ls       [kernel.kallsyms]     [k] irqtime_account_irq                       [k] sched_clock_cpu                           20
     1.58%  ls       [kernel.kallsyms]     [k] gic_handle_irq                            [k] generic_handle_domain_irq                 19
     1.58%  ls       [kernel.kallsyms]     [k] call_on_irq_stack                         [k] gic_handle_irq                            9
     1.58%  ls       [kernel.kallsyms]     [k] do_interrupt_handler                      [k] call_on_irq_stack                         23
     1.58%  ls       [kernel.kallsyms]     [k] generic_handle_domain_irq                 [k] __irq_resolve_mapping                     6
     1.58%  ls       [kernel.kallsyms]     [k] __irq_resolve_mapping                     [k] __rcu_read_lock                           10
-------------------------- Snip ----------------------

$perf report -D | grep cycles
-------------------------- Snip ----------------------
.....  1: ffff800080dd3334 -> ffff800080dd759c 39 cycles  P   0 IND_CALL
.....  2: ffff800080ffaea0 -> ffff800080ffb688 16 cycles  P   0 IND_CALL
.....  3: ffff800080139918 -> ffff800080ffae64 9  cycles  P   0 CALL
.....  4: ffff800080dd3324 -> ffff8000801398f8 7  cycles  P   0 CALL
.....  5: ffff8000800f8548 -> ffff800080dd330c 21 cycles  P   0 IND_CALL
.....  6: ffff8000800f864c -> ffff8000800f84ec 6  cycles  P   0 CALL
.....  7: ffff8000800f86dc -> ffff8000800f8638 11 cycles  P   0 CALL
.....  8: ffff8000800f86d4 -> ffff800081008630 16 cycles  P   0 CALL
-------------------------- Snip ----------------------

perf script and other tooling can also be applied on the captured perf.data.
Similarly, branch stack sampling records can be collected directly via the
perf_event_open() system call after setting up 'struct perf_event_attr' as
required.

event->attr.sample_type |= PERF_SAMPLE_BRANCH_STACK
event->attr.branch_sample_type |= PERF_SAMPLE_BRANCH_<FILTER_1> |
				  PERF_SAMPLE_BRANCH_<FILTER_2> |
				  PERF_SAMPLE_BRANCH_<FILTER_3> |
				  ...............................

However, not all branch filters might be supported on the platform.
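
For illustration, here is a minimal user space sketch (not part of this series;
the event choice, filters and sample period below are arbitrary) that opens a
branch stack sampling event via perf_event_open():

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_branch_sampling_event(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_USER;

	/* Unsupported branch filters get rejected here by the PMU driver */
	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}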

----------------------- BRBE Branch Filters Support -----------------------

- Following branch filters are supported on arm64.

	PERF_SAMPLE_BRANCH_USER		/* Branch privilege filters */
	PERF_SAMPLE_BRANCH_KERNEL
	PERF_SAMPLE_BRANCH_HV

	PERF_SAMPLE_BRANCH_ANY		/* Branch type filters */
	PERF_SAMPLE_BRANCH_ANY_CALL
	PERF_SAMPLE_BRANCH_ANY_RETURN
	PERF_SAMPLE_BRANCH_IND_CALL
	PERF_SAMPLE_BRANCH_COND
	PERF_SAMPLE_BRANCH_IND_JUMP
	PERF_SAMPLE_BRANCH_CALL

	PERF_SAMPLE_BRANCH_NO_FLAGS	/* Branch record flags */
	PERF_SAMPLE_BRANCH_NO_CYCLES
	PERF_SAMPLE_BRANCH_TYPE_SAVE
	PERF_SAMPLE_BRANCH_HW_INDEX
	PERF_SAMPLE_BRANCH_PRIV_SAVE

- Following branch filters are not supported on arm64.

	PERF_SAMPLE_BRANCH_ABORT_TX
	PERF_SAMPLE_BRANCH_IN_TX
	PERF_SAMPLE_BRANCH_NO_TX
	PERF_SAMPLE_BRANCH_CALL_STACK

Events requesting the above unsupported branch filters get rejected.

------------------ Possible 'branch_sample_type' Mismatch -----------------

The branch stack sampling attribute 'event->attr.branch_sample_type' generally
remains the same for all the events during a perf record session.

$perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]

event_1->attr.branch_sample_type == event_2->attr.branch_sample_type

This 'branch_sample_type' is used to configure the BRBE hardware when both
events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
a PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
remain the same for all events. Let's consider the following example:

$perf record -e cycles:u -e instructions:k -j any,save_type ls

cycles->attr.branch_sample_type != instructions->attr.branch_sample_type

This is because the cycles event inherits PERF_SAMPLE_BRANCH_USER and the
instructions event inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution
here configures the BRBE hardware with the 'branch_sample_type' from the last
event to be added to the PMU, and hence captured branch records only get
passed on to matching events during a PMU interrupt.

static int
armpmu_add(struct perf_event *event, int flags)
{
	........
	if (has_branch_stack(event)) {
		/*
		 * Reset branch records buffer if a new task event gets
		 * scheduled on a PMU which might have existing records.
		 * Otherwise older branch records present in the buffer
		 * might leak into the new task event.
		 */
		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
			hw_events->brbe_context = event->ctx;
			if (armpmu->branch_reset)
				armpmu->branch_reset();
		}
		hw_events->brbe_users++;
Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
	}
	........
}

Instead of overriding existing 'branch_sample_type', both could be merged.
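
As a rough sketch of that alternative (hypothetical, not what this series
implements), the filter requests from all scheduled events could be ORed into
the live BRBE configuration at the same spot in armpmu_add():

static int
armpmu_add(struct perf_event *event, int flags)
{
	........
	if (has_branch_stack(event)) {
		........
		hw_events->brbe_users++;
		/*
		 * Hypothetical: accumulate branch filters from all scheduled
		 * events instead of overriding them, so the BRBE configuration
		 * becomes a superset of every request. Branch records would
		 * then still need per-event filtering when they are passed on
		 * during the PMU interrupt.
		 */
		hw_events->brbe_sample_type |= event->attr.branch_sample_type;
	}
	........
}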

--------------------------- Virtualisation support ------------------------

- Branch stack sampling is not currently supported inside the guest (TODO)

	- FEAT_BRBE advertised as absent via clearing ID_AA64DFR0_EL1.BRBE
	- Future support in guest requires emulating FEAT_BRBE

- Branch stack sampling of the guest is not supported from the host    (TODO)

	- Tracing the guest with event->attr.exclude_guest = 0
	- There are multiple challenges involved regarding mixing events
	  with mismatched branch_sample_type and exclude_guest and passing
	  on captured BRBE records to intended events during PMU interrupt

- Guest access for BRBE registers and instructions has been blocked

- BRBE state save is not required for VHE host (EL2) to guest (EL1) transitions

- BRBE state is saved across NVHE host (EL1) to guest (EL1) transitions

-------------------------------- Testing ---------------------------------

- Cross compiled for both arm64 and arm32 platforms
- Passes all branch tests with 'perf test branch' on arm64

-------------------------------- Questions -------------------------------

- Instead of configuring the BRBE HW with the branch_sample_type from the last
  event to be added on the PMU as proposed, could those be merged together,
  e.g all privilege requests ORed, to form a common BRBE configuration so that
  all events get branch records after a PMU interrupt?

Changes in V17:

- Added back Reviewed-by tags from Mark Brown
- Updated the commit message regarding the field BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3
- Added leading 0s for all values as BRBIDR0_EL1.NUMREC is an 8 bit field
- Added leading 0s for all values as BRBFCR_EL1.BANK is a 2 bit field
- Reordered BRBCR_EL1/BRBCR_EL12/BRBCR_EL2 registers as per sysreg encodings
- Renamed s/FIRST/BANK_0 and s/SECOND/BANK_1 in BRBFCR_EL1.BANK
- Renamed s/UNCOND_DIRECT/DIRECT_UNCOND in BRBINFx_EL1.TYPE
- Renamed s/COND_DIRECT/DIRECT_COND in BRBINFx_EL1.TYPE
- Dropped __SYS_BRBINF/__SYS_BRBSRC/__SYS_BRBTGT and their expansions
- Moved all existing BRBE registers from sysreg.h header to tools/sysreg format
- Updated the commit message including about sys_insn_descs[]
- Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
- Moved the BRBE instructions into sys_insn_descs[] array
- ARM PMUV3 changes have been moved into the BRBE driver patch instead
- Moved down branch_stack_add() in armpmu_add() after event's basic checks
- Added new callbacks in struct arm_pmu e.g branch_stack_[init|add|del]()
- Renamed struct arm_pmu callback branch_reset() as branch_stack_reset()
- Dropped the comment in armpmu_event_init()
- Renamed 'pmu_hw_events' elements from 'brbe_' to more generic 'branch_'
- Separated out from the BRBE driver implementation patch
- Dropped the comment in __init_el2_brbe()
- Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
- Updated __init_el2_brbe() with __check_hvhe() constructs
- Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control
- Dropped Documentation/arch/arm64/brbe.rst
- Renamed armv8pmu_branch_reset() as armv8pmu_branch_stack_reset()
- Separated out booting.rst and EL2 boot requirements into a new patch
- Dropped process_branch_aborts() completely
- Added a warning if transaction states get detected unexpectedly
- Dropped enum brbe_bank_idx from the driver
- Defined armv8pmu_branch_stack_init/add/del() callbacks in the driver
- Changed BRBE driver to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
- Dropped isb() call sites in  __debug_[save|restore]_brbe()
- Changed to [read|write]_sysreg_el1() accessors in __debug_[save|restore]_brbe()

Changes in V16:

https://lore.kernel.org/all/20240125094119.2542332-1-anshuman.khandual@arm.com/

- Updated BRBINFx_EL1.TYPE = 0b110000 as field IMPDEF_TRAP_EL3
- Updated BRBCR_ELx[9] as field FZPSS
- Updated BRBINFINJ_EL1 to use sysreg field BRBINFx_EL1
- Added BRB_INF_SRC_TGT_EL1 macro for corresponding BRB_[INF|SRC|TGT] expansion
- Renamed arm_brbe.h as arm_pmuv3_branch.h
- Updated perf_sample_save_brstack()'s new argument requirements with NULL
- Fixed typo (s/informations/information) in Documentation/arch/arm64/brbe.rst
- Added SPDX-License-Identifier in Documentation/arch/arm64/brbe.rst
- Added new PERF_SAMPLE_BRANCH_COUNTERS into BRBE_EXCLUDE_BRANCH_FILTERS
- Dropped BRBFCR_EL1 and BRBCR_EL1 from enum vcpu_sysreg
- Reverted back the KVM NVHE patch - use host_debug_state based 'brbcr_el1'
  element and dropped the previous dependency on James' coresight series

Changes in V15:

https://lore.kernel.org/all/20231201053906.1261704-1-anshuman.khandual@arm.com/

- Added a comment for armv8pmu_branch_probe() regarding single cpu probe
- Added a text in brbe.rst regarding single cpu probe
- Dropped runtime BRBE enable for setting DEBUG_STATE_SAVE_BRBE
- Dropped zero_branch_stack based zero branch records mechanism
- Replaced BRBFCR_EL1_DEFAULT_CONFIG with BRBFCR_EL1_CONFIG_MASK
- Added BRBFCR_EL1_CONFIG_MASK masking in branch_type_to_brbfcr()
- Moved BRBE helpers from arm_brbe.h into arm_brbe.c
- Moved armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
- Moved armv8_pmu_xxx() stub definitions inside arm_brbe.h for arm32 (!CONFIG_ARM64_BRBE)
- Included arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
- Dropped BRBE custom pr_fmt()
- Dropped CONFIG_PERF_EVENTS wrapping from header entries
- Flush branch records when a cpu bound event follows a task bound event
- Dropped BRBFCR_EL1 from __debug_save_brbe()/__debug_restore_brbe()
- Always save the live SYS_BRBCR_EL1 in host context and then check if
  BRBE was enabled before resetting SYS_BRBCR_EL1 for the host

Changes in V14:

https://lore.kernel.org/all/20231114051329.327572-1-anshuman.khandual@arm.com/

- This series has been reorganised as suggested during V13
- There are just eight patches now i.e 5 enablement and 3 perf branch tests

- Fixed brackets problem in __SYS_BRBINFO/BRBSRC/BRBTGT() macros
- Renamed the macro i.e s/__SYS_BRBINFO/__SYS_BRBINF/
- Renamed s/BRB_IALL/BRB_IALL_INSN and s/BRBE_INJ/BRB_INJ_INSN
- Moved BRB_IALL_INSN and SYS_BRB_INSN instructions to sysreg patch
- Changed E1BRE as ExBRE in sysreg fields inside BRBCR_ELx
- Used BRBCR_ELx for defining all BRBCR_EL1, BRBCR_EL2, and BRBCR_EL12 (new)

- Folded the following three patches into a single patch i.e [PATCH 3/8]

  drivers: perf: arm_pmu: Add new sched_task() callback
  arm64/perf: Add branch stack support in struct arm_pmu
  arm64/perf: Add branch stack support in struct pmu_hw_events
  arm64/perf: Add branch stack support in ARMV8 PMU
  arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()

- All armv8pmu_branch_xxxx() stub definitions have been moved inside
  include/linux/perf/arm_pmuv3.h for easy access from both arm32 and arm64
- Added brbe_users, brbe_context and brbe_sample_type in struct pmu_hw_events
- Added comments for all the above new elements in struct pmu_hw_events
- Added branch_reset() and sched_task() callbacks
- Changed and optimized branch records processing during a PMU IRQ
- NO branch records get captured for event with mismatched brbe_sample_type
- Branch record context is tracked from armpmu_del() & armpmu_add()
- Branch record hardware is driven from armv8pmu_start() & armv8pmu_stop()
- Dropped NULL check for 'pmu_ctx' inside armv8pmu_sched_task()
- Moved down PERF_ATTACH_TASK_DATA assignment with a preceding comment
- In conflicting branch sample type requests, first event takes precedence

- Folded the following five patches from V13 into a single patch i.e
  [PATCH 4/8]

  arm64/perf: Enable branch stack events via FEAT_BRBE
  arm64/perf: Add struct brbe_regset helper functions
  arm64/perf: Implement branch records save on task sched out
  arm64/perf: Implement branch records save on PMU IRQ

- Fixed the year in copyright statement
- Added Documentation/arch/arm64/brbe.rst
- Updated Documentation/arch/arm64/booting.rst (BRBCR_EL2.CC for EL1 entry)
- Added __init_el2_brbe() which enables branch record cycle count support
- Disabled EL2 traps in __init_el2_fgt() while accessing BRBE registers and
  executing instructions
- Changed CONFIG_ARM64_BRBE user visible description
- Fixed a typo in CONFIG_ARM64_BRBE config option description text
- Added BUILD_BUG_ON() co-relating BRBE_BANK_MAX_ENTRIES and MAX_BRANCH_RECORDS
- Dropped arm64_create_brbe_task_ctx_kmem_cache()
- Moved down comment for PERF_SAMPLE_BRANCH_KERNEL in branch_type_to_brbcr()
- Renamed BRBCR_ELx_DEFAULT_CONFIG as BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_DEFAULT_TS with BRBCR_ELx_TS_MASK in BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_E1BRE instances with BRBCR_ELx_ExBRE

- Added BRBE specific branch stack sampling perf test patches into the series
- Added a patch to prevent guest accesses into BRBE registers and instructions
- Added a patch to save the BRBE host context in NVHE environment
- Updated most commit messages

Changes in V13:

https://lore.kernel.org/all/20230711082455.215983-1-anshuman.khandual@arm.com/
https://lore.kernel.org/all/20230622065351.1092893-1-anshuman.khandual@arm.com/

- Added branch callback stubs for aarch32 pmuv3 based platforms
- Updated the comments for capture_brbe_regset()
- Deleted the comments in __read_brbe_regset()
- Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Changes in V12:

https://lore.kernel.org/all/20230615133239.442736-1-anshuman.khandual@arm.com/

- Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
- Replaced branch types with complete INSN/ALIGN prefixes/suffixes
- Replaced return branch types as simple RET/ERET
- Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
- Added 0 padding for BRBIDR0_EL1.NUMREC enum values
- Dropped helper arm_pmu_branch_stack_supported()
- Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
- Separated perf_task_ctx_cache setup from arm_pmu private allocation
- Collected changes to branch_records_alloc() in a single patch [5/10]
- Reworked and cleaned up branch_records_alloc()
- Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
- Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
- Updated the comment in branch_type_to_brbcr()
- Fixed the comment before stitch_stored_live_entries()
- Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
- Factored out helper __read_brbe_regset() from capture_brbe_regset()
- Dropped the helper copy_brbe_regset()
- Simplified stitch_stored_live_entries() with memcpy(), memmove()
- Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
- Rework brbe_attributes_probe() without 'struct brbe_hw_attr'
- Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
- Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
- Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead

Changes in V11:

https://lore.kernel.org/all/20230531040428.501523-1-anshuman.khandual@arm.com/

- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data

Changes in V10:

https://lore.kernel.org/all/20230517022410.722287-1-anshuman.khandual@arm.com/

- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()

Changes in V9:

https://lore.kernel.org/all/20230315051444.1683170-1-anshuman.khandual@arm.com/

- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism

Changes in V8:

https://lore.kernel.org/all/20230123125956.1350336-1-anshuman.khandual@arm.com/

- Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()
- Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call

Changes in V7:

https://lore.kernel.org/all/20230105031039.207972-1-anshuman.khandual@arm.com/

- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganize PERF_SAMPLE_BRANCH_NO_[CYCLES|NO_FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
  select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()

Changes in V6:

https://lore.kernel.org/linux-arm-kernel/20221208084402.863310-1-anshuman.khandual@arm.com/

- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformatted armv8pmu_branch_valid() warnings while rejecting unsupported events

Changes in V5:

https://lore.kernel.org/linux-arm-kernel/20221107062514.2851047-1-anshuman.khandual@arm.com/

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/20221017055713.451092-1-anshuman.khandual@arm.com/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/20220929075857.158358-1-anshuman.khandual@arm.com/

- Moved brbe_stack from the stack and now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branch records are now captured
  only if the event has PERF_SAMPLE_BRANCH_KERNEL which would already
  have been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/20220908051046.465307-1-anshuman.khandual@arm.com/

- Dropped branch sample filter helpers consolidation patch from this series
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/20220613100119.684673-1-anshuman.khandual@arm.com/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/20220412115455.293119-1-anshuman.khandual@arm.com/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/1642998653-21377-1-git-send-email-anshuman.khandual@arm.com/

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (6):
  arm64/sysreg: Add BRBE registers and fields
  KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
  drivers: perf: arm_pmu: Add infrastructure for branch stack sampling
  arm64/boot: Enable EL2 requirements for BRBE
  drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
  KVM: arm64: nvhe: Disable branch generation in nVHE guests

James Clark (3):
  perf: test: Speed up running brstack test on an Arm model
  perf: test: Remove empty lines from branch filter test output
  perf: test: Extend branch stack sampling test for Arm64 BRBE

 Documentation/arch/arm64/booting.rst   |  26 +
 arch/arm64/include/asm/el2_setup.h     |  90 ++-
 arch/arm64/include/asm/kvm_host.h      |   5 +-
 arch/arm64/include/asm/sysreg.h        |  17 +-
 arch/arm64/kvm/debug.c                 |   5 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c     |  31 +
 arch/arm64/kvm/sys_regs.c              |  56 ++
 arch/arm64/tools/sysreg                | 131 ++++
 drivers/perf/Kconfig                   |  11 +
 drivers/perf/Makefile                  |   1 +
 drivers/perf/arm_brbe.c                | 968 +++++++++++++++++++++++++
 drivers/perf/arm_pmu.c                 |  25 +-
 drivers/perf/arm_pmuv3.c               | 146 +++-
 drivers/perf/arm_pmuv3_branch.h        |  73 ++
 include/linux/perf/arm_pmu.h           |  37 +-
 tools/perf/tests/builtin-test.c        |   1 +
 tools/perf/tests/shell/test_brstack.sh |  57 +-
 tools/perf/tests/tests.h               |   1 +
 tools/perf/tests/workloads/Build       |   2 +
 tools/perf/tests/workloads/traploop.c  |  39 +
 20 files changed, 1696 insertions(+), 26 deletions(-)
 create mode 100644 drivers/perf/arm_brbe.c
 create mode 100644 drivers/perf/arm_pmuv3_branch.h
 create mode 100644 tools/perf/tests/workloads/traploop.c

Comments

Adam Young April 5, 2024, 3:54 a.m. UTC | #1
Will this apply on 6.8?  We have been chasing getting this tested, and
are a revision behind you.

We are just testing on 6.8, and the fact that we don't have a version that
works on a stable revision has been one of the reasons we have not provided
feedback.

I did get to test an early version of the patch several months ago and 
strictly speaking it works:

I used a quicksort built unoptimized and compared it with an AutoFDO 
version consuming the output from BRBE.  The AutoFDO version was 
significantly faster, so we do get some optimization.

What is the actual test framework you are using?  I would like to do an
apples-to-apples comparison with this version of the patch set or the most
recent one I can get to apply.



On 4/4/24 22:46, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
>
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>
> This series applies on 6.9-rc2.
>
> Also this series is being hosted below for quick access, review and test.
>
> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>
> There are still some open questions regarding handling multiple perf events
> with different privilege branch filters getting on the same PMU, supporting
> guest branch stack tracing from the host etc. Finally also looking for some
> suggestions regarding supporting BRBE inside the guest. The series has been
> re-organized completely as suggested earlier.
>
> - Anshuman
>
> ========== Perf Branch Stack Sampling Support (arm64 platforms) ===========
>
> Currently arm64 platform does not support perf branch stack sampling. Hence
> any event requesting for branch stack records i.e PERF_SAMPLE_BRANCH_STACK
> marked in event->attr.sample_type, will be rejected in armpmu_event_init().
>
> static int armpmu_event_init(struct perf_event *event)
> {
> 	........
>          /* does not support taken branch sampling */
>          if (has_branch_stack(event))
>                  return -EOPNOTSUPP;
> 	........
> }
>
> $perf record -j any,u,k ls
> Error:
> cycles:P: PMU Hardware or event type doesn't support branch stack sampling.
>
> -------------------- CONFIG_ARM64_BRBE and FEAT_BRBE ----------------------
>
> After this series, perf branch stack sampling feature gets enabled on arm64
> platforms where FEAT_BRBE HW feature is supported, and CONFIG_ARM64_BRBE is
> also selected during build. Let's observe all all possible scenarios here.
>
> 1. Feature not built (!CONFIG_ARM64_BRBE):
>
> Falls back to the current behaviour i.e event gets rejected.
>
> 2. Feature built but HW not supported (CONFIG_ARM64_BRBE && !FEAT_BRBE):
>
> Falls back to the current behaviour i.e event gets rejected.
>
> 3. Feature built and HW supported (CONFIG_ARM64_BRBE && FEAT_BRBE):
>
> Platform supports branch stack sampling requests. Let's observe through a
> simple example here.
>
> $perf record -j any_call,u,k,save_type ls
>
> [Please refer perf-record man pages for all possible branch filter options]
>
> $perf report
> -------------------------- Snip ----------------------
> # Overhead  Command  Source Shared Object  Source Symbol                                 Target Symbol                                 Basic Block Cycles
> # ........  .......  ....................  ............................................  ............................................  ..................
> #
>       3.52%  ls       [kernel.kallsyms]     [k] sched_clock_noinstr                       [k] arch_counter_get_cntpct                   16
>       3.52%  ls       [kernel.kallsyms]     [k] sched_clock                               [k] sched_clock_noinstr                       9
>       1.85%  ls       [kernel.kallsyms]     [k] sched_clock_cpu                           [k] sched_clock                               5
>       1.80%  ls       [kernel.kallsyms]     [k] irqtime_account_irq                       [k] sched_clock_cpu                           20
>       1.58%  ls       [kernel.kallsyms]     [k] gic_handle_irq                            [k] generic_handle_domain_irq                 19
>       1.58%  ls       [kernel.kallsyms]     [k] call_on_irq_stack                         [k] gic_handle_irq                            9
>       1.58%  ls       [kernel.kallsyms]     [k] do_interrupt_handler                      [k] call_on_irq_stack                         23
>       1.58%  ls       [kernel.kallsyms]     [k] generic_handle_domain_irq                 [k] __irq_resolve_mapping                     6
>       1.58%  ls       [kernel.kallsyms]     [k] __irq_resolve_mapping                     [k] __rcu_read_lock                           10
> -------------------------- Snip ----------------------
>
> $perf report -D | grep cycles
> -------------------------- Snip ----------------------
> .....  1: ffff800080dd3334 -> ffff800080dd759c 39 cycles  P   0 IND_CALL
> .....  2: ffff800080ffaea0 -> ffff800080ffb688 16 cycles  P   0 IND_CALL
> .....  3: ffff800080139918 -> ffff800080ffae64 9  cycles  P   0 CALL
> .....  4: ffff800080dd3324 -> ffff8000801398f8 7  cycles  P   0 CALL
> .....  5: ffff8000800f8548 -> ffff800080dd330c 21 cycles  P   0 IND_CALL
> .....  6: ffff8000800f864c -> ffff8000800f84ec 6  cycles  P   0 CALL
> .....  7: ffff8000800f86dc -> ffff8000800f8638 11 cycles  P   0 CALL
> .....  8: ffff8000800f86d4 -> ffff800081008630 16 cycles  P   0 CALL
> -------------------------- Snip ----------------------
>
> perf script and other tooling can also be applied on the captured perf.data
> Similarly branch stack sampling records can be collected via direct system
> call i.e perf_event_open() method after setting 'struct perf_event_attr' as
> required.
>
> event->attr.sample_type |= PERF_SAMPLE_BRANCH_STACK
> event->attr.branch_sample_type |= PERF_SAMPLE_BRANCH_<FILTER_1> |
> 				  PERF_SAMPLE_BRANCH_<FILTER_2> |
> 				  PERF_SAMPLE_BRANCH_<FILTER_3> |
> 				  ...............................
>
> But all branch filters might not be supported on the platform.
>
> ----------------------- BRBE Branch Filters Support -----------------------
>
> - Following branch filters are supported on arm64.
>
> 	PERF_SAMPLE_BRANCH_USER		/* Branch privilege filters */
> 	PERF_SAMPLE_BRANCH_KERNEL
> 	PERF_SAMPLE_BRANCH_HV
>
> 	PERF_SAMPLE_BRANCH_ANY		/* Branch type filters */
> 	PERF_SAMPLE_BRANCH_ANY_CALL
> 	PERF_SAMPLE_BRANCH_ANY_RETURN
> 	PERF_SAMPLE_BRANCH_IND_CALL
> 	PERF_SAMPLE_BRANCH_COND
> 	PERF_SAMPLE_BRANCH_IND_JUMP
> 	PERF_SAMPLE_BRANCH_CALL
>
> 	PERF_SAMPLE_BRANCH_NO_FLAGS	/* Branch record flags */
> 	PERF_SAMPLE_BRANCH_NO_CYCLES
> 	PERF_SAMPLE_BRANCH_TYPE_SAVE
> 	PERF_SAMPLE_BRANCH_HW_INDEX
> 	PERF_SAMPLE_BRANCH_PRIV_SAVE
>
> - Following branch filters are not supported on arm64.
>
> 	PERF_SAMPLE_BRANCH_ABORT_TX
> 	PERF_SAMPLE_BRANCH_IN_TX
> 	PERF_SAMPLE_BRANCH_NO_TX
> 	PERF_SAMPLE_BRANCH_CALL_STACK
>
> Events requesting above non-supported branch filters get rejected.
>
> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>
> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> remain the same for all the events during a perf record session.
>
> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>
> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>
> This 'branch_sample_type' is used to configure the BRBE hardware, when both
> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> remain the same for all events. Let's consider the following example
>
> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>
> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>
> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> BRBE hardware with 'branch_sample_type' from last event to be added in the
> PMU and hence captured branch records only get passed on to matching events
> during a PMU interrupt.
>
> static int
> armpmu_add(struct perf_event *event, int flags)
> {
> 	........
> 	if (has_branch_stack(event)) {
> 		/*
> 		 * Reset branch records buffer if a new task event gets
> 		 * scheduled on a PMU which might have existing records.
> 		 * Otherwise older branch records present in the buffer
> 		 * might leak into the new task event.
> 		 */
> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> 			hw_events->brbe_context = event->ctx;
> 			if (armpmu->branch_reset)
> 				armpmu->branch_reset();
> 		}
> 		hw_events->brbe_users++;
> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
> 	}
> 	........
> }
>
> Instead of overriding existing 'branch_sample_type', both could be merged.
>
> --------------------------- Virtualisation support ------------------------
>
> - Branch stack sampling is not currently supported inside the guest (TODO)
>
> 	- FEAT_BRBE advertised as absent via clearing ID_AA64DFR0_EL1.BRBE
> 	- Future support in guest requires emulating FEAT_BRBE
>
> - Branch stack sampling the guest is not supported in the host      (TODO)
>
> 	- Tracing the guest with event->attr.exclude_guest = 0
> 	- There are multiple challenges involved regarding mixing events
> 	  with mismatched branch_sample_type and exclude_guest and passing
> 	  on captured BRBE records to intended events during PMU interrupt
>
> - Guest access for BRBE registers and instructions has been blocked
>
> - BRBE state save is not required for VHE host (EL2) guest (EL1) transition
>
> - BRBE state is saved for NVHE host (EL1) guest (EL1) transition
>
> -------------------------------- Testing ---------------------------------
>
> - Cross compiled for both arm64 and arm32 platforms
> - Passes all branch tests with 'perf test branch' on arm64
>
> -------------------------------- Questions -------------------------------
>
> - Instead of configuring the BRBE HW with branch_sample_type from the last
>    event to be added on the PMU as proposed, could those be merged together
>    e.g all privilege requests ORed, to form a common BRBE configuration and
>    all events get branch records after a PMU interrupt ?
>
> Changes in V17:
>
> - Added back Reviewed-by tags from Mark Brown
> - Updated the commit message regarding the field BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3
> - Added leading 0s for all values as BRBIDR0_EL1.NUMREC is a 8 bit field
> - Added leading 0s for all values as BRBFCR_EL1.BANK is a 2 bit field
> - Reordered BRBCR_EL1/BRBCR_EL12/BRBCR_EL2 registers as per sysreg encodings
> - Renamed s/FIRST/BANK_0 and s/SECOND/BANK_1 in BRBFCR_EL1.BANK
> - Renamed s/UNCOND_DIRECT/DIRECT_UNCOND in BRBINFx_EL1.TYPE
> - Renamed s/COND_DIRECT/DIRECT_COND in BRBINFx_EL1.TYPE
> - Dropped __SYS_BRBINF/__SYS_BRBSRC/__SYS_BRBTGT and their expansions
> - Moved all existing BRBE registers from sysreg.h header to tools/sysreg format
> - Updated the commit message including about sys_insn_descs[]
> - Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
> - Moved the BRBE instructions into sys_insn_descs[] array
> - ARM PMUV3 changes have been moved into the BRBE driver patch instead
> - Moved down branch_stack_add() in armpmu_add() after event's basic checks
> - Added new callbacks in struct arm_pmu e.g branch_stack_[init|add|del]()
> - Renamed struct arm_pmu callback branch_reset() as branch_stack_reset()
> - Dropped the comment in armpmu_event_init()
> - Renamed 'pmu_hw_events' elements from 'brbe_' to more generic 'branch_'
> - Separated out from the BRBE driver implementation patch
> - Dropped the comment in __init_el2_brbe()
> - Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
> - Updated __init_el2_brbe() with __check_hvhe() constructs
> - Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control
> - Dropped Documentation/arch/arm64/brbe.rst
> - Renamed armv8pmu_branch_reset() as armv8pmu_branch_stack_reset()
> - Separated out booting.rst and EL2 boot requirements into a new patch
> - Dropped process_branch_aborts() completely
> - Added an warning if transaction states get detected unexpectedly
> - Dropped enum brbe_bank_idx from the driver
> - Defined armv8pmu_branch_stack_init/add/del() callbacks in the driver
> - Changed BRBE driver to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
> - Dropped isb() call sites in  __debug_[save|restore]_brbe()
> - Changed to [read|write]_sysreg_el1() accessors in __debug_[save|restore]_brbe()
>
> Changes in V16
>
> https://lore.kernel.org/all/20240125094119.2542332-1-anshuman.khandual@arm.com/
>
> - Updated BRBINFx_EL1.TYPE = 0b110000 as field IMPDEF_TRAP_EL3
> - Updated BRBCR_ELx[9] as field FZPSS
> - Updated BRBINFINJ_EL1 to use sysreg field BRBINFx_EL1
> - Added BRB_INF_SRC_TGT_EL1 macro for corresponding BRB_[INF|SRC|TGT] expansion
> - Renamed arm_brbe.h as arm_pmuv3_branch.h
> - Updated perf_sample_save_brstack()'s new argument requirements with NULL
> - Fixed typo (s/informations/information) in Documentation/arch/arm64/brbe.rst
> - Added SPDX-License-Identifier in Documentation/arch/arm64/brbe.rst
> - Added new PERF_SAMPLE_BRANCH_COUNTERS into BRBE_EXCLUDE_BRANCH_FILTERS
> - Dropped BRBFCR_EL1 and BRBCR_EL1 from enum vcpu_sysreg
> - Reverted back the KVM NVHE patch - use host_debug_state based 'brbcr_el1'
>    element and dropped the previous dependency on Jame's coresight series
>
> Changes in V15:
>
> https://lore.kernel.org/all/20231201053906.1261704-1-anshuman.khandual@arm.com/
>
> - Added a comment for armv8pmu_branch_probe() regarding single cpu probe
> - Added a text in brbe.rst regarding single cpu probe
> - Dropped runtime BRBE enable for setting DEBUG_STATE_SAVE_BRBE
> - Dropped zero_branch_stack based zero branch records mechanism
> - Replaced BRBFCR_EL1_DEFAULT_CONFIG with BRBFCR_EL1_CONFIG_MASK
> - Added BRBFCR_EL1_CONFIG_MASK masking in branch_type_to_brbfcr()
> - Moved BRBE helpers from arm_brbe.h into arm_brbe.c
> - Moved armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
> - Moved armv8_pmu_xxx() stub definitions inside arm_brbe.h for arm32 (!CONFIG_ARM64_BRBE)
> - Included arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
> - Dropped BRBE custom pr_fmt()
> - Dropped CONFIG_PERF_EVENTS wrapping from header entries
> - Flush branch records when a cpu bound event follows a task bound event
> - Dropped BRBFCR_EL1 from __debug_save_brbe()/__debug_restore_brbe()
> - Always save the live SYS_BRBCR_EL1 in host context and then check if
>    BRBE was enabled before resetting SYS_BRBCR_EL1 for the host
>
> Changes in V14:
>
> https://lore.kernel.org/all/20231114051329.327572-1-anshuman.khandual@arm.com/
>
> - This series has been reorganised as suggested during V13
> - There are just eight patches now i.e 5 enablement and 3 perf branch tests
>
> - Fixed brackets problem in __SYS_BRBINFO/BRBSRC/BRBTGT() macros
> - Renamed the macro i.e s/__SYS_BRBINFO/__SYS_BRBINF/
> - Renamed s/BRB_IALL/BRB_IALL_INSN and s/BRBE_INJ/BRB_INJ_INSN
> - Moved BRB_IALL_INSN and SYS_BRB_INSN instructions to sysreg patch
> - Changed E1BRE as ExBRE in sysreg fields inside BRBCR_ELx
> - Used BRBCR_ELx for defining all BRBCR_EL1, BRBCR_EL2, and BRBCR_EL12 (new)
>
> - Folded the following three patches into a single patch i.e [PATCH 3/8]
>
>    drivers: perf: arm_pmu: Add new sched_task() callback
>    arm64/perf: Add branch stack support in struct arm_pmu
>    arm64/perf: Add branch stack support in struct pmu_hw_events
>    arm64/perf: Add branch stack support in ARMV8 PMU
>    arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
>
> - All armv8pmu_branch_xxxx() stub definitions have been moved inside
>    include/linux/perf/arm_pmuv3.h for easy access from both arm32 and arm64
> - Added brbe_users, brbe_context and brbe_sample_type in struct pmu_hw_events
> - Added comments for all the above new elements in struct pmu_hw_events
> - Added branch_reset() and sched_task() callbacks
> - Changed and optimized branch records processing during a PMU IRQ
> - NO branch records get captured for event with mismatched brbe_sample_type
> - Branch record context is tracked from armpmu_del() & armpmu_add()
> - Branch record hardware is driven from armv8pmu_start() & armv8pmu_stop()
> - Dropped NULL check for 'pmu_ctx' inside armv8pmu_sched_task()
> - Moved down PERF_ATTACH_TASK_DATA assignment with a preceding comment
> - In conflicting branch sample type requests, first event takes precedence
>
> - Folded the following five patches from V13 into a single patch i.e
>    [PATCH 4/8]
>
>    arm64/perf: Enable branch stack events via FEAT_BRBE
>    arm64/perf: Add struct brbe_regset helper functions
>    arm64/perf: Implement branch records save on task sched out
>    arm64/perf: Implement branch records save on PMU IRQ
>
> - Fixed the year in copyright statement
> - Added Documentation/arch/arm64/brbe.rst
> - Updated Documentation/arch/arm64/booting.rst (BRBCR_EL2.CC for EL1 entry)
> - Added __init_el2_brbe() which enables branch record cycle count support
> - Disabled EL2 traps in __init_el2_fgt() while accessing BRBE registers and
>    executing instructions
> - Changed CONFIG_ARM64_BRBE user visible description
> - Fixed a typo in CONFIG_ARM64_BRBE config option description text
> - Added BUILD_BUG_ON() co-relating BRBE_BANK_MAX_ENTRIES and MAX_BRANCH_RECORDS
> - Dropped arm64_create_brbe_task_ctx_kmem_cache()
> - Moved down comment for PERF_SAMPLE_BRANCH_KERNEL in branch_type_to_brbcr()
> - Renamed BRBCR_ELx_DEFAULT_CONFIG as BRBCR_ELx_CONFIG_MASK
> - Replaced BRBCR_ELx_DEFAULT_TS with BRBCR_ELx_TS_MASK in BRBCR_ELx_CONFIG_MASK
> - Replaced BRBCR_ELx_E1BRE instances with BRBCR_ELx_ExBRE
>
> - Added BRBE specific branch stack sampling perf test patches into the series
> - Added a patch to prevent guest accesses into BRBE registers and instructions
> - Added a patch to save the BRBE host context in NVHE environment
> - Updated most commit messages
>
> Changes in V13:
>
> https://lore.kernel.org/all/20230711082455.215983-1-anshuman.khandual@arm.com/
> https://lore.kernel.org/all/20230622065351.1092893-1-anshuman.khandual@arm.com/
>
> - Added branch callback stubs for aarch32 pmuv3 based platforms
> - Updated the comments for capture_brbe_regset()
> - Deleted the comments in __read_brbe_regset()
> - Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()
>
> Changes in V12:
>
> https://lore.kernel.org/all/20230615133239.442736-1-anshuman.khandual@arm.com/
>
> - Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
> - Replaced branch types with complete INSN/ALIGN prefixes/suffixes
> - Replaced return branch types as simple RET/ERET
> - Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
> - Added 0 padding for BRBIDR0_EL1.NUMREC enum values
> - Dropped helper arm_pmu_branch_stack_supported()
> - Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
> - Separated perf_task_ctx_cache setup from arm_pmu private allocation
> - Collected changes to branch_records_alloc() in a single patch [5/10]
> - Reworked and cleaned up branch_records_alloc()
> - Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
> - Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
> - Updated the comment in branch_type_to_brbcr()
> - Fixed the comment before stitch_stored_live_entries()
> - Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
> - Factored out helper __read_brbe_regset() from capture_brbe_regset()
> - Dropped the helper copy_brbe_regset()
> - Simplified stitch_stored_live_entries() with memcpy(), memmove()
> - Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
> - Rework brbe_attributes_probe() without 'struct brbe_hw_attr'
> - Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
> - Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
> - Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead
>
> Changes in V11:
>
> https://lore.kernel.org/all/20230531040428.501523-1-anshuman.khandual@arm.com/
>
> - Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data
>
> Changes in V10:
>
> https://lore.kernel.org/all/20230517022410.722287-1-anshuman.khandual@arm.com/
>
> - Rebased the series on v6.4-rc2
> - Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
> - Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
> - Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()
>
> Changes in V9:
>
> https://lore.kernel.org/all/20230315051444.1683170-1-anshuman.khandual@arm.com/
>
> - Fixed build problem with has_branch_stack() in arm64 header
> - BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
> - Renamed all BRBINF_EL1 call sites as BRBINFx_EL1
> - Dropped static const char branch_filter_error_msg[]
> - Implemented a positive list check for BRBE supported perf branch filters
> - Added a comment in armv8pmu_handle_irq()
> - Implemented per-cpu allocation for struct branch_record records
> - Skipped looping through bank 1 if an invalid record is detected in bank 0
> - Added comment in armv8pmu_branch_read() explaining prohibited region etc
> - Added comment warning about erroneously marking transactions as aborted
> - Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
> - Dropped the last argument (idx) in capture_brbe_flags()
> - Dropped the brbcr argument from capture_brbe_flags()
> - Used perf_sample_save_brstack() to capture branch records for perf_sample_data
> - Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
> - Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
> - Implemented event task context based branch records save mechanism
>
> Changes in V8:
>
> https://lore.kernel.org/all/20230123125956.1350336-1-anshuman.khandual@arm.com/
>
> - Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
> - Added a comment and line break before arm_pmu->private element
> - Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
> - Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
> - Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
> - Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
> - Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
> - Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
> - Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
> - Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()
> - Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
> - Renamed brbe_valid() as brbe_record_is_complete()
> - Renamed brbe_source() as brbe_record_is_source_only()
> - Renamed brbe_target() as brbe_record_is_target_only()
> - Inverted checks for !brbe_record_is_[target|source]_only() for info capture
> - Replaced 'fetch' with 'get' in all helpers that extract field value
> - Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
> - Dropped select_brbe_bank_index() completely, added capture_branch_entry()
> - Process captured branch entries in two separate loops one for each BRBE bank
> - Moved branch_records_alloc() inside armv8pmu_probe_pmu()
> - Added a forward declaration for the helper has_branch_stack()
> - Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
> - Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call
>
> Changes in V7:
>
> https://lore.kernel.org/all/20230105031039.207972-1-anshuman.khandual@arm.com/
>
> - Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
> - Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
> - Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
> - Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
> - Defined BRBCR_EL1_DEFAULT_TS in the header
> - Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
> - Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
> - Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
> - Also set BRBE in paused state in armv8pmu_branch_disable()
> - Dropped brbe_paused(), set_brbe_paused() helpers
> - Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
> - Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
> - Added valid_brbe_[cc, format, version]() helpers
> - Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
> - Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
> - Defined enum brbe_bank_idx with possible values for BRBE bank indices
> - Changed armpmu->hw_attr into armpmu->private
> - Added missing space in stub definition for armv8pmu_branch_valid()
> - Replaced both kmalloc() with kzalloc()
> - Added BRBE_BANK_MAX_ENTRIES
> - Updated comment for capture_brbe_flags()
> - Updated comment for struct brbe_hw_attr
> - Dropped space after type cast in a couple of places
> - Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
> - Captured cpuc->branches->branch_entries[idx] in a local variable
> - Dropped saved_priv from armv8pmu_branch_read()
> - Reorganize PERF_SAMPLE_BRANCH_NO_[CYCLES|NO_FLAGS] related configuration
> - Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
> - Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
> - Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
>    select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
> - Reorganized brbe_valid_nr() and dropped the pr_warn() message
> - Changed probe sequence in brbe_attributes_probe()
> - Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
> - Disable BRBE before disabling the PMU event counter
> - Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
> - Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()
>
> Changes in V6:
>
> https://lore.kernel.org/linux-arm-kernel/20221208084402.863310-1-anshuman.khandual@arm.com/
>
> - Restore the exception level privilege after reading the branch records
> - Unpause the buffer after reading the branch records
> - Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
> - Reworked BRBE implementation and branch stack sampling support on arm pmu
> - BRBE implementation is now part of overall ARMV8 PMU implementation
> - BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
> - CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
> - File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
> - File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
> - BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
> - BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
> - BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
> - Added sched_task() callback into struct arm_pmu
> - Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
> - Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
> - Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
> - Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
> - Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
> - armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
> - Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
> - Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
> - Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
> - Dropped set_brbe_disabled() as well
> - Reformatted armv8pmu_branch_valid() warnings while rejecting unsupported events
>
> Changes in V5:
>
> https://lore.kernel.org/linux-arm-kernel/20221107062514.2851047-1-anshuman.khandual@arm.com/
>
> - Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
> - Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
> - Changed config ARM_BRBE_PMU from 'tristate' to 'bool'
>
> Changes in V4:
>
> https://lore.kernel.org/all/20221017055713.451092-1-anshuman.khandual@arm.com/
>
> - Changed ../tools/sysreg declarations as suggested
> - Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
> - Dropped perfmon_capable() check in armpmu_event_init()
> - s/pr_warn_once/pr_info in armpmu_event_init()
> - Added brbe_format element into struct pmu_hw_events
> - Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
> - Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning
>
> Changes in V3:
>
> https://lore.kernel.org/all/20220929075857.158358-1-anshuman.khandual@arm.com/
>
> - Moved brbe_stack from the stack and now dynamically allocated
> - Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
> - Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
> - Created dummy BRBINF_EL1 field definitions in tools/sysreg
> - Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
> - Both exception and exception return branch records are now captured
>    only if the event has PERF_SAMPLE_BRANCH_KERNEL, which would already
>    have been checked in generic perf via perf_allow_kernel()
>
> Changes in V2:
>
> https://lore.kernel.org/all/20220908051046.465307-1-anshuman.khandual@arm.com/
>
> - Dropped branch sample filter helpers consolidation patch from this series
> - Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
> - Use cached perfmon_capable() while configuring BRBE branch record filters
>
> Changes in V1:
>
> https://lore.kernel.org/linux-arm-kernel/20220613100119.684673-1-anshuman.khandual@arm.com/
>
> - Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
> - Process new perf branch types via PERF_BR_EXTEND_ABI
>
> Changes in RFC V2:
>
> https://lore.kernel.org/linux-arm-kernel/20220412115455.293119-1-anshuman.khandual@arm.com/
>
> - Added branch_sample_priv() while consolidating other branch sample filter helpers
> - Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
> - Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
> - Added documentation for struct arm_pmu changes, updated commit message
> - Updated commit message for BRBE detection infrastructure patch
> - PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
> - Branch privilege state capture mechanism has now moved inside the driver
>
> Changes in RFC V1:
>
> https://lore.kernel.org/all/1642998653-21377-1-git-send-email-anshuman.khandual@arm.com/
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: James Clark <james.clark@arm.com>
> Cc: Rob Herring <robh@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Suzuki Poulose <suzuki.poulose@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> Anshuman Khandual (6):
>    arm64/sysreg: Add BRBE registers and fields
>    KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
>    drivers: perf: arm_pmu: Add infrastructure for branch stack sampling
>    arm64/boot: Enable EL2 requirements for BRBE
>    drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
>    KVM: arm64: nvhe: Disable branch generation in nVHE guests
>
> James Clark (3):
>    perf: test: Speed up running brstack test on an Arm model
>    perf: test: Remove empty lines from branch filter test output
>    perf: test: Extend branch stack sampling test for Arm64 BRBE
>
>   Documentation/arch/arm64/booting.rst   |  26 +
>   arch/arm64/include/asm/el2_setup.h     |  90 ++-
>   arch/arm64/include/asm/kvm_host.h      |   5 +-
>   arch/arm64/include/asm/sysreg.h        |  17 +-
>   arch/arm64/kvm/debug.c                 |   5 +
>   arch/arm64/kvm/hyp/nvhe/debug-sr.c     |  31 +
>   arch/arm64/kvm/sys_regs.c              |  56 ++
>   arch/arm64/tools/sysreg                | 131 ++++
>   drivers/perf/Kconfig                   |  11 +
>   drivers/perf/Makefile                  |   1 +
>   drivers/perf/arm_brbe.c                | 968 +++++++++++++++++++++++++
>   drivers/perf/arm_pmu.c                 |  25 +-
>   drivers/perf/arm_pmuv3.c               | 146 +++-
>   drivers/perf/arm_pmuv3_branch.h        |  73 ++
>   include/linux/perf/arm_pmu.h           |  37 +-
>   tools/perf/tests/builtin-test.c        |   1 +
>   tools/perf/tests/shell/test_brstack.sh |  57 +-
>   tools/perf/tests/tests.h               |   1 +
>   tools/perf/tests/workloads/Build       |   2 +
>   tools/perf/tests/workloads/traploop.c  |  39 +
>   20 files changed, 1696 insertions(+), 26 deletions(-)
>   create mode 100644 drivers/perf/arm_brbe.c
>   create mode 100644 drivers/perf/arm_pmuv3_branch.h
>   create mode 100644 tools/perf/tests/workloads/traploop.c
>
Anshuman Khandual April 8, 2024, 2:26 a.m. UTC | #2
> 
> Will this apply on 6.8?  We have been chasing getting this tested, and are a revision behind you.

It might require some rebasing though - especially in the KVM code.

> 
> We are just testing on 6.8, and the fact that we don't have one that works on a stable revision has been one of the reasons we have not provided feedback.

Because some other changes went in during the last merge window, I would
suggest testing this series on the latest v6.9 release instead.

> 
> I did get to test an early version of the patch several months ago and strictly speaking it works:
> 
> I used a quicksort built unoptimized and compared it with an AutoFDO version consuming the output from BRBE.  The AutoFDO version was significantly faster, so we do get some optimization.

Just wondering if the AutoFDO version (with BRBE branches) was also better
than the previous AutoFDO version (without BRBE branches)? That would
attribute the observed improvement to the BRBE branch records.

> 
> What is the actual test framework you are using to test?  I would like to do an apples-to-apples comparison with this version of the patch set or the most recent one I can get to apply.

The following standard perf tools based test should be a good start, unless
you would like to hand-craft some workloads with specific branch types etc.

#perf test -vv branch
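
A hand-crafted workload along those lines could be as simple as the sketch
below (purely illustrative, not part of this series or of the perf test
suite); it mixes direct calls, indirect calls and conditional branches so
that individual -j filters have something distinctive to capture.

/* branches.c - illustrative workload, e.g. gcc -O0 -o branches branches.c */
#include <stdio.h>

/* noinline keeps the calls visible as real branch records */
static __attribute__((noinline)) long add_one(long x) { return x + 1; }
static __attribute__((noinline)) long add_two(long x) { return x + 2; }

int main(void)
{
	long (*fp[2])(long) = { add_one, add_two };
	long sum = 0;
	long i;

	for (i = 0; i < 100000000; i++) {
		sum = add_one(sum);		/* direct call    */
		sum = fp[i & 1](sum);		/* indirect call  */
		if (sum & 1)			/* conditional    */
			sum ^= 2;
	}
	printf("%ld\n", sum);
	return 0;
}

Running for example 'perf record -j ind_call,u ./branches' against such a
workload would be expected to report mostly the indirect call sites.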

James Clark May 30, 2024, 9:47 a.m. UTC | #3
On 05/04/2024 03:46, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
> 
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> 
> This series applies on 6.9-rc2.
> 
> Also this series is being hosted below for quick access, review and test.
> 
> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
> 
> There are still some open questions regarding handling multiple perf events
> with different privilege branch filters getting on the same PMU, supporting
> guest branch stack tracing from the host etc. Finally also looking for some
> suggestions regarding supporting BRBE inside the guest. The series has been
> re-organized completely as suggested earlier.
> 
> - Anshuman
> 
[...]
> 
> ------------------ Possible 'branch_sample_type' Mismatch -----------------
> 
> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> remain the same for all the events during a perf record session.
> 
> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> 
> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> 
> This 'branch_sample_type' is used to configure the BRBE hardware, when both
> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> remain the same for all events. Let's consider the following example
> 
> $perf record -e cycles:u -e instructions:k -j any,save_type ls
> 
> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> 
> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> BRBE hardware with 'branch_sample_type' from last event to be added in the
> PMU and hence captured branch records only get passed on to matching events
> during a PMU interrupt.
> 

Hi Anshuman,

Surely because of this example we should merge? At least we have to try
to make the most common basic command lines work. Unless we expect all
tools to know whether the branch buffer is shared between PMUs on each
architecture or not. The driver knows though, so can merge the settings
because it all has to go into one BRBE.

Merging the settings in tools would be a much harder problem.

Any user that doesn't have permission for anything in the merged result
can continue to get nothing.

And we can always add filtering in the kernel later on if we want to, to
make it appear to behave even more normally.

> static int
> armpmu_add(struct perf_event *event, int flags)
> {
> 	........
> 	if (has_branch_stack(event)) {
> 		/*
> 		 * Reset branch records buffer if a new task event gets
> 		 * scheduled on a PMU which might have existing records.
> 		 * Otherwise older branch records present in the buffer
> 		 * might leak into the new task event.
> 		 */
> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> 			hw_events->brbe_context = event->ctx;
> 			if (armpmu->branch_reset)
> 				armpmu->branch_reset();
> 		}
> 		hw_events->brbe_users++;
> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
> 	}
> 	........
> }
> 
> Instead of overriding existing 'branch_sample_type', both could be merged.
> 

I can't see any use case where anyone would want the override behavior.
Especially if you imagine multiple users not even aware of each other.
Either the current "no records for mismatches" or the merged one make sense.
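
For reference, a minimal sketch of that merge, reusing the
hw_events->brbe_sample_type element from the snippet quoted above (and
leaving aside both the software filtering question and the need to clear
the mask once brbe_users drops back to zero), could look like:

static int
armpmu_add(struct perf_event *event, int flags)
{
	........
	if (has_branch_stack(event)) {
		........
		hw_events->brbe_users++;
		/*
		 * Merge rather than override: OR in this event's filters so
		 * the single BRBE configuration covers every event that is
		 * currently scheduled on this PMU.
		 */
		hw_events->brbe_sample_type |= event->attr.branch_sample_type;
	}
	........
}

A merged mask is by construction at least as permissive as what any single
event asked for, which is why it would have to be paired with per-event
filtering of the captured records.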

James
James Clark May 30, 2024, 10:03 a.m. UTC | #4
On 05/04/2024 03:46, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
> 
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> 
> This series applies on 6.9-rc2.
> 
> Also this series is being hosted below for quick access, review and test.
> 
> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
> 
> There are still some open questions regarding handling multiple perf events
> with different privilege branch filters getting on the same PMU, supporting
> guest branch stack tracing from the host etc. Finally also looking for some
> suggestions regarding supporting BRBE inside the guest. The series has been
> re-organized completely as suggested earlier.

For guest support I'm still of this opinion:

  * No support for the host looking into guests (the addresses don't
    make sense anyway without also running Perf record in the guest)
  * Save and restore the host buffer and registers on guest switch (if
    it was ever used by either host or guest)
  * Let the guest do whatever it wants with BRBE without any
    virtualisation

Merging this with the current PMU virtualisation stuff seems like a lot
of work for no use case (host looking into guests). Having said that, it
might not even be worth discussing on this patchset apart from "no guest
support", and we can do it later to avoid confusion that it's being
proposed for this version.

James
Mark Rutland May 30, 2024, 5:41 p.m. UTC | #5
On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
> On 05/04/2024 03:46, Anshuman Khandual wrote:
> > This series enables perf branch stack sampling support on arm64 platform
> > via a new arch feature called Branch Record Buffer Extension (BRBE). All
> > the relevant register definitions could be accessed here.
> > 
> > https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> > 
> > This series applies on 6.9-rc2.
> > 
> > Also this series is being hosted below for quick access, review and test.
> > 
> > https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
> > 
> > There are still some open questions regarding handling multiple perf events
> > with different privilege branch filters getting on the same PMU, supporting
> > guest branch stack tracing from the host etc. Finally also looking for some
> > suggestions regarding supporting BRBE inside the guest. The series has been
> > re-organized completely as suggested earlier.
> > 
> > - Anshuman
> > 
> [...]
> > 
> > ------------------ Possible 'branch_sample_type' Mismatch -----------------
> > 
> > Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> > remain the same for all the events during a perf record session.
> > 
> > $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> > 
> > event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> > 
> > This 'branch_sample_type' is used to configure the BRBE hardware, when both
> > events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> > PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> > remain the same for all events. Let's consider the following example
> > 
> > $perf record -e cycles:u -e instructions:k -j any,save_type ls
> > 
> > cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> > 
> > Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> > inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> > BRBE hardware with 'branch_sample_type' from last event to be added in the
> > PMU and hence captured branch records only get passed on to matching events
> > during a PMU interrupt.
> > 
> 
> Hi Anshuman,
> 
> Surely because of this example we should merge? At least we have to try
> to make the most common basic command lines work. Unless we expect all
> tools to know whether the branch buffer is shared between PMUs on each
> architecture or not. The driver knows though, so can merge the settings
> because it all has to go into one BRBE.

The difficulty here is that these are opened as independent events (not
in the same event group), and so from the driver's PoV, this is no
different from two users independently doing:

	perf record -e event:u -j any,save_type -p ${SOME_PID}

	perf record -e event:k -j any,save_type -p ${SOME_PID}

... where either would be surprised to get the merged result.

So, if we merge the filters into the most permissive set, we *must*
filter the records when handing them to userspace so that each event gets the
expected set of branch records.

Assuming we do that, for Anshuman's case above:

	perf record -e cycles:u -e instructions:k -j any,save_type ls	

... the overflow of the cycles event will only record user branch
records, and the overflow of the instructions event will only record
kernel branch records.

I think it's arguable either way as to whether users want that or merged
records; we should probably figure out with the tools folk what the
desired behaviour is for that command line, but as above I don't think
that we can have the kernel give both events merged records unless
userspace asks for that explicitly.

> Merging the settings in tools would be a much harder problem.

I can see that it'd be harder to do that up-front when parsing each
event independently, but there is a phase where the tool knows about all
of the events and *might* be able to do that, whereas the driver doesn't
really know at any point that these events are related.

Regardless, I assume that if the user passes:

	perf record -e cycles:u -e instructions:k -j any,u,k,save_type ls

... both events will be opened with PERF_SAMPLE_BRANCH_USER and
PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
is sufficient.

> Any user that doesn't have permission for anything in the merged result
> can continue to get nothing.
> 
> And we can always add filtering in the kernel later on if we want to, to
> make it appear to behave even more normally.

As above, I think if we merge the HW filters in the kernel then the
kernel must do SW filtering. I don't think that's something we can leave
for later.

> > static int
> > armpmu_add(struct perf_event *event, int flags)
> > {
> > 	........
> > 	if (has_branch_stack(event)) {
> > 		/*
> > 		 * Reset branch records buffer if a new task event gets
> > 		 * scheduled on a PMU which might have existing records.
> > 		 * Otherwise older branch records present in the buffer
> > 		 * might leak into the new task event.
> > 		 */
> > 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> > 			hw_events->brbe_context = event->ctx;
> > 			if (armpmu->branch_reset)
> > 				armpmu->branch_reset();
> > 		}
> > 		hw_events->brbe_users++;
> > Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
> > 	}
> > 	........
> > }
> > 
> > Instead of overriding existing 'branch_sample_type', both could be merged.
> > 
> 
> I can't see any use case where anyone would want the override behavior.
> Especially if you imagine multiple users not even aware of each other.

I completely agree that one event overriding another is not an
acceptable solution.

> Either the current "no records for mismatches" or the merged one make sense.

I think our options are:

1) Do not allow events with conflicting HW filters to be scheduled (e.g.
   treating this like a counter constraint).

2) Allow events with conflicting HW filters to be scheduled, merge the
   active HW filters, and SW filter the records in the kernel.

3) Allow events with conflicting branch filters to be scheduled, but
   only those which match the "active" filter get records.

So far (2) seems to me like the best option, and IIUC that's what x86
does with LBR. I suspect that also justifies discarding records at
context switch time, since we can't know how many relevant records were
discarded due to conflicting records (and so cannot safely stitch
records), and users have to expect that they may get fewer records than
may exist in HW anyway.
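
As a rough illustration of the software filtering in option (2), the
driver could check each captured record against the overflowing event's
own branch_sample_type before saving it into that event's sample. This is
a hypothetical sketch with illustrative names, not code from this series,
and it assumes the record's privilege level is known (e.g. derived from
BRBINFx_EL1 or recorded in the entry's 'priv' field):

static bool branch_record_matches_event(struct perf_event *event,
					struct perf_branch_entry *entry)
{
	u64 br_type = event->attr.branch_sample_type;

	/* Drop kernel records for user-only events, and vice versa */
	if (entry->priv == PERF_BR_PRIV_KERNEL &&
	    !(br_type & PERF_SAMPLE_BRANCH_KERNEL))
		return false;
	if (entry->priv == PERF_BR_PRIV_USER &&
	    !(br_type & PERF_SAMPLE_BRANCH_USER))
		return false;

	return true;
}

The PMU IRQ handler would then copy only matching entries into the branch
stack handed to perf_sample_save_brstack() for each overflowing event, so
every event still sees exactly the set of records it asked for, however
permissive the merged hardware configuration ends up being.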

Mark.
Mark Rutland May 31, 2024, 1:01 p.m. UTC | #6
On Thu, May 30, 2024 at 06:41:14PM +0100, Mark Rutland wrote:
> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
> > On 05/04/2024 03:46, Anshuman Khandual wrote:
> > > ------------------ Possible 'branch_sample_type' Mismatch -----------------
> > > 
> > > Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> > > remain the same for all the events during a perf record session.
> > > 
> > > $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> > > 
> > > event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> > > 
> > > This 'branch_sample_type' is used to configure the BRBE hardware, when both
> > > events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> > > PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> > > remain the same for all events. Let's consider the following example
> > > 
> > > $perf record -e cycles:u -e instructions:k -j any,save_type ls
> > > 
> > > cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> > > 
> > > Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> > > inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> > > BRBE hardware with 'branch_sample_type' from last event to be added in the
> > > PMU and hence captured branch records only get passed on to matching events
> > > during a PMU interrupt.
> > > 
> > 
> > Hi Anshuman,
> > 
> > Surely because of this example we should merge? At least we have to try
> > to make the most common basic command lines work. Unless we expect all
> > tools to know whether the branch buffer is shared between PMUs on each
> > architecture or not. The driver knows though, so can merge the settings
> > because it all has to go into one BRBE.
> 
> The difficulty here is that these are opened as independent events (not
> in the same event group), and so from the driver's PoV, this is no
> different to two users independently doing:
> 
> 	perf record -e event:u -j any,save_type -p ${SOME_PID}
> 
> 	perf record -e event:k -j any,save_type -p ${SOME_PID}
> 
> .. where either would be surprised to get the merged result.

I took a look at how x86 handles this, and it looks like they may have the
problem we'd like to avoid. AFAICT, intel_pmu_lbr_add() blats cpuc->br_sel with
the branch selection of the last event added.

So I took a look at what happens on my x86-64 desktop running v5.10.0-9-amd64
from Debian 11.

Running the following program:

| int main (int argc, char *argv[])
| {
|         for (;;) {
|                 asm volatile("" ::: "memory");
|         }
| 
|         return 0;
| }	

I set /proc/sys/kernel/perf_event_paranoid to 2 and started two independent
perf sessions:

	perf record -e cycles:u -j any -o perf-user.data -p 1320224

	sudo perf record -e cycles:k -j any -o perf-kernel.data -p 1320224

... after ~10 seconds, I killed both sessions with ^C.

When I subsequently do 'perf report -i perf-kernel.data', I see:

| Samples: 295  of event 'cycles:k', Event count (approx.): 295
| Overhead  Command  Source Shared Object  Source Symbol               Target Symbol  Basic Block Cycles
|   99.66%  loop     loop                  [k] main                    [k] main       -
|    0.34%  loop     [kernel.kallsyms]     [k] native_irq_return_iret  [k] main       -

... where the user symbols are surprising.

Similarly for 'perf report -i perf-user.data', I see:

| Samples: 198K of event 'cycles:u', Event count (approx.): 198739
| Overhead  Command  Source Shared Object  Source Symbol           Target Symbol           Basic Block Cycles
|   99.99%  loop     loop                  [.] main                [.] main                -
|    0.00%  loop     [unknown]             [.] 0xffffffff87801007  [.] main                -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e05626  [.] 0xffffffff86e05629  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e0563d  [.] 0xffffffff86e0c850  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e0c86f  [.] 0xffffffff86e6b3f0  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e0c884  [.] 0xffffffff86e11ed0  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e0c88a  [.] 0xffffffff86e13850  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e11eee  [.] 0xffffffff86e0c889  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e13885  [.] 0xffffffff86e13888  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e13889  [.] 0xffffffff86e138a1  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e138a9  [.] 0xffffffff86e6b320  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e138c3  [.] 0xffffffff86e6b3f0  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e6b33a  [.] 0xffffffff86e138ae  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86e6b3fb  [.] 0xffffffff86e0c874  -
|    0.00%  loop     [unknown]             [.] 0xffffffff86ff6c91  [.] 0xffffffff87a01ca0  -
|    0.00%  loop     [unknown]             [.] 0xffffffff87a01ca0  [.] 0xffffffff87a01ca5  -
|    0.00%  loop     [unknown]             [.] 0xffffffff87a01ca5  [.] 0xffffffff87a01cb1  -
|    0.00%  loop     [unknown]             [.] 0xffffffff87a01cb5  [.] 0xffffffff86e05600  -

... where the unknown (kernel!) samples are surprising.

Peter, do you have any opinion on this?

My thinking is that the "last scheduled event branch selection wins"
isn't the behaviour we actually want, and either:

(a) Conflicting events shouldn't be scheduled concurrently (e.g. treat
    that like running out of counters).

(b) The HW filters should be configured to allow anything permitted by
    any of the events, and SW filtering should remove the unexpected
    records on a per-event basis.

... but I imagine (b) may be hard? I don't know whether LBR tells you which
CPU mode the src/dst were in.

Mark.
Anshuman Khandual June 3, 2024, 9:18 a.m. UTC | #7
On 5/30/24 15:33, James Clark wrote:
> 
> 
> On 05/04/2024 03:46, Anshuman Khandual wrote:
>> This series enables perf branch stack sampling support on arm64 platform
>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>> the relevant register definitions could be accessed here.
>>
>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>
>> This series applies on 6.9-rc2.
>>
>> Also this series is being hosted below for quick access, review and test.
>>
>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>
>> There are still some open questions regarding handling multiple perf events
>> with different privilege branch filters getting on the same PMU, supporting
>> guest branch stack tracing from the host etc. Finally also looking for some
>> suggestions regarding supporting BRBE inside the guest. The series has been
>> re-organized completely as suggested earlier.
> 
> For guest support I'm still of this opinion:
> 
>   * No support for the host looking into guests (the addresses don't
>     make sense anyway without also running Perf record in the guest)
>   * Save and restore the host buffer and registers on guest switch (if
>     it was ever used by either host or guest)
>   * Let the guest do whatever it wants with BRBE without any
>     virtualisation
> 
> Merging this with the current PMU virtualisation stuff seems like a lot
> of work for no use case (host looking into guests). Having said that, it
> might not even be worth discussing on this patchset apart from "no guest
> support", and we can do it later to avoid confusion that it's being
> proposed for this version.

Agreed, let's just have "no guest support" for now in this proposed series,
without any additional changes, to keep things simpler and separate. I will
also update the cover letter next time to make this clear.
Mark Rutland June 3, 2024, 9:39 a.m. UTC | #8
On Mon, Jun 03, 2024 at 02:48:58PM +0530, Anshuman Khandual wrote:
> 
> 
> On 5/30/24 15:33, James Clark wrote:
> > 
> > 
> > On 05/04/2024 03:46, Anshuman Khandual wrote:
> >> This series enables perf branch stack sampling support on arm64 platform
> >> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> >> the relevant register definitions could be accessed here.
> >>
> >> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> >>
> >> This series applies on 6.9-rc2.
> >>
> >> Also this series is being hosted below for quick access, review and test.
> >>
> >> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
> >>
> >> There are still some open questions regarding handling multiple perf events
> >> with different privilege branch filters getting on the same PMU, supporting
> >> guest branch stack tracing from the host etc. Finally also looking for some
> >> suggestions regarding supporting BRBE inside the guest. The series has been
> >> re-organized completely as suggested earlier.
> > 
> > For guest support I'm still of this opinion:
> > 
> >   * No support for the host looking into guests (the addresses don't
> >     make sense anyway without also running Perf record in the guest)
> >   * Save and restore the host buffer and registers on guest switch (if
> >     it was ever used by either host or guest)
> >   * Let the guest do whatever it wants with BRBE without any
> >     virtualisation
> > 
> > Merging this with the current PMU virtualistion stuff seems like a lot
> > of work for no use case (host looking into guests). Having said that, it
> > might not even be worth discussing on this patchset apart from "no guest
> > support", and we can do it later to avoid confusion that it's being
> > proposed for this version.
> 
> Agreed, let's just have "no guest support" for now in this proposed series
> without any more additional changes to keep things simpler and separated.
> I will also update the cover letter next time around making this clear.

FWIW, that sounds good to me.

Mark.
Anshuman Khandual June 6, 2024, 3:57 a.m. UTC | #9
On 5/30/24 15:17, James Clark wrote:
> 
> 
> On 05/04/2024 03:46, Anshuman Khandual wrote:
>> This series enables perf branch stack sampling support on arm64 platform
>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>> the relevant register definitions could be accessed here.
>>
>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>
>> This series applies on 6.9-rc2.
>>
>> Also this series is being hosted below for quick access, review and test.
>>
>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>
>> There are still some open questions regarding handling multiple perf events
>> with different privilege branch filters getting on the same PMU, supporting
>> guest branch stack tracing from the host etc. Finally also looking for some
>> suggestions regarding supporting BRBE inside the guest. The series has been
>> re-organized completely as suggested earlier.
>>
>> - Anshuman
>>
> [...]
>>
>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>
>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>> remain the same for all the events during a perf record session.
>>
>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>
>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>
>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>> remain the same for all events. Let's consider the following example
>>
>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>
>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>
>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>> PMU and hence captured branch records only get passed on to matching events
>> during a PMU interrupt.
>>
> 
> Hi Anshuman,
> 
> Surely because of this example we should merge? At least we have to try
> to make the most common basic command lines work. Unless we expect all
> tools to know whether the branch buffer is shared between PMUs on each
> architecture or not. The driver knows though, so can merge the settings
> because it all has to go into one BRBE.
> 
> Merging the settings in tools would be a much harder problem.

Alright, makes sense.

> 
> Any user that doesn't have permission for anything in the merged result
> can continue to get nothing.
> 
> And we can always add filtering in the kernel later on if we want to to
> make it appear to behave even more normally.

Understood.

> 
>> static int
>> armpmu_add(struct perf_event *event, int flags)
>> {
>> 	........
>> 	if (has_branch_stack(event)) {
>> 		/*
>> 		 * Reset branch records buffer if a new task event gets
>> 		 * scheduled on a PMU which might have existing records.
>> 		 * Otherwise older branch records present in the buffer
>> 		 * might leak into the new task event.
>> 		 */
>> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>> 			hw_events->brbe_context = event->ctx;
>> 			if (armpmu->branch_reset)
>> 				armpmu->branch_reset();
>> 		}
>> 		hw_events->brbe_users++;
>> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
>> 	}
>> 	........
>> }
>>
>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>
> 
> I can't see any use case where anyone would want the override behavior.
> Especially if you imagine multiple users not even aware of each other.
> Either the current "no records for mismatches" or the merged one make sense.

Hence I had listed all three available options.
Anshuman Khandual June 6, 2024, 4:58 a.m. UTC | #10
On 5/30/24 23:11, Mark Rutland wrote:
> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>>> This series enables perf branch stack sampling support on arm64 platform
>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>> the relevant register definitions could be accessed here.
>>>
>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>>
>>> This series applies on 6.9-rc2.
>>>
>>> Also this series is being hosted below for quick access, review and test.
>>>
>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>>
>>> There are still some open questions regarding handling multiple perf events
>>> with different privilege branch filters getting on the same PMU, supporting
>>> guest branch stack tracing from the host etc. Finally also looking for some
>>> suggestions regarding supporting BRBE inside the guest. The series has been
>>> re-organized completely as suggested earlier.
>>>
>>> - Anshuman
>>>
>> [...]
>>>
>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>
>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>> remain the same for all the events during a perf record session.
>>>
>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>
>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>
>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>>> remain the same for all events. Let's consider the following example
>>>
>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>
>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>
>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>>> PMU and hence captured branch records only get passed on to matching events
>>> during a PMU interrupt.
>>>
>>
>> Hi Anshuman,
>>
>> Surely because of this example we should merge? At least we have to try
>> to make the most common basic command lines work. Unless we expect all
>> tools to know whether the branch buffer is shared between PMUs on each
>> architecture or not. The driver knows though, so can merge the settings
>> because it all has to go into one BRBE.
> 
> The difficulty here is that these are opened as independent events (not
> in the same event group), and so from the driver's PoV, this is no
> different to two users independently doing:
> 
> 	perf record -e event:u -j any,save_type -p ${SOME_PID}
> 
> 	perf record -e event:k -j any,save_type -p ${SOME_PID}
> 
> ... where either would be surprised to get the merged result.

Right, that's the problem. The proposed idea here ensures that each event
will only get the branch records it expects, even though the sample size
might get reduced because the HW filter overrides might not be evenly split
between the events during the perf session.

> 
> So, if we merge the filters into the most permissive set, we *must*
> filter them when handing them to userspace so that each event gets the
> expected set of branch records.

Agreed, if the branch filters get merged to the most permissive set via
an OR semantics, then results must be filtered before being pushed into
the ring buffer for each individual event that has overflown during the
PMU IRQ.

> 
> Assuming we do that, for Anshuman's case above:
> 
> 	perf record -e cycles:u -e instructions:k -j any,save_type ls	
> 
> ... the overflow of the cycles event will only record user branch
> records, and the overflow of the instructions event will only record
> kernel branch records.

Right.

> 
> I think it's arguable either way as to whether users want that or merged
> records; we should probably figure out with the tools folk what the
> desired behaviour is for that command line, but as above I don't think
> that we can have the kernel give both events merged records unless
> userspace asks for that explicitly.

Right, we should not give merged results unless explicitly asked for by the
event. Otherwise that might break the semantics.

> 
>> Merging the settings in tools would be a much harder problem.
> 
> I can see that it'd be harder to do that up-front when parsing each
> event independently, but there is a phase where the tool knows about all
> of the events and *might* be able to do that, where the driver doesn't
> really know at any point that these events are related.
> 
> Regardless, I assume that if the user passes:
> 
> 	perf record -e cycles:u -e instructions:k -j any,u,k,save_type ls
> 
> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
> is sufficient.

Kernel filtering will not be required in this case as "-j any,u,k" overrides
both events' individual privilege requests i.e. cycles:u and instructions:k. So
both events will receive branch records related to PERF_SAMPLE_BRANCH_USER
and PERF_SAMPLE_BRANCH_KERNEL. From a branch sample perspective, the privilege
filter is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
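
Just to illustrate (assuming the tool maps "-j any,u,k,save_type" to the same
mask for both events, which is how I read the current option parsing), both
events end up with:

	u64 branch_sample_type = PERF_SAMPLE_BRANCH_ANY
			       | PERF_SAMPLE_BRANCH_USER
			       | PERF_SAMPLE_BRANCH_KERNEL
			       | PERF_SAMPLE_BRANCH_TYPE_SAVE;

so the OR-merge is a no-op and the existing
(cpuc->branch_sample_type != event->attr.branch_sample_type) check in
read_branch_records() never triggers.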

> 
>> Any user that doesn't have permission for anything in the merged result
>> can continue to get nothing.
>>
>> And we can always add filtering in the kernel later on if we want to to
>> make it appear to behave even more normally.
> 
> As above, I think if we merge the HW filters in the kernel then the
> kernel must do SW filtering. I don't think that's something we can leave
> for later.

Alright.

> 
>>> static int
>>> armpmu_add(struct perf_event *event, int flags)
>>> {
>>> 	........
>>> 	if (has_branch_stack(event)) {
>>> 		/*
>>> 		 * Reset branch records buffer if a new task event gets
>>> 		 * scheduled on a PMU which might have existing records.
>>> 		 * Otherwise older branch records present in the buffer
>>> 		 * might leak into the new task event.
>>> 		 */
>>> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>> 			hw_events->brbe_context = event->ctx;
>>> 			if (armpmu->branch_reset)
>>> 				armpmu->branch_reset();
>>> 		}
>>> 		hw_events->brbe_users++;
>>> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>> 	}
>>> 	........
>>> }
>>>
>>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>>
>>
>> I can't see any use case where anyone would want the override behavior.
>> Especially if you imagine multiple users not even aware of each other.
> 
> I completely agree that one event overriding another is not an
> acceptable solution.
> 
>> Either the current "no records for mismatches" or the merged one make sense.
> 
> I think our options are:
> 
> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
>    treating this like a counter constraint).

That's the easiest solution and will keep things simple, but the downside is
that the sample size will probably get much reduced. Then again, such scenarios
should be rare, and hence won't matter much.

> 
> 2) Allow events with conflicting HW filters to be scheduled, merge the
>    active HW filters, and SW filter the records in the kernel.
> 
> 3) Allow events with conflicting branch filters to be scheduled, but
>    only those which match the "active" filter get records.

That's the proposed solution. The "active" filter is decided by whichever event
comes last, overriding the previous one, and the PMU interrupt handler delivers
branch records only to the matching events.

> 
> So far (2) seems to me like the best option, and IIUC that's what x86
> does with LBR. I suspect that also justifies discarding records at
> context switch time, since we can't know how many relevant records were
> discarded due to conflicting records (and so cannot safely stitch
> records), and users have to expect that they may get fewer records than
> may exist in HW anyway.

So if we implement merging of branch filter requests and SW filtering of the
captured branch records, should we also drop the save and stitch mechanism
completely?

Coming back to the implementation for option (2), the following code change
(tentative and untested) will merge branch filter requests and drop the event
filter check during PMU interrupt.

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 45ac2d0ca04c..9afa4e48d957 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
                armv8pmu_branch_stack_reset();
        }
        hw_events->branch_users++;
-       hw_events->branch_sample_type = event->attr.branch_sample_type;
+
+       /*
+        * Merge all branch filter requests from different perf
+        * events being added into this PMU. This includes both
+        * privilege and branch type filters.
+        */
+       hw_events->branch_sample_type |= event->attr.branch_sample_type;
 }
 
 void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 6137ae4ba7c3..c5311d5365cc 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
                return;
 
        /*
-        * Overflowed event's branch_sample_type does not match the configured
-        * branch filters in the BRBE HW. So the captured branch records here
-        * cannot be co-related to the overflowed event. Report to the user as
-        * if no branch records have been captured, and flush branch records.
-        * The same scenario is applicable when the current task context does
-        * not match with overflown event.
+        * When the current task context does not match with the PMU overflown
+        * event, the captured branch records here cannot be co-related to the
+        * overflowed event. Report to the user as if no branch records have
+        * been captured, and flush branch records.
         */
-       if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
-           (event->ctx->task && cpuc->branch_context != event->ctx))
+       if (event->ctx->task && cpuc->branch_context != event->ctx)
                return;
 
        /*

and something like the following change (tentative and untested) implements the
required SW branch record filtering.

diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index c5311d5365cc..d2390657c466 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
        armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
 }
 
+static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
+{
+       u64 br_type = event->attr.branch_sample_type;
+       u64 mask = PERF_BR_UNCOND;
+
+       if (br_type & PERF_SAMPLE_BRANCH_ANY)
+               return true;
+
+       if (entry->type == PERF_BR_UNKNOWN)
+               return true;
+
+       if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+               mask |= PERF_BR_IND;
+
+       if (br_type & PERF_SAMPLE_BRANCH_COND) {
+               mask &= ~PERF_BR_UNCOND;
+               mask |= PERF_BR_COND;
+       }
+
+       if (br_type & PERF_SAMPLE_BRANCH_CALL)
+               mask |= PERF_BR_CALL;
+
+       if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+               mask |= PERF_BR_IND_CALL;
+
+       if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+               mask |= PERF_BR_CALL;
+               mask |= PERF_BR_IRQ;
+               mask |= PERF_BR_SYSCALL;
+               mask |= PERF_BR_SERROR;
+               if (br_type & PERF_SAMPLE_BRANCH_COND)
+                       mask |= PERF_BR_COND_CALL;
+       }
+
+       if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
+               mask |= PERF_BR_RET;
+               mask |= PERF_BR_ERET;
+               mask |= PERF_BR_SYSRET;
+               if (br_type & PERF_SAMPLE_BRANCH_COND)
+                       mask |= PERF_BR_COND_RET;
+       }
+       return mask & entry->type;
+}
+
+static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
+{
+       u64 br_type = event->attr.branch_sample_type;
+
+       if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
+               if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
+                       return false;
+       }
+
+       if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
+               if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
+                       return false;
+       }
+
+       if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
+               if (is_kernel_in_hyp_mode()) {
+                       if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
+                               return false;
+               }
+       }
+       return true;
+}
+
+static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
+                                 struct branch_records *event_records)
+{
+       struct perf_branch_entry *entry;
+       int idx, count;
+
+       memset(event_records, 0, sizeof(struct branch_records));
+       for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
+               entry = &cpuc->branches->branch_entries[idx];
+               if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
+                       continue;
+
+               memcpy(&event_records->branch_entries[count], entry, sizeof(struct perf_branch_entry));
+               count++;
+       }
+       event_records->branch_stack.nr = count;
+}
+
 static void read_branch_records(struct pmu_hw_events *cpuc,
                                struct perf_event *event,
                                struct perf_sample_data *data,
                                bool *branch_captured)
 {
+       struct branch_records event_records;
+
        /*
         * CPU specific branch records buffer must have been allocated already
         * for the hardware records to be captured and processed further.
@@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
                armv8pmu_branch_read(cpuc, event);
                *branch_captured = true;
        }
+
+       /*
+        * Filter captured branch records
+        *
+        * PMU captured branch records would contain samples applicable for
+        * the aggregated branch filters, for all events that got scheduled
+        * on this PMU. Hence the branch records need to be filtered first
+        * so that each individual event gets the samples it requested.
+        */
+       if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
+               filter_branch_records(event, cpuc, &event_records);
+               perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
+               return;
+       }
        perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
 }
Anshuman Khandual June 6, 2024, 6:27 a.m. UTC | #11
On 6/6/24 10:28, Anshuman Khandual wrote:
> On 5/30/24 23:11, Mark Rutland wrote:
>> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>>>> This series enables perf branch stack sampling support on arm64 platform
>>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>>> the relevant register definitions could be accessed here.
>>>>
>>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>>>
>>>> This series applies on 6.9-rc2.
>>>>
>>>> Also this series is being hosted below for quick access, review and test.
>>>>
>>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>>>
>>>> There are still some open questions regarding handling multiple perf events
>>>> with different privilege branch filters getting on the same PMU, supporting
>>>> guest branch stack tracing from the host etc. Finally also looking for some
>>>> suggestions regarding supporting BRBE inside the guest. The series has been
>>>> re-organized completely as suggested earlier.
>>>>
>>>> - Anshuman
>>>>
>>> [...]
>>>>
>>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>>
>>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>>> remain the same for all the events during a perf record session.
>>>>
>>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>>
>>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>>
>>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>>>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>>>> remain the same for all events. Let's consider the following example
>>>>
>>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>>
>>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>>
>>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>>>> PMU and hence captured branch records only get passed on to matching events
>>>> during a PMU interrupt.
>>>>
>>>
>>> Hi Anshuman,
>>>
>>> Surely because of this example we should merge? At least we have to try
>>> to make the most common basic command lines work. Unless we expect all
>>> tools to know whether the branch buffer is shared between PMUs on each
>>> architecture or not. The driver knows though, so can merge the settings
>>> because it all has to go into one BRBE.
>>
>> The difficulty here is that these are opened as independent events (not
>> in the same event group), and so from the driver's PoV, this is no
>> different two two users independently doing:
>>
>> 	perf record -e event:u -j any,save_type -p ${SOME_PID}
>>
>> 	perf record -e event:k -j any,save_type -p ${SOME_PID}
>>
>> ... where either would be surprised to get the merged result.
> 
> Right, that's the problem. The proposed idea here ensures that each event
> here will get only expected branch records, even though sample size might
> get reduced as the HW filters overrides might not be evenly split between
> them during the perf session.
> 
>>
>> So, if we merge the filters into the most permissive set, we *must*
>> filter them when handing them to userspace so that each event gets the
>> expected set of branch records.
> 
> Agreed, if the branch filters get merged to the most permissive set via
> an OR semantics, then results must be filtered before being pushed into
> the ring buffer for each individual event that has overflown during the
> PMU IRQ.
> 
>>
>> Assuming we do that, for Anshuman's case above:
>>
>> 	perf record -e cycles:u -e instructions:k -j any,save_type ls	
>>
>> ... the overflow of the cycles evnt will only record user branch
>> records, and the overflow of the instructions event will only record
>> kernel branch records.
> 
> Right.
> 
>>
>> I think it's arguable either way as to whether users want that or merged
>> records; we should probably figure out with the tools folk what the
>> desired behaviour is for that command line, but as above I don't think
>> that we can have the kernel give both events merged records unless
>> userspace asks for that explicitly.
> 
> Right, we should not give merged results unless explicitly asked by the
> event. Otherwise that might break the semantics.
> 
>>
>>> Merging the settings in tools would be a much harder problem.
>>
>> I can see that it'd be harder to do that up-front when parsing each
>> event independently, but there is a phase where the tool knows about all
>> of the events and *might* be able to do that, where the driver doesn't
>> really know at any point that these events are related.
>>
>> Regardless, I assume that if the user passes:
>>
>> 	perf record -e cycles:u -e instructions:k -k any,u,k,save_type ls
>>
>> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
>> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
>> is sufficient.
> Kernel filtering will not be required in this case as "-j any,u,k," overrides
> both event's individual privilege request i.e cycles:u and instructions:k. So
> both the events will receive branch records related to PERF_SAMPLE_BRANCH_USER
> and PERF_SAMPLE_BRANCH_KERNEL. From branch sample perspective privilege filter
> is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
> 
>>
>>> Any user that doesn't have permission for anything in the merged result
>>> can continue to get nothing.
>>>
>>> And we can always add filtering in the kernel later on if we want to to
>>> make it appear to behave even more normally.
>>
>> As above, I think if we merge the HW filters in the kernel then the
>> kernel must do SW filtering. I don't think that's something we can leave
>> for later.
> 
> Alright.
> 
>>
>>>> static int
>>>> armpmu_add(struct perf_event *event, int flags)
>>>> {
>>>> 	........
>>>> 	if (has_branch_stack(event)) {
>>>> 		/*
>>>> 		 * Reset branch records buffer if a new task event gets
>>>> 		 * scheduled on a PMU which might have existing records.
>>>> 		 * Otherwise older branch records present in the buffer
>>>> 		 * might leak into the new task event.
>>>> 		 */
>>>> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>>> 			hw_events->brbe_context = event->ctx;
>>>> 			if (armpmu->branch_reset)
>>>> 				armpmu->branch_reset();
>>>> 		}
>>>> 		hw_events->brbe_users++;
>>>> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>>> 	}
>>>> 	........
>>>> }
>>>>
>>>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>>>
>>>
>>> I can't see any use case where anyone would want the override behavior.
>>> Especially if you imagine multiple users not even aware of each other.
>>
>> I completely agree that one event overriding another is not an
>> acceptable solution.
>>
>>> Either the current "no records for mismatches" or the merged one make sense.
>>
>> I think our options are:
>>
>> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
>>    treating this like a counter constraint).
> 
> That's the easiest solution and will keep things simple but downside being
> the sample size will probably get much reduced. But such scenarios will be
> rare, and hence won't matter much.
> 
>>
>> 2) Allow events with conflicting HW filters to be scheduled, merge the
>>    active HW filters, and SW filter the records in the kernel.
>>
>> 3) Allow events with conflicting branch filters to be scheduled, but
>>    only those which match the "active" filter get records.
> 
> That's the proposed solution. "Active" filter gets decided on which event
> comes last thus override the previous and PMU interrupt handler delivers
> branch records only to the matching events.
> 
>>
>> So far (2) seems to me like the best option, and IIUC that's what x86
>> does with LBR. I suspect that also justifies discarding records at
>> context switch time, since we can't know how many relevant records were
>> discarded due to conflicting records (and so cannot safely stitch
>> records), and users have to expect that they may get fewer records than
>> may exist in HW anyway.
> 
> So if we implement merging branch filters requests and SW filtering for the
> captured branch records, we should also drop saving and stitching mechanism
> completely ?
> 
> Coming back to the implementation for option (2), the following code change
> (tentative and untested) will merge branch filter requests and drop the event
> filter check during PMU interrupt.
> 
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index 45ac2d0ca04c..9afa4e48d957 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
>                 armv8pmu_branch_stack_reset();
>         }
>         hw_events->branch_users++;
> -       hw_events->branch_sample_type = event->attr.branch_sample_type;
> +
> +       /*
> +        * Merge all branch filter requests from different perf
> +        * events being added into this PMU. This includes both
> +        * privilege and branch type filters.
> +        */
> +       hw_events->branch_sample_type |= event->attr.branch_sample_type;
>  }
>  
>  void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 6137ae4ba7c3..c5311d5365cc 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 return;
>  
>         /*
> -        * Overflowed event's branch_sample_type does not match the configured
> -        * branch filters in the BRBE HW. So the captured branch records here
> -        * cannot be co-related to the overflowed event. Report to the user as
> -        * if no branch records have been captured, and flush branch records.
> -        * The same scenario is applicable when the current task context does
> -        * not match with overflown event.
> +        * When the current task context does not match with the PMU overflown
> +        * event, the captured branch records here cannot be co-related to the
> +        * overflowed event. Report to the user as if no branch records have
> +        * been captured, and flush branch records.
>          */
> -       if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
> -           (event->ctx->task && cpuc->branch_context != event->ctx))
> +       if (event->ctx->task && cpuc->branch_context != event->ctx)
>                 return;
>  
>         /*
> 
> and something like the following change (tentative and untested) implements the
> required SW branch records filtering.
> 
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c5311d5365cc..d2390657c466 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
>         armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
>  }
>  
> +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +       u64 mask = PERF_BR_UNCOND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY)
> +               return true;
> +
> +       if (entry->type == PERF_BR_UNKNOWN)
> +               return true;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> +               mask |= PERF_BR_IND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_COND) {
> +               mask &= ~PERF_BR_UNCOND;
> +               mask |= PERF_BR_COND;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_CALL)
> +               mask |= PERF_BR_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> +               mask |= PERF_BR_IND_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +               mask |= PERF_BR_CALL;
> +               mask |= PERF_BR_IRQ;
> +               mask |= PERF_BR_SYSCALL;
> +               mask |= PERF_BR_SERROR;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_CALL;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +               mask |= PERF_BR_RET;
> +               mask |= PERF_BR_ERET;
> +               mask |= PERF_BR_SYSRET;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_RET;
> +       }
> +       return mask & entry->type;
> +}
> +
> +static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
> +               if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
> +               if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
> +               if (is_kernel_in_hyp_mode()) {
> +                       if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                               return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
> +                                 struct branch_records *event_records)
> +{
> +       struct perf_branch_entry *entry;
> +       int idx, count;
> +
> +       memset(event_records, 0, sizeof(struct branch_records));
> +       for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
> +               entry = &cpuc->branches->branch_entries[idx];
> +               if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
> +                       continue;
> +
> +               memcpy(&event_records->branch_entries[count], &entry, sizeof(struct perf_branch_entry));
> +               count++;
> +       }
> +}
> +
>  static void read_branch_records(struct pmu_hw_events *cpuc,
>                                 struct perf_event *event,
>                                 struct perf_sample_data *data,
>                                 bool *branch_captured)
>  {
> +       struct branch_records event_records;
> +
>         /*
>          * CPU specific branch records buffer must have been allocated already
>          * for the hardware records to be captured and processed further.
> @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 armv8pmu_branch_read(cpuc, event);
>                 *branch_captured = true;
>         }
> +
> +       /*
> +        * Filter captured branch records
> +        *
> +        * PMU captured branch records would contain samples applicable for
> +        * the agregated branch filters, for all events that got scheduled
> +        * on this PMU. Hence the branch records need to be filtered first
> +        * so that each individual event get samples they had requested.
> +        */
> +       if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
> +               filter_branch_records(event, cpuc, &event_records);
> +               perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
> +               return;
> +       }
>         perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
>  }

But now with the above SW filtering enabled during branch record processing, there
are two possible problems:

- Potential for RCU stalls, which I have observed at times
- IRQ handling gets delayed while the records are processed repeatedly for each event
James Clark June 6, 2024, 11:01 a.m. UTC | #12
On 06/06/2024 05:58, Anshuman Khandual wrote:
> On 5/30/24 23:11, Mark Rutland wrote:
>> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>>>> This series enables perf branch stack sampling support on arm64 platform
>>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>>> the relevant register definitions could be accessed here.
>>>>
>>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>>>
>>>> This series applies on 6.9-rc2.
>>>>
>>>> Also this series is being hosted below for quick access, review and test.
>>>>
>>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>>>
>>>> There are still some open questions regarding handling multiple perf events
>>>> with different privilege branch filters getting on the same PMU, supporting
>>>> guest branch stack tracing from the host etc. Finally also looking for some
>>>> suggestions regarding supporting BRBE inside the guest. The series has been
>>>> re-organized completely as suggested earlier.
>>>>
>>>> - Anshuman
>>>>
>>> [...]
>>>>
>>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>>
>>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>>> remain the same for all the events during a perf record session.
>>>>
>>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>>
>>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>>
>>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>>>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>>>> remain the same for all events. Let's consider the following example
>>>>
>>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>>
>>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>>
>>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>>>> PMU and hence captured branch records only get passed on to matching events
>>>> during a PMU interrupt.
>>>>
>>>
>>> Hi Anshuman,
>>>
>>> Surely because of this example we should merge? At least we have to try
>>> to make the most common basic command lines work. Unless we expect all
>>> tools to know whether the branch buffer is shared between PMUs on each
>>> architecture or not. The driver knows though, so can merge the settings
>>> because it all has to go into one BRBE.
>>
>> The difficulty here is that these are opened as independent events (not
>> in the same event group), and so from the driver's PoV, this is no
>> different two two users independently doing:
>>
>> 	perf record -e event:u -j any,save_type -p ${SOME_PID}
>>
>> 	perf record -e event:k -j any,save_type -p ${SOME_PID}
>>
>> ... where either would be surprised to get the merged result.
> 
> Right, that's the problem. The proposed idea here ensures that each event
> here will get only expected branch records, even though sample size might
> get reduced as the HW filters overrides might not be evenly split between
> them during the perf session.
> 
>>
>> So, if we merge the filters into the most permissive set, we *must*
>> filter them when handing them to userspace so that each event gets the
>> expected set of branch records.
> 
> Agreed, if the branch filters get merged to the most permissive set via
> an OR semantics, then results must be filtered before being pushed into
> the ring buffer for each individual event that has overflown during the
> PMU IRQ.
> 
>>
>> Assuming we do that, for Anshuman's case above:
>>
>> 	perf record -e cycles:u -e instructions:k -j any,save_type ls	
>>
>> ... the overflow of the cycles evnt will only record user branch
>> records, and the overflow of the instructions event will only record
>> kernel branch records.
> 
> Right.
> 
>>
>> I think it's arguable either way as to whether users want that or merged
>> records; we should probably figure out with the tools folk what the
>> desired behaviour is for that command line, but as above I don't think
>> that we can have the kernel give both events merged records unless
>> userspace asks for that explicitly.
> 
> Right, we should not give merged results unless explicitly asked by the
> event. Otherwise that might break the semantics.
> 
>>
>>> Merging the settings in tools would be a much harder problem.
>>
>> I can see that it'd be harder to do that up-front when parsing each
>> event independently, but there is a phase where the tool knows about all
>> of the events and *might* be able to do that, where the driver doesn't
>> really know at any point that these events are related.
>>
>> Regardless, I assume that if the user passes:
>>
>> 	perf record -e cycles:u -e instructions:k -k any,u,k,save_type ls
>>
>> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
>> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
>> is sufficient.
> Kernel filtering will not be required in this case as "-j any,u,k," overrides
> both event's individual privilege request i.e cycles:u and instructions:k. So
> both the events will receive branch records related to PERF_SAMPLE_BRANCH_USER
> and PERF_SAMPLE_BRANCH_KERNEL. From branch sample perspective privilege filter
> is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
> 
>>
>>> Any user that doesn't have permission for anything in the merged result
>>> can continue to get nothing.
>>>
>>> And we can always add filtering in the kernel later on if we want to to
>>> make it appear to behave even more normally.
>>
>> As above, I think if we merge the HW filters in the kernel then the
>> kernel must do SW filtering. I don't think that's something we can leave
>> for later.
> 
> Alright.
> 
>>
>>>> static int
>>>> armpmu_add(struct perf_event *event, int flags)
>>>> {
>>>> 	........
>>>> 	if (has_branch_stack(event)) {
>>>> 		/*
>>>> 		 * Reset branch records buffer if a new task event gets
>>>> 		 * scheduled on a PMU which might have existing records.
>>>> 		 * Otherwise older branch records present in the buffer
>>>> 		 * might leak into the new task event.
>>>> 		 */
>>>> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>>> 			hw_events->brbe_context = event->ctx;
>>>> 			if (armpmu->branch_reset)
>>>> 				armpmu->branch_reset();
>>>> 		}
>>>> 		hw_events->brbe_users++;
>>>> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>>> 	}
>>>> 	........
>>>> }
>>>>
>>>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>>>
>>>
>>> I can't see any use case where anyone would want the override behavior.
>>> Especially if you imagine multiple users not even aware of each other.
>>
>> I completely agree that one event overriding another is not an
>> acceptable solution.
>>
>>> Either the current "no records for mismatches" or the merged one make sense.
>>
>> I think our options are:
>>
>> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
>>    treating this like a counter constraint).
> 
> That's the easiest solution and will keep things simple but downside being
> the sample size will probably get much reduced. But such scenarios will be
> rare, and hence won't matter much.
> 
>>
>> 2) Allow events with conflicting HW filters to be scheduled, merge the
>>    active HW filters, and SW filter the records in the kernel.
>>
>> 3) Allow events with conflicting branch filters to be scheduled, but
>>    only those which match the "active" filter get records.
> 
> That's the proposed solution. "Active" filter gets decided on which event
> comes last thus override the previous and PMU interrupt handler delivers
> branch records only to the matching events.
> 
>>
>> So far (2) seems to me like the best option, and IIUC that's what x86
>> does with LBR. I suspect that also justifies discarding records at
>> context switch time, since we can't know how many relevant records were
>> discarded due to conflicting records (and so cannot safely stitch
>> records), and users have to expect that they may get fewer records than
>> may exist in HW anyway.
> 
> So if we implement merging branch filters requests and SW filtering for the
> captured branch records, we should also drop saving and stitching mechanism
> completely ?
> 
> Coming back to the implementation for option (2), the following code change
> (tentative and untested) will merge branch filter requests and drop the event
> filter check during PMU interrupt.
> 
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index 45ac2d0ca04c..9afa4e48d957 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
>                 armv8pmu_branch_stack_reset();
>         }
>         hw_events->branch_users++;
> -       hw_events->branch_sample_type = event->attr.branch_sample_type;
> +
> +       /*
> +        * Merge all branch filter requests from different perf
> +        * events being added into this PMU. This includes both
> +        * privilege and branch type filters.
> +        */
> +       hw_events->branch_sample_type |= event->attr.branch_sample_type;
>  }
>  
>  void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 6137ae4ba7c3..c5311d5365cc 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 return;
>  
>         /*
> -        * Overflowed event's branch_sample_type does not match the configured
> -        * branch filters in the BRBE HW. So the captured branch records here
> -        * cannot be co-related to the overflowed event. Report to the user as
> -        * if no branch records have been captured, and flush branch records.
> -        * The same scenario is applicable when the current task context does
> -        * not match with overflown event.
> +        * When the current task context does not match with the PMU overflown
> +        * event, the captured branch records here cannot be co-related to the
> +        * overflowed event. Report to the user as if no branch records have
> +        * been captured, and flush branch records.
>          */
> -       if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
> -           (event->ctx->task && cpuc->branch_context != event->ctx))
> +       if (event->ctx->task && cpuc->branch_context != event->ctx)
>                 return;
>  
>         /*
> 
> and something like the following change (tentative and untested) implements the
> required SW branch records filtering.
> 
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c5311d5365cc..d2390657c466 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
>         armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
>  }
>  
> +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +       u64 mask = PERF_BR_UNCOND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY)
> +               return true;
> +
> +       if (entry->type == PERF_BR_UNKNOWN)
> +               return true;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> +               mask |= PERF_BR_IND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_COND) {
> +               mask &= ~PERF_BR_UNCOND;
> +               mask |= PERF_BR_COND;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_CALL)
> +               mask |= PERF_BR_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> +               mask |= PERF_BR_IND_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +               mask |= PERF_BR_CALL;
> +               mask |= PERF_BR_IRQ;
> +               mask |= PERF_BR_SYSCALL;
> +               mask |= PERF_BR_SERROR;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_CALL;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +               mask |= PERF_BR_RET;
> +               mask |= PERF_BR_ERET;
> +               mask |= PERF_BR_SYSRET;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_RET;
> +       }
> +       return mask & entry->type;
> +}

You can build this mask once for each event because the options don't
change. At the moment it's built for each record which seems excessive.
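
Something like the below (completely untested sketch; it assumes a BIT()
per PERF_BR_* type encoding for the mask and some per-event storage for
caching it, neither of which is in the posted series) would be one way to
do it, with the mask built once when the event is set up:

	/*
	 * Build the branch type mask once from attr.branch_sample_type,
	 * which cannot change after the event has been created. Each
	 * PERF_BR_* type gets its own bit via BIT().
	 */
	static u64 branch_type_mask(u64 br_type)
	{
		u64 mask = BIT(PERF_BR_UNCOND);

		if (br_type & PERF_SAMPLE_BRANCH_ANY)
			return ~0ULL;

		if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
			mask |= BIT(PERF_BR_IND);

		if (br_type & PERF_SAMPLE_BRANCH_COND) {
			mask &= ~BIT(PERF_BR_UNCOND);
			mask |= BIT(PERF_BR_COND);
		}

		/* ... remaining PERF_SAMPLE_BRANCH_* cases as above ... */

		return mask;
	}

	/* The per record check then reduces to a single AND. */
	static bool filter_branch_type(u64 type_mask, struct perf_branch_entry *entry)
	{
		if (entry->type == PERF_BR_UNKNOWN)
			return true;

		return type_mask & BIT(entry->type);
	}

The cached mask could live wherever the per-event BRBE state ends up, and
filter_branch_records() would then pass it down instead of the event.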

> +
> +static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
> +               if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
> +               if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
> +               if (is_kernel_in_hyp_mode()) {
> +                       if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                               return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
> +                                 struct branch_records *event_records)
> +{
> +       struct perf_branch_entry *entry;
> +       int idx, count;
> +
> +       memset(event_records, 0, sizeof(struct branch_records));
> +       for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
> +               entry = &cpuc->branches->branch_entries[idx];
> +               if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
> +                       continue;
> +
> +               memcpy(&event_records->branch_entries[count], entry, sizeof(struct perf_branch_entry));
> +               count++;
> +       }
> +       event_records->branch_stack.nr = count;
> +}
> +
>  static void read_branch_records(struct pmu_hw_events *cpuc,
>                                 struct perf_event *event,
>                                 struct perf_sample_data *data,
>                                 bool *branch_captured)
>  {
> +       struct branch_records event_records;
> +
>         /*
>          * CPU specific branch records buffer must have been allocated already
>          * for the hardware records to be captured and processed further.
> @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 armv8pmu_branch_read(cpuc, event);
>                 *branch_captured = true;
>         }
> +
> +       /*
> +        * Filter captured branch records
> +        *
> +        * The PMU captured branch records contain samples matching the
> +        * aggregated branch filters of all events that got scheduled on
> +        * this PMU. Hence the branch records need to be filtered first so
> +        * that each individual event only gets the samples it requested.
> +        */
> +       if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
> +               filter_branch_records(event, cpuc, &event_records);
> +               perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
> +               return;
> +       }
>         perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
>  }
>  
>