diff mbox series

[v5,2/5] perf: Add SNOOP_PEER flag to perf mem data struct

Message ID 20220408195344.32764-3-alisaidi@amazon.com (mailing list archive)
State New, archived
Series perf: arm-spe: Decode SPE source and use for perf c2c

Commit Message

Ali Saidi April 8, 2022, 7:53 p.m. UTC
Add a flag to the perf mem data struct to signal that a request caused a
cache-to-cache transfer of a line from a peer of the requestor and
wasn't sourced from a lower cache level.  The line being moved from one
peer cache to another has latency and performance implications. On Arm64
Neoverse systems the data source can indicate a cache-to-cache transfer
but not if the line is dirty or clean, so instead of overloading HITM
define a new flag that indicates this type of transfer.
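For context, a minimal consumer-side sketch (not part of the patch) of how a tool reading PERF_SAMPLE_DATA_SRC could compose and test the new bit. The constants match the UAPI header as modified here, the PERF_MEM_S() macro mirrors the kernel's composition helper, and the data_src_is_peer() helper name is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Constants as in include/uapi/linux/perf_event.h after this patch */
#define PERF_MEM_SNOOPX_FWD    0x01 /* forward */
#define PERF_MEM_SNOOPX_PEER   0x02 /* xfer from peer */
#define PERF_MEM_SNOOPX_SHIFT  38

/* Mirrors the kernel's PERF_MEM_S() composition macro */
#define PERF_MEM_S(a, s) \
	(((uint64_t)(PERF_MEM_##a##_##s)) << PERF_MEM_##a##_SHIFT)

/* Hypothetical consumer-side check for a peer cache-to-cache transfer */
static inline int data_src_is_peer(uint64_t data_src)
{
	return ((data_src >> PERF_MEM_SNOOPX_SHIFT) & PERF_MEM_SNOOPX_PEER) != 0;
}
```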

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 include/uapi/linux/perf_event.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Leo Yan April 20, 2022, 8:20 a.m. UTC | #1
On Fri, Apr 08, 2022 at 07:53:41PM +0000, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.  The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>

The patch looks good to me:
Reviewed-by: Leo Yan <leo.yan@linaro.org>

Since this is a common flag, it would be better if the x86 or PowerPC
maintainers could take a look at this new snooping type.  Thanks!

Leo

> ---
>  include/uapi/linux/perf_event.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
>  #define PERF_MEM_SNOOP_SHIFT	19
>  
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
>  #define PERF_MEM_SNOOPX_SHIFT  38
>  
>  /* locked instruction */
> -- 
> 2.32.0
>
Liang, Kan April 20, 2022, 6:43 p.m. UTC | #2
On 4/8/2022 3:53 PM, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.

It sounds similar to the Forward state. Why can't the 
PERF_MEM_SNOOPX_FWD be reused?

Thanks,
Kan

> The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>   include/uapi/linux/perf_event.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
>   #define PERF_MEM_SNOOP_SHIFT	19
>   
>   #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
>   #define PERF_MEM_SNOOPX_SHIFT  38
>   
>   /* locked instruction */
Ali Saidi April 22, 2022, 6:49 p.m. UTC | #3
On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > Add a flag to the perf mem data struct to signal that a request caused a
> > cache-to-cache transfer of a line from a peer of the requestor and
> > wasn't sourced from a lower cache level.
> 
> It sounds similar to the Forward state. Why can't the 
> PERF_MEM_SNOOPX_FWD be reused?

Is there a definition of SNOOPX_FWD I can refer to? Happy to use it instead if
the semantics align between architectures.

Thanks,

Ali
Liang, Kan April 22, 2022, 9:08 p.m. UTC | #4
On 4/22/2022 2:49 PM, Ali Saidi wrote:
> On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
>> On 4/8/2022 3:53 PM, Ali Saidi wrote:
>>> Add a flag to the perf mem data struct to signal that a request caused a
>>> cache-to-cache transfer of a line from a peer of the requestor and
>>> wasn't sourced from a lower cache level.
>>
>> It sounds similar to the Forward state. Why can't the
>> PERF_MEM_SNOOPX_FWD be reused?
> 
> Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> the semantics align between architectures.
> 

+ Andi

As I understand it, SNOOPX_FWD means the Forward state, which is a 
non-modified (clean) cache-to-cache copy.
https://en.wikipedia.org/wiki/MESIF_protocol

Thanks,
Kan
Ali Saidi April 22, 2022, 9:22 p.m. UTC | #5
On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> >>> Add a flag to the perf mem data struct to signal that a request caused a
> >>> cache-to-cache transfer of a line from a peer of the requestor and
> >>> wasn't sourced from a lower cache level.
> >>
> >> It sounds similar to the Forward state. Why can't the
> >> PERF_MEM_SNOOPX_FWD be reused?
> > 
> > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > the semantics align between architectures.
> > 
> 
> + Andi
> 
> As my understanding, the SNOOPX_FWD means the Forward state, which is a 
> non-modified (clean) cache-to-cache copy.
> https://en.wikipedia.org/wiki/MESIF_protocol
  
In this case the semantics are different. We know the line was transferred from
another peer cache, but don't know if it was clean, dirty, or if the receiving core
now has exclusive ownership of it.

Thanks,

Ali
Leo Yan April 23, 2022, 6:38 a.m. UTC | #6
On Fri, Apr 22, 2022 at 09:22:49PM +0000, Ali Saidi wrote:
> 
> On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> > On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> > >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > >>> Add a flag to the perf mem data struct to signal that a request caused a
> > >>> cache-to-cache transfer of a line from a peer of the requestor and
> > >>> wasn't sourced from a lower cache level.
> > >>
> > >> It sounds similar to the Forward state. Why can't the
> > >> PERF_MEM_SNOOPX_FWD be reused?
> > > 
> > > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > > the semantics align between architectures.
> > > 
> > 
> > + Andi
> > 
> > As my understanding, the SNOOPX_FWD means the Forward state, which is a 
> > non-modified (clean) cache-to-cache copy.
> > https://en.wikipedia.org/wiki/MESIF_protocol
>   
> In this case the semantics are different. We know the line was transferred from
> another peer cache, but don't know if it was clean, dirty, or if the receiving core
> now has exclusive ownership of it.

In the spec "Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B: System Programming Guide, Part 2", section "18.8.1.3 Off-core
Response Performance Monitoring in the Processor Core", it defines the
REMOTE_CACHE_FWD as:

"L3 Miss: local homed requests that missed the L3 cache and was serviced
by forwarded data following a cross package snoop where no modified copies
found. (Remote home requests are not counted)".

Besides SNOOPX_FWD meaning a non-modified (clean) cache snoop, it also
implies cache coherency traffic from a *remote* socket.  This is quite
different from how we define SNOOPX_PEER, which only snoops from a peer
CPU or cluster.

If there is no objection, I would prefer to keep the new snoop type
SNOOPX_PEER; this makes it easier to distinguish the semantics and to
support statistics for SNOOPX_FWD and SNOOPX_PEER separately.

I had overlooked the SNOOPX_FWD flag; thanks a lot to Kan for the reminder.

Thanks,
Leo
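Leo's point about separate statistics can be illustrated with a small sketch (illustrative only; the struct and function names below are not from the patch set): keeping PEER distinct from FWD lets a tool bucket the two transfer kinds independently.

```c
#include <assert.h>
#include <stdint.h>

#define PERF_MEM_SNOOPX_FWD    0x01
#define PERF_MEM_SNOOPX_PEER   0x02
#define PERF_MEM_SNOOPX_SHIFT  38

struct snoop_stats {
	uint64_t fwd;	/* clean cache-to-cache forwards */
	uint64_t peer;	/* peer transfers of unknown clean/dirty state */
};

/* Tally one sample's data_src into per-flag buckets */
static void snoop_stats_update(struct snoop_stats *st, uint64_t data_src)
{
	uint64_t snoopx = (data_src >> PERF_MEM_SNOOPX_SHIFT) & 0x3;

	if (snoopx & PERF_MEM_SNOOPX_FWD)
		st->fwd++;
	if (snoopx & PERF_MEM_SNOOPX_PEER)
		st->peer++;
}
```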
Andi Kleen April 23, 2022, 12:53 p.m. UTC | #7
> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> a cache conherency from *remote* socket.  This is quite different from we
> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>
> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> this would be easier for us to distinguish the semantics and support the
> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>
> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.

Yes, it seems better to keep using a separate flag if they don't exactly match.

It's not as if we're short on flags anyway.

-Andi
Leo Yan April 24, 2022, 11:43 a.m. UTC | #8
On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
> 
> > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > a cache conherency from *remote* socket.  This is quite different from we
> > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> > 
> > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > this would be easier for us to distinguish the semantics and support the
> > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> > 
> > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
> 
> Yes seems better to keep using a separate flag if they don't exactly match.
> 
> It's not that we're short on flags anyways.

Thanks for confirmation.

Leo
Liang, Kan April 25, 2022, 5:01 p.m. UTC | #9
On 4/24/2022 7:43 AM, Leo Yan wrote:
> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>
>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>> a cache conherency from *remote* socket.  This is quite different from we
>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>

The FWD doesn't have to be *remote*. The definition you quoted is just 
for the "L3 Miss", which is indeed a remote forward. But we still have 
cross-core FWD. See Table 19-101.

Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to 
indicate the remote FWD, not just SNOOPX_FWD.
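Kan's description of the x86 encoding could be sketched as follows (constants taken from the UAPI header; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Constants from include/uapi/linux/perf_event.h */
#define PERF_MEM_REMOTE_REMOTE 0x01 /* Remote */
#define PERF_MEM_REMOTE_SHIFT  37
#define PERF_MEM_SNOOPX_FWD    0x01 /* forward */
#define PERF_MEM_SNOOPX_SHIFT  38

/* x86 remote FWD = REMOTE and FWD bits both set, per the description above */
static int is_remote_fwd(uint64_t data_src)
{
	int remote = (data_src >> PERF_MEM_REMOTE_SHIFT) & PERF_MEM_REMOTE_REMOTE;
	int fwd    = (data_src >> PERF_MEM_SNOOPX_SHIFT) & PERF_MEM_SNOOPX_FWD;

	return remote && fwd;
}
```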

>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>> this would be easier for us to distinguish the semantics and support the
>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>
>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>
>> Yes seems better to keep using a separate flag if they don't exactly match.
>>

Yes, I agree with Andi. If you still think the existing flag combination 
doesn't match your requirement, a new separate flag should be 
introduced. I'm not familiar with ARM. I think I will leave it to you 
and the maintainer to decide.

Thanks,
Kan
Leo Yan April 27, 2022, 4:19 p.m. UTC | #10
Hi Kan,

On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
> 
> 
> On 4/24/2022 7:43 AM, Leo Yan wrote:
> > On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
> > > 
> > > > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > > > a cache conherency from *remote* socket.  This is quite different from we
> > > > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> > > > 
> 
> The FWD doesn't have to be *remote*. The definition you quoted is just for
> the "L3 Miss", which is indeed a remote forward. But we still have
> cross-core FWD. See Table 19-101.
> 
> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
> indicate the remote FWD, not just SNOOPX_FWD.

Thanks a lot for the info.

> > > > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > > > this would be easier for us to distinguish the semantics and support the
> > > > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> > > > 
> > > > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
> > > 
> > > Yes seems better to keep using a separate flag if they don't exactly match.
> > > 
> 
> Yes, I agree with Andi. If you still think the existing flag combination
> doesn't match your requirement, a new separate flag should be introduced.
> I'm not familiar with ARM. I think I will leave it to you and the maintainer
> to decide.

It's a bit difficult for me to make a decision because SNOOPX_FWD is
not currently used in the file util/mem-events.c, so I am not very sure
whether SNOOPX_FWD has consistent usage across different arches.

On the other hand, I sent a patch for 'peer' flag statistics [1]; you
could review it.  It only gathers statistics for the L2 and L3 cache
levels on the local node.

The main purpose of this email is: if you think FWD can be made
consistent for both arches, and the newly added display mode is also
useful for x86 (we could rename it as a 'fwd' display mode), then I am
very glad to unify the flags.

Thanks,
Leo

[1] https://lore.kernel.org/lkml/20220427155013.1833222-5-leo.yan@linaro.org/
Liang, Kan April 27, 2022, 7:29 p.m. UTC | #11
On 4/27/2022 12:19 PM, Leo Yan wrote:
> Hi Kan,
> 
> On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
>>
>>
>> On 4/24/2022 7:43 AM, Leo Yan wrote:
>>> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>>>
>>>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>>>> a cache conherency from *remote* socket.  This is quite different from we
>>>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>>>
>>
>> The FWD doesn't have to be *remote*. The definition you quoted is just for
>> the "L3 Miss", which is indeed a remote forward. But we still have
>> cross-core FWD. See Table 19-101.
>>
>> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
>> indicate the remote FWD, not just SNOOPX_FWD.
> 
> Thanks a lot for the info.
> 
>>>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>>>> this would be easier for us to distinguish the semantics and support the
>>>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>>>
>>>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>>>
>>>> Yes seems better to keep using a separate flag if they don't exactly match.
>>>>
>>
>> Yes, I agree with Andi. If you still think the existing flag combination
>> doesn't match your requirement, a new separate flag should be introduced.
>> I'm not familiar with ARM. I think I will leave it to you and the maintainer
>> to decide.
> 
> It's a bit difficult for me to make decision is because now SNOOPX_FWD
> is not used in the file util/mem-events.c, so I am not very sure if
> SNOOPX_FWD has the consistent usage across different arches.

No, it's used in the file util/mem-events.c
See perf_mem__snp_scnprintf().

> 
> On the other hand, I sent a patch for 'peer' flag statistics [1], you
> could review it and it only stats for L2 and L3 cache level for local
> node.

If it's for the local node, why don't you use the hop level which was 
introduced recently by Power? The below seems a good fit.

PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?

/* hop level */
#define PERF_MEM_HOPS_0		0x01 /* remote core, same node */
#define PERF_MEM_HOPS_1		0x02 /* remote node, same socket */
#define PERF_MEM_HOPS_2		0x03 /* remote socket, same board */
#define PERF_MEM_HOPS_3		0x04 /* remote board */
/* 5-7 available */
#define PERF_MEM_HOPS_SHIFT	43
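The combination suggested above could be encoded like this (a sketch only; PERF_MEM_LVLNUM_ANY_CACHE and its shift are taken from the UAPI header, and the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Constants from include/uapi/linux/perf_event.h */
#define PERF_MEM_LVLNUM_ANY_CACHE 0x0b /* any cache */
#define PERF_MEM_LVLNUM_SHIFT     33
#define PERF_MEM_HOPS_0           0x01 /* remote core, same node */
#define PERF_MEM_HOPS_SHIFT       43

/* "any cache level, zero hops away" encoding of a local peer transfer */
static uint64_t encode_any_cache_hops0(void)
{
	return ((uint64_t)PERF_MEM_LVLNUM_ANY_CACHE << PERF_MEM_LVLNUM_SHIFT) |
	       ((uint64_t)PERF_MEM_HOPS_0 << PERF_MEM_HOPS_SHIFT);
}
```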

Thanks,
Kan

> 
> The main purpose for my sending this email is if you think the FWD can
> be the consistent for both arches, and even the new added display mode
> is also useful for x86 arch (we can rename it as 'fwd' display mode),
> then I am very glad to unify the flag.
> 
> Thanks,
> Leo
> 
> [1] https://lore.kernel.org/lkml/20220427155013.1833222-5-leo.yan@linaro.org/
Leo Yan April 29, 2022, 9:28 a.m. UTC | #12
On Wed, Apr 27, 2022 at 03:29:31PM -0400, Liang, Kan wrote:

[...]

> > It's a bit difficult for me to make decision is because now SNOOPX_FWD
> > is not used in the file util/mem-events.c, so I am not very sure if
> > SNOOPX_FWD has the consistent usage across different arches.
> 
> No, it's used in the file util/mem-events.c
> See perf_mem__snp_scnprintf().

Right.  Actually, I meant the FWD flag is not used for statistics.

> > On the other hand, I sent a patch for 'peer' flag statistics [1], you
> > could review it and it only stats for L2 and L3 cache level for local
> > node.
> 
> If it's for the local node, why don't you use the hop level which is
> introduced recently by Power? The below seems a good fit.
> 
> PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?
> 
> /* hop level */
> #define PERF_MEM_HOPS_0		0x01 /* remote core, same node */
> #define PERF_MEM_HOPS_1		0x02 /* remote node, same socket */
> #define PERF_MEM_HOPS_2		0x03 /* remote socket, same board */
> #define PERF_MEM_HOPS_3		0x04 /* remote board */
> /* 5-7 available */
> #define PERF_MEM_HOPS_SHIFT	43

Thanks for the reminder.  I did consider the HOPS flags during the
discussion with Ali.  As you can see, PERF_MEM_HOPS_0 is for "remote
core, same node", which could introduce confusion for Arm Neoverse CPUs.
Another consideration is how we consume the flags in the perf c2c tool:
perf c2c uses the snoop flags to find the costly cache coherency
operations, so if we used PERF_MEM_HOPS_0 we would need to extend perf
c2c to support two kinds of flags (snoop flags and hop flags), which
would add complexity to the perf c2c code.

If we step back and review the current flags, we can see that different
arches have different memory models (and implementations), and it is a
bit painful when we try to unify the flags.  So at the current stage, I
prefer to use the PEER flag for Arm arches; essentially it's not too bad
that we introduce one bit now, and later we may consider consolidating
the memory flags in a more general way.

Thanks,
Leo

Patch

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..c9e58c79f3e5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1308,7 +1308,7 @@  union perf_mem_data_src {
 #define PERF_MEM_SNOOP_SHIFT	19
 
 #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
-/* 1 free */
+#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
 #define PERF_MEM_SNOOPX_SHIFT  38
 
 /* locked instruction */