| Message ID | 20240307151849.394962-1-amorenoz@redhat.com (mailing list archive) |
|---|---|
| Series | net: openvswitch: Add sample multicasting. |
On 3/7/24 16:18, Adrian Moreno wrote:

> ** Background **
> Currently, OVS supports several packet sampling mechanisms (sFlow,
> per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
> userspace action that needs to be handled by ovs-vswitchd's handler
> threads only to be forwarded to some third-party application that will
> somehow process the sample and provide observability on the datapath.
>
> The fact that sampled traffic shares netlink sockets and handler-thread
> time with upcalls, apart from being a performance bottleneck in the
> sample extraction itself, can severely compromise the datapath, making
> this solution unfit for highly loaded production systems.
>
> Users are left with few options other than guessing what sampling rate
> will be OK for their traffic pattern and system load and dealing with
> the lost accuracy.
>
> ** Proposal **
> In this RFC, I'd like to request feedback on an attempt to fix this
> situation by adding a flag to the userspace action to indicate the
> upcall should be sent to a netlink multicast group instead of unicast
> to ovs-vswitchd.
>
> This would allow other processes to read samples directly, freeing the
> netlink sockets and handler threads to process packet upcalls.
>
> ** Notes on tc-offloading **
> I am aware of the efforts being made to offload the sample action with
> the help of psample.
> I did consider using psample to multicast the samples. However, I found
> a limitation that I'd like to discuss:
> I would like to support OVN-driven per-flow (IPFIX) sampling because it
> allows OVN to insert two 32-bit values (obs_domain_id and obs_point_id)
> that can be used to enrich the sample with "high level" controller
> metadata (see the debug_drop_domain_id NBDB option in ovn-nb(5)).
>
> The existing fields in psample_metadata are not enough to carry this
> information. Would it be possible to extend this struct to make room
> for some extra "application-specific" metadata?
>
> ** Alternatives **
> An alternative approach that I'm considering (apart from using psample
> as explained above) is to use a brand-new action. This leads to a
> cleaner separation of concerns from the existing userspace action (used
> for slow paths and OFP_CONTROLLER actions) and cleaner statistics.
> Also, ovs-vswitchd could more easily make the layout of this new
> userdata part of the public API, allowing third-party sample collectors
> to decode it.
>
> I am currently exploring this alternative but wanted to send the RFC to
> get some early feedback, guidance or ideas.

Hi, Adrian. Thanks for the patches!

Though I'm not sure if broadcasting is generally the best approach. These
messages contain opaque information that is not actually parsable by any
entity other than the process that created the action. And I don't think
the structure of these opaque fields should become part of the uAPI,
either in the kernel or in OVS userspace.

The userspace() action already has an OVS_USERSPACE_ATTR_PID argument, but
it is not actually used when OVS_DP_F_DISPATCH_UPCALL_PER_CPU is enabled.
All known users of OVS_DP_F_DISPATCH_UPCALL_PER_CPU set
OVS_USERSPACE_ATTR_PID to UINT32_MAX, which is not a pid that the kernel
could generate.

So, with a minimal and pretty much backward-compatible change in the
output_userspace() function, we can honor OVS_USERSPACE_ATTR_PID if it's
not U32_MAX. This way a userspace process can open a separate socket and
configure sampling to redirect all samples there, while normal MISS
upcalls would still arrive at the per-CPU sockets.
This should cover the performance concern.

For the case without per-CPU dispatch, the feature comes for free if the
userspace application wants to use it. However, every currently supported
version of OVS uses per-CPU dispatch when it is available.

What do you think?

Best regards, Ilya Maximets.
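For readers following along, the minimal change described above would look
roughly like the sketch below, modeled on the OVS_USERSPACE_ATTR_PID handling
in the kernel's output_userspace() (net/openvswitch/actions.c). It is only an
illustration of the idea, not the actual patch; the helper name and the
simplified structure are invented for clarity.

```c
/* Illustrative sketch (not the actual kernel patch): choose the netlink
 * portid for a userspace() action.  An explicit OVS_USERSPACE_ATTR_PID other
 * than U32_MAX is honored even when OVS_DP_F_DISPATCH_UPCALL_PER_CPU is set,
 * so a dedicated sampling socket can receive these upcalls directly, while
 * U32_MAX keeps the existing per-CPU dispatch behavior. */
static u32 upcall_portid_for_userspace_action(const struct datapath *dp,
					      u32 attr_pid)
{
	if ((dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU) &&
	    attr_pid == U32_MAX)
		/* No explicit socket requested: keep per-CPU dispatch. */
		return ovs_dp_get_upcall_portid(dp, smp_processor_id());

	/* Explicit PID (e.g. a dedicated sampling socket): honor it. */
	return attr_pid;
}
```

With such a tweak, ovs-vswitchd (or any cooperating process) could put a
dedicated socket's portid into the sampling flows while leaving U32_MAX
everywhere else, so the existing per-CPU upcall behavior stays untouched.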
On 3/7/24 17:54, Ilya Maximets wrote:
> On 3/7/24 16:18, Adrian Moreno wrote:
>> [... original cover letter quoted in full, trimmed ...]
>
> Hi, Adrian. Thanks for the patches!

Thanks for the quick feedback.
Also adding Dumitru, whom I forgot to include in the original CC list.

> Though I'm not sure if broadcasting is generally the best approach. These
> messages contain opaque information that is not actually parsable by any
> entity other than the process that created the action. And I don't think
> the structure of these opaque fields should become part of the uAPI,
> either in the kernel or in OVS userspace.

I understand this can be cumbersome, especially given that the opaque field
is currently also used for some purely internal OVS actions (e.g., CONTROLLER).
However, for features such as OVN-driven per-flow sampling, where
OVN-generated identifiers are placed in obs_domain_id and obs_point_id, it
would be _really_ useful if this opaque value could be somehow decoded by
external programs.

Two ideas come to mind to try to alleviate the potential maintainability
issues:

- As I suggested, using a new action might make things easier. By splitting
  the current "user_action_cookie" in two, one for internal actions and one
  for "observability" actions, we could expose the latter in the OVS
  userspace API without having to expose the former.
- Exposing functions in OVS that decode the opaque value. Third-party
  applications could link against, say, libopenvswitch.so and use it to
  extract obs_{domain,point}_ids.

What do you think?

> The userspace() action already has an OVS_USERSPACE_ATTR_PID argument, but
> it is not actually used when OVS_DP_F_DISPATCH_UPCALL_PER_CPU is enabled.
> All known users of OVS_DP_F_DISPATCH_UPCALL_PER_CPU set
> OVS_USERSPACE_ATTR_PID to UINT32_MAX, which is not a pid that the kernel
> could generate.
>
> So, with a minimal and pretty much backward-compatible change in the
> output_userspace() function, we can honor OVS_USERSPACE_ATTR_PID if it's
> not U32_MAX. This way a userspace process can open a separate socket and
> configure sampling to redirect all samples there, while normal MISS
> upcalls would still arrive at the per-CPU sockets. This should cover the
> performance concern.

Do you mean creating a new thread to process samples, or using the handlers?
The latter would still have a performance impact, and the former would
likely fail to process all samples in a timely manner if there are many.

Besides, the current userspace tc-offloading series uses netlink broadcast
with psample; can't we do the same for non-offloaded actions? It would
enable building external observability applications without overloading OVS.

> For the case without per-CPU dispatch, the feature comes for free if the
> userspace application wants to use it. However, every currently supported
> version of OVS uses per-CPU dispatch when it is available.
>
> What do you think?
>
> Best regards, Ilya Maximets.
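To make the first idea above more concrete, a split-out "observability"
cookie could look roughly like the sketch below. The obs_domain_id and
obs_point_id fields are the ones discussed in the thread; the struct name,
the remaining fields and their layout are purely illustrative assumptions,
not the current user_action_cookie definition in ovs-vswitchd.

```c
#include <stdint.h>

/* Hypothetical sketch of a decodable "observability" cookie, split out of
 * OVS's internal user_action_cookie as discussed above.  Only the
 * observability-related fields would get a public layout; internal actions
 * (slow path, controller) would keep their own private cookie.  Field names
 * follow the thread; sizes and ordering are assumptions. */
struct ovs_observability_cookie {
	uint8_t  version;          /* Layout version, to allow extension. */
	uint8_t  type;             /* E.g. flow sample vs. drop sample. */
	uint16_t reserved;
	uint32_t probability;      /* Sampling probability for this flow. */
	uint32_t collector_set_id;
	uint32_t obs_domain_id;    /* OVN/controller-provided domain id. */
	uint32_t obs_point_id;     /* OVN/controller-provided point id. */
};
```

The trade-off discussed below applies either way: once an external collector
can decode it, this layout is frozen and effectively becomes an API.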
On 3/7/24 21:59, Adrian Moreno wrote:
> [... earlier messages quoted in full, trimmed ...]
> However, for features such as OVN-driven per-flow sampling, where
> OVN-generated identifiers are placed in obs_domain_id and obs_point_id, it
> would be _really_ useful if this opaque value could be somehow decoded by
> external programs.
>
> Two ideas come to mind to try to alleviate the potential maintainability
> issues:
> - As I suggested, using a new action might make things easier. By splitting
>   the current "user_action_cookie" in two, one for internal actions and one
>   for "observability" actions, we could expose the latter in the OVS
>   userspace API without having to expose the former.
> - Exposing functions in OVS that decode the opaque value. Third-party
>   applications could link against, say, libopenvswitch.so and use it to
>   extract obs_{domain,point}_ids.

Linking with OVS libraries is practically the same as just exposing the
internal structure, because the external application must then run with the
same library version as the process that installs the action, or it may not
be able to parse the message.

Exposing them to an external application in any form will freeze the opaque
arguments and effectively make them a form of uAPI.

A separate action with a defined uAPI solves this problem by simply creating
a new uAPI, but I'm not sure why that is needed.

> What do you think?
>
>> [... the OVS_USERSPACE_ATTR_PID suggestion quoted in full, trimmed ...]
>
> Do you mean creating a new thread to process samples, or using the
> handlers? The latter would still have a performance impact, and the former
> would likely fail to process all samples in a timely manner if there are
> many.
>
> Besides, the current userspace tc-offloading series uses netlink broadcast
> with psample; can't we do the same for non-offloaded actions? It would
> enable building external observability applications without overloading OVS.

Creating a separate thread solves the performance issue. But you can also
write a separate application that would communicate its PID to the running
OVS daemon. Let's say the same application that configures sampling in the
OVS database could also write its PID there.

The thing is that the existence of an external application immediately
breaks the opacity of the arguments and forces us to define a uAPI. However,
if there is explicit communication between that application and the OVS
userspace daemon, then we can establish a contract (the structure of the
opaque values) between these two userspace applications without defining
that contract in the kernel uAPI. But if we're going with multicast, which
anyone can subscribe to, then we have to define that contract in the kernel
uAPI.

Also, in order for this observability to work with the userspace datapath
we'll have to implement userspace-to-userspace netlink multicast (does that
even exist?). Running the sample collection within OVS as a thread will be
much less painful.
One other thing worth mentioning is that the PID approach I suggested is
just a minor tweak of what is already supported in the kernel. It doesn't
prohibit the introduction of a new action or a multicast group in the
future, while a premature uAPI definition may end up as yet another action
that nobody uses. It can be added later if it ends up being actually
necessary.

Best regards, Ilya Maximets.

>> For the case without per-CPU dispatch, the feature comes for free if the
>> userspace application wants to use it. However, every currently supported
>> version of OVS uses per-CPU dispatch when it is available.
>>
>> What do you think?
>>
>> Best regards, Ilya Maximets.
On 3/7/24 22:29, Ilya Maximets wrote:
> [... earlier discussion quoted in full, trimmed ...]
> Running the sample collection within OVS as a thread will be much less
> painful.
>
> One other thing worth mentioning is that the PID approach I suggested is
> just a minor tweak of what is already supported in the kernel. It doesn't
> prohibit the introduction of a new action or a multicast group in the
> future, while a premature uAPI definition may end up as yet another action
> that nobody uses. It can be added later if it ends up being actually
> necessary.

Thinking more about this problem, it seems to make some sense to have a way
to ask OVS for sampling that multiple observers can subscribe to. A unicast
socket will not allow such functionality. However, I still don't think the
creation of a new multicast group for that purpose is justified. The kernel
already has a generic sampling mechanism (psample) with a multicast group
created specifically for a very similar purpose. So, instead of re-inventing
it, we can add a small modification to OVS's sampling action allowing it to
sample to psample instead of OVS's own unicast sockets. This can be achieved
by adding a new OVS_SAMPLE_ATTR that would tell the datapath to direct
packets to psample instead of executing the actions. Or by adding a new
OVS_USERSPACE_ATTR that would do the same thing but from a userspace()
action instead, i.e., direct packets to psample instead of OVS's own
sockets, copying OVS_PACKET_ATTR_USERDATA into PSAMPLE_ATTR_SAMPLE_GROUP.
Might be cleaner this way, not sure.

From the perspective of the OVS userspace daemon, this functionality can be
clearly exposed as a separate sampling mechanism alongside IPFIX, sFlow and
NetFlow.

I see you alluded to this approach in the original cover letter above, so
I'd vote for it instead. Hopefully the psample API can be extended to be
more flexible and allow larger userdata to be passed in. Maybe have a
SAMPLE_SUBGROUP in addition to the existing PSAMPLE_ATTR_SAMPLE_GROUP?
OTOH, it is not really IPFIX, it's a different interface, so it might have
different requirements. In any case, OVS may check that the userdata fits
into the psample arguments and reject flows that are not compatible.

> Best regards, Ilya Maximets.
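To illustrate what the psample route buys: any observer can subscribe to
psample's generic netlink "packets" multicast group without talking to
ovs-vswitchd at all. Below is a rough, self-contained collector sketch using
libnl-genl that only prints the sample group and original packet size of
each sample; error handling is mostly omitted, and the build command is an
assumption about the local setup.

```c
/* Minimal psample listener sketch.
 * Build (assumption): gcc psample_listen.c $(pkg-config --cflags --libs libnl-genl-3.0) */
#include <stdio.h>
#include <linux/psample.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

static int on_sample(struct nl_msg *msg, void *arg)
{
	struct nlattr *tb[PSAMPLE_ATTR_MAX + 1];
	struct genlmsghdr *gnlh = genlmsg_hdr(nlmsg_hdr(msg));

	/* Parse the psample attributes of one sampled packet. */
	if (nla_parse(tb, PSAMPLE_ATTR_MAX, genlmsg_attrdata(gnlh, 0),
		      genlmsg_attrlen(gnlh, 0), NULL))
		return NL_SKIP;

	if (tb[PSAMPLE_ATTR_SAMPLE_GROUP])
		printf("sample: group=%u origsize=%u\n",
		       nla_get_u32(tb[PSAMPLE_ATTR_SAMPLE_GROUP]),
		       tb[PSAMPLE_ATTR_ORIGSIZE] ?
		       nla_get_u32(tb[PSAMPLE_ATTR_ORIGSIZE]) : 0);
	return NL_OK;
}

int main(void)
{
	struct nl_sock *sk = nl_socket_alloc();
	int grp;

	if (!sk)
		return 1;
	genl_connect(sk);
	/* "psample"/"packets" are the family and multicast group names
	 * defined by the psample module. */
	grp = genl_ctrl_resolve_grp(sk, "psample", "packets");
	if (grp < 0) {
		fprintf(stderr, "psample not available\n");
		return 1;
	}
	nl_socket_disable_seq_check(sk);
	nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, on_sample, NULL);
	nl_socket_add_membership(sk, grp);

	while (1)
		nl_recvmsgs_default(sk);  /* Blocks; one callback per sample. */
}
```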
On 3/8/24 15:24, Ilya Maximets wrote:
> [... earlier discussion quoted in full, trimmed ...]
> Thinking more about this problem, it seems to make some sense to have a way
> to ask OVS for sampling that multiple observers can subscribe to.
> [...]
> In any case, OVS may check that the userdata fits into the psample
> arguments and reject flows that are not compatible.

Thanks for your feedback, Ilya. I'll send an RFC v2 with the proposed
alternative.