
[resend] netlink.7: note not reliable if NETLINK_NO_ENOBUFS

Message ID 20210304205728.34477-1-aahringo@redhat.com (mailing list archive)
State Not Applicable
Series [resend] netlink.7: note not reliable if NETLINK_NO_ENOBUFS

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Alexander Aring March 4, 2021, 8:57 p.m. UTC
This patch adds a note to the netlink manpage that if NETLINK_NO_ENOBUFS
is set, there is no additional handling to make netlink reliable; it just
disables the error notification. The used word "avoid" receiving ENOBUFS
errors can be interpreted to mean that netlink tries to do some
additional queue handling to avoid that such a scenario occurs at all,
e.g. like zerocopy, which tries to avoid memory copies. However,
"disable" is not the right word here either, since in some cases ENOBUFS
can still be received. This patch makes clear that there will be no
additional handling to put netlink in a more reliable mode.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
---
resend:
 - forgot linux-man mailinglist in cc, sorry.

 man7/netlink.7 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Pablo Neira Ayuso March 5, 2021, 3:04 a.m. UTC | #1
Hi Alexander,

On Thu, Mar 04, 2021 at 03:57:28PM -0500, Alexander Aring wrote:
> This patch adds a note to the netlink manpage that if NETLINK_NO_ENOBUFS
> is set there is no additional handling to make netlink reliable. It just
> disables the error notification.

A bit more background on this toggle.

NETLINK_NO_ENOBUFS also disables netlink broadcast congestion control
which kicks in when the socket buffer gets full. The existing
congestion control algorithm keeps dropping netlink event messages
until the queue is emptied. Note that it might take a while until your
userspace process fully empties the socket queue that is congested
(and during that time _your process is losing every netlink event_).

The usual approach when your process hits ENOBUFS is to resync via
NLM_F_DUMP unicast request. However, getting back to sync with the
kernel subsystem might be expensive if the number of items that are
exposed via netlink is huge.

Note that some people select a very large socket buffer for netlink
sockets when they notice ENOBUFS. This might, however, make things
worse because, as I said, congestion control drops every netlink
message until the queue is emptied. Selecting a large socket buffer
might help to postpone the ENOBUFS error, but once your process hits
ENOBUFS, the netlink congestion control kicks in and will make you
lose a lot of event messages (until the queue is empty again!).

So NETLINK_NO_ENOBUFS from userspace makes sense if:

1) You are subscribed to a netlink broadcast group (so it does _not_
   make sense for unicast netlink sockets).
2) The kernel subsystem delivers the netlink messages you are
   subscribed to from atomic context (e.g. network packet path, if
   the netlink event is triggered by network packets, your process
   might get spammed with a lot of netlink messages in little time,
   depending on your network workload).
3) Your process does not want to resync on lost netlink messages.
   Your process assumes that events might get lost, but it does not
   care / it does not want to take any specific action in such a case.
4) You want to disable the netlink broadcast congestion control.

To provide an example kernel subsystem, this toggle can be useful with
the connection tracking system, when monitoring for new connection
events in a soft real-time fashion.

> The used word "avoid" receiving ENOBUFS errors can be interpreted
> to mean that netlink tries to do some additional queue handling to
> avoid that such a scenario occurs at all, e.g. like zerocopy, which
> tries to avoid memory copies. However, "disable" is not the right
> word here either, since in some cases ENOBUFS can still be
> received. This patch makes clear that there will be no additional
> handling to put netlink in a more reliable mode.

Right, the NETLINK_NO_ENOBUFS toggle by itself does not make netlink
more reliable for the broadcast scenario; it just changes the way
netlink broadcast deals with congestion: the userspace process gets
no reports on lost messages and netlink congestion control is
disabled.
Alexander Aring March 5, 2021, 7:43 p.m. UTC | #2
Hi Pablo,

I appreciate your very detailed response. Thank you.

On Thu, Mar 4, 2021 at 10:04 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> Hi Alexander,
>
> On Thu, Mar 04, 2021 at 03:57:28PM -0500, Alexander Aring wrote:
> > This patch adds a note to the netlink manpage that if NETLINK_NO_ENOBUFS
> > is set there is no additional handling to make netlink reliable. It just
> > disables the error notification.
>
> A bit more background on this toggle.
>
> NETLINK_NO_ENOBUFS also disables netlink broadcast congestion control
> which kicks in when the socket buffer gets full. The existing
> congestion control algorithm keeps dropping netlink event messages
> until the queue is emptied. Note that it might take a while until your
> userspace process fully empties the socket queue that is congested
> (and during that time _your process is losing every netlink event_).
>
> The usual approach when your process hits ENOBUFS is to resync via
> NLM_F_DUMP unicast request. However, getting back to sync with the
> kernel subsystem might be expensive if the number of items that are
> exposed via netlink is huge.
>
> Note that some people select a very large socket buffer for netlink
> sockets when they notice ENOBUFS. This might, however, make things
> worse because, as I said, congestion control drops every netlink
> message until the queue is emptied. Selecting a large socket buffer
> might help to postpone the ENOBUFS error, but once your process hits
> ENOBUFS, the netlink congestion control kicks in and will make you
> lose a lot of event messages (until the queue is empty again!).
>
> So NETLINK_NO_ENOBUFS from userspace makes sense if:
>
> 1) You are subscribed to a netlink broadcast group (so it does _not_
>    make sense for unicast netlink sockets).
> 2) The kernel subsystem delivers the netlink messages you are
>    subscribed to from atomic context (e.g. network packet path, if
>    the netlink event is triggered by network packets, your process
>    might get spammed with a lot of netlink messages in little time,
>    depending on your network workload).
> 3) Your process does not want to resync on lost netlink messages.
>    Your process assumes that events might get lost, but it does not
>    care / it does not want to take any specific action in such a case.
> 4) You want to disable the netlink broadcast congestion control.
>
> To provide an example kernel subsystem, this toggle can be useful with
> the connection tracking system, when monitoring for new connection
> events in a soft real-time fashion.
>

Can we just copy-paste your list above and the connection tracking
example into the netlink manpage? I think it's good to have a
checklist like that to see whether this option fits.

> > The used word "avoid" receiving ENOBUFS errors can be interpreted
> > to mean that netlink tries to do some additional queue handling to
> > avoid that such a scenario occurs at all, e.g. like zerocopy, which
> > tries to avoid memory copies. However, "disable" is not the right
> > word here either, since in some cases ENOBUFS can still be
> > received. This patch makes clear that there will be no additional
> > handling to put netlink in a more reliable mode.
>
> > Right, the NETLINK_NO_ENOBUFS toggle by itself does not make netlink
> > more reliable for the broadcast scenario; it just changes the way
> > netlink broadcast deals with congestion: the userspace process gets
> > no reports on lost messages and netlink congestion control is
> > disabled.
>

Just out of curiosity:

If I understand correctly, the connection tracking netlink interface
is an exception here because it has its own way of dealing with
congestion ("more reliable"?), so you need to disable the "default
congestion control"?
Does connection tracking always use its own congestion algorithm, so
is it recommended to turn NETLINK_NO_ENOBUFS on when using it?

Thanks.

- Alex
Pablo Neira Ayuso March 5, 2021, 8:36 p.m. UTC | #3
On Fri, Mar 05, 2021 at 02:43:05PM -0500, Alexander Ahring Oder Aring wrote:
> Hi Pablo,
> 
> I appreciate your very detailed response. Thank you.
> 
> On Thu, Mar 4, 2021 at 10:04 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > Hi Alexander,
> >
> > On Thu, Mar 04, 2021 at 03:57:28PM -0500, Alexander Aring wrote:
> > > This patch adds a note to the netlink manpage that if NETLINK_NO_ENOBUFS
> > > is set there is no additional handling to make netlink reliable. It just
> > > disables the error notification.
> >
> > A bit more background on this toggle.
> >
> > NETLINK_NO_ENOBUFS also disables netlink broadcast congestion control
> > which kicks in when the socket buffer gets full. The existing
> > congestion control algorithm keeps dropping netlink event messages
> > until the queue is emptied. Note that it might take a while until your
> > userspace process fully empties the socket queue that is congested
> > (and during that time _your process is losing every netlink event_).
> >
> > The usual approach when your process hits ENOBUFS is to resync via
> > NLM_F_DUMP unicast request. However, getting back to sync with the
> > kernel subsystem might be expensive if the number of items that are
> > exposed via netlink is huge.
> >
> > Note that some people select a very large socket buffer for netlink
> > sockets when they notice ENOBUFS. This might, however, make things
> > worse because, as I said, congestion control drops every netlink
> > message until the queue is emptied. Selecting a large socket buffer
> > might help to postpone the ENOBUFS error, but once your process hits
> > ENOBUFS, the netlink congestion control kicks in and will make you
> > lose a lot of event messages (until the queue is empty again!).
> >
> > So NETLINK_NO_ENOBUFS from userspace makes sense if:
> >
> > 1) You are subscribed to a netlink broadcast group (so it does _not_
> >    make sense for unicast netlink sockets).
> > 2) The kernel subsystem delivers the netlink messages you are
> >    subscribed to from atomic context (e.g. network packet path, if
> >    the netlink event is triggered by network packets, your process
> >    might get spammed with a lot of netlink messages in little time,
> >    depending on your network workload).
> > 3) Your process does not want to resync on lost netlink messages.
> >    Your process assumes that events might get lost, but it does not
> >    care / it does not want to take any specific action in such a case.
> > 4) You want to disable the netlink broadcast congestion control.
> >
> > To provide an example kernel subsystem, this toggle can be useful with
> > the connection tracking system, when monitoring for new connection
> > events in a soft real-time fashion.
> >
> 
> Can we just copy paste your above list and the connection tracking
> example into the netlink manpage? I think it's good to have a
> checklist like that to see if this option fits.

You probably want to include information on how netlink congestion
control works. I don't think many people know how it works, or that
it kicks in when the userspace process hits ENOBUFS.

> > > The used word "avoid" receiving ENOBUFS errors can be interpreted
> > > to mean that netlink tries to do some additional queue handling to
> > > avoid that such a scenario occurs at all, e.g. like zerocopy, which
> > > tries to avoid memory copies. However, "disable" is not the right
> > > word here either, since in some cases ENOBUFS can still be
> > > received. This patch makes clear that there will be no additional
> > > handling to put netlink in a more reliable mode.
> >
> > Right, the NETLINK_NO_ENOBUFS toggle by itself does not make netlink
> > more reliable for the broadcast scenario; it just changes the way
> > netlink broadcast deals with congestion: the userspace process gets
> > no reports on lost messages and netlink congestion control is
> > disabled.
> >
> 
> Just out of curiosity:
> 
> If I understand correctly, the connection tracking netlink interface
> is an exception here because it has its own handling of dealing with
> congestion ("more reliable"?) so you need to disable the "default
> congestion control"?

In conntrack, you have to combine NETLINK_NO_ENOBUFS with
NETLINK_BROADCAST_ERROR; then the kernel turns on the "more
reliable" event delivery.

> Does connection tracking always use its own congestion algorithm, so
> is it recommended to turn NETLINK_NO_ENOBUFS on when using it?

It depends. If the user wants to know that events are lost, then the
default behaviour is good (ENOBUFS is reported to userspace). If the
user does not care about lost events, then disabling netlink
congestion control makes sense. As I said, disabling netlink
congestion control might help you avoid a large burst of lost events
when you hit ENOBUFS.
Florian Westphal March 5, 2021, 11:21 p.m. UTC | #4
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > If I understand correctly, the connection tracking netlink interface
> > is an exception here because it has its own handling of dealing with
> > congestion ("more reliable"?) so you need to disable the "default
> > congestion control"?
> 
> In conntrack, you have to combine NETLINK_NO_ENOBUFS with
> NETLINK_BROADCAST_ERROR; then the kernel turns on the "more
> reliable" event delivery.

The "more reliable" event delivery guarantees that the kernel will
deliver at least the DESTROY notification (connection close).

If the userspace program is stuck, the kernel has to hold on to the
expired entries.  Eventually conntrack stops accepting new connections
because the table is full.

So this feature can't be recommended as a best-practice for conntrack
either.
Pablo Neira Ayuso March 6, 2021, 12:10 a.m. UTC | #5
On Sat, Mar 06, 2021 at 12:21:59AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > If I understand correctly, the connection tracking netlink interface
> > > is an exception here because it has its own handling of dealing with
> > > congestion ("more reliable"?) so you need to disable the "default
> > > congestion control"?
> > 
> > In conntrack, you have to combine NETLINK_NO_ENOBUFS with
> > NETLINK_BROADCAST_ERROR; then the kernel turns on the "more
> > reliable" event delivery.
> 
> The "more reliable" event delivery guarantees that the kernel will
> deliver at least the DESTROY notification (connection close).
> 
> If the userspace program is stuck, the kernel has to hold on to the
> expired entries.  Eventually conntrack stops accepting new connections
> because the table is full.
> 
> So this feature can't be recommended as a best-practice for conntrack
> either.

There are two use-cases for this:

- If you run conntrackd and you really want to make sure your backup
  firewall does not get out of sync.

- If you run ulogd2 and you want to make sure your connection log is
  complete (no events get lost).

In both cases, this might come at the cost of dropping packets if the
table gets full. So it's placing the pressure on the conntrack side.
With the right policy you could restrict the number of connections per
second.

I agree though that combination of NETLINK_NO_ENOBUFS and
NETLINK_BROADCAST_ERROR only makes sense for very specific use-cases.

Patch

diff --git a/man7/netlink.7 b/man7/netlink.7
index c69bb62bf..2cb0d1a55 100644
--- a/man7/netlink.7
+++ b/man7/netlink.7
@@ -478,7 +478,7 @@  errors.
 .\"	Author: Pablo Neira Ayuso <pablo@netfilter.org>
 This flag can be used by unicast and broadcast listeners to avoid receiving
 .B ENOBUFS
-errors.
+errors. Note that this does not put netlink into any kind of more reliable mode.
 .TP
 .BR NETLINK_LISTEN_ALL_NSID " (since Linux 4.2)"
 .\"	commit 59324cf35aba5336b611074028777838a963d03b