Message ID | 20240604135438.2613064-1-nicolas.dichtel@6wind.com (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [nf] netfilter: restore default behavior for nf_conntrack_events | expand |
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: > Since the below commit, there are regressions for legacy setups: > 1/ conntracks are created while there are no listener > 2/ a listener starts and dumps all conntracks to get the current state > 3/ conntracks deleted before the listener has started are not advertised > > This is problematic in containers, where conntracks could be created early. > This sysctl is part of unsafe sysctl and could not be changed easily in > some environments. > > Let's switch back to the legacy behavior. :-( Would it be possible to resolve this for containers by setting the container default to 1 if init_net had it changed to 1 at netns creation time?
Le 05/06/2024 à 10:55, Florian Westphal a écrit : > Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: >> Since the below commit, there are regressions for legacy setups: >> 1/ conntracks are created while there are no listener >> 2/ a listener starts and dumps all conntracks to get the current state >> 3/ conntracks deleted before the listener has started are not advertised >> >> This is problematic in containers, where conntracks could be created early. >> This sysctl is part of unsafe sysctl and could not be changed easily in >> some environments. >> >> Let's switch back to the legacy behavior. > > :-( > > Would it be possible to resolve this for containers by setting > the container default to 1 if init_net had it changed to 1 at netns > creation time? When we have access to the host, it is possible to allow the configuration of this (unsafe) sysctl for the pod. But there are cases where we don't have access to the host. https://docs.openshift.com/container-platform/4.9/nodes/containers/nodes-containers-sysctls.html#nodes-containers-sysctls-unsafe_nodes-containers-using Regards, Nicolas
On Wed, Jun 05, 2024 at 11:09:31AM +0200, Nicolas Dichtel wrote: > Le 05/06/2024 à 10:55, Florian Westphal a écrit : > > Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: > >> Since the below commit, there are regressions for legacy setups: > >> 1/ conntracks are created while there are no listener > >> 2/ a listener starts and dumps all conntracks to get the current state > >> 3/ conntracks deleted before the listener has started are not advertised > >> > >> This is problematic in containers, where conntracks could be created early. > >> This sysctl is part of unsafe sysctl and could not be changed easily in > >> some environments. > >> > >> Let's switch back to the legacy behavior. > > > > :-( > > > > Would it be possible to resolve this for containers by setting > > the container default to 1 if init_net had it changed to 1 at netns > > creation time? > > When we have access to the host, it is possible to allow the configuration of > this (unsafe) sysctl for the pod. But there are cases where we don't have access > to the host. > > https://docs.openshift.com/container-platform/4.9/nodes/containers/nodes-containers-sysctls.html#nodes-containers-sysctls-unsafe_nodes-containers-using conntrack is enabled on-demand by the ruleset these days, such monitor process could be created _before_ the ruleset is loaded?
Le 05/06/2024 à 20:41, Pablo Neira Ayuso a écrit : > On Wed, Jun 05, 2024 at 11:09:31AM +0200, Nicolas Dichtel wrote: >> Le 05/06/2024 à 10:55, Florian Westphal a écrit : >>> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: >>>> Since the below commit, there are regressions for legacy setups: >>>> 1/ conntracks are created while there are no listener >>>> 2/ a listener starts and dumps all conntracks to get the current state >>>> 3/ conntracks deleted before the listener has started are not advertised >>>> >>>> This is problematic in containers, where conntracks could be created early. >>>> This sysctl is part of unsafe sysctl and could not be changed easily in >>>> some environments. >>>> >>>> Let's switch back to the legacy behavior. >>> >>> :-( >>> >>> Would it be possible to resolve this for containers by setting >>> the container default to 1 if init_net had it changed to 1 at netns >>> creation time? >> >> When we have access to the host, it is possible to allow the configuration of >> this (unsafe) sysctl for the pod. But there are cases where we don't have access >> to the host. >> >> https://docs.openshift.com/container-platform/4.9/nodes/containers/nodes-containers-sysctls.html#nodes-containers-sysctls-unsafe_nodes-containers-using > > conntrack is enabled on-demand by the ruleset these days, such monitor > process could be created _before_ the ruleset is loaded? It's not so easy :) There are several modules in the system. I understand it's "sad" to keep nf_conntrack_events=1, but this change breaks the backward compatibility. A container migrated to a host with a recent kernel is broken. Usually, in the networking stack, sysctl are added to keep the legacy behavior and enable new systems to use "modern" features. There are a lot of examples :)
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: > I understand it's "sad" to keep nf_conntrack_events=1, but this change breaks > the backward compatibility. A container migrated to a host with a recent kernel > is broken. > Usually, in the networking stack, sysctl are added to keep the legacy behavior > and enable new systems to use "modern" features. There are a lot of examples :) Weeks of work down the drain. I wonder if we can make any changes aside from bug fixes in the future.
Le 06/06/2024 à 10:53, Florian Westphal a écrit : > Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote: >> I understand it's "sad" to keep nf_conntrack_events=1, but this change breaks >> the backward compatibility. A container migrated to a host with a recent kernel >> is broken. >> Usually, in the networking stack, sysctl are added to keep the legacy behavior >> and enable new systems to use "modern" features. There are a lot of examples :) > > Weeks of work down the drain. I wonder if we can make any changes aside > from bug fixes in the future. The commit doesn't remove the optimization, it only keeps the existing behavior. Systems that require this optimization, could still turn nf_conntrack_event to 2.
Hi Nicolas, On Tue, Jun 04, 2024 at 03:54:38PM +0200, Nicolas Dichtel wrote: > Since the below commit, there are regressions for legacy setups: > 1/ conntracks are created while there are no listener > 2/ a listener starts and dumps all conntracks to get the current state > 3/ conntracks deleted before the listener has started are not advertised > > This is problematic in containers, where conntracks could be created early. > This sysctl is part of unsafe sysctl and could not be changed easily in > some environments. > > Let's switch back to the legacy behavior. Maybe it is possible to annotate destroy events in a percpu area if the conntrack extension is not available. This code used to follow such approach time ago.
Hi Pablo, Le 26/06/2024 à 13:41, Pablo Neira Ayuso a écrit : > Hi Nicolas, > > On Tue, Jun 04, 2024 at 03:54:38PM +0200, Nicolas Dichtel wrote: >> Since the below commit, there are regressions for legacy setups: >> 1/ conntracks are created while there are no listener >> 2/ a listener starts and dumps all conntracks to get the current state >> 3/ conntracks deleted before the listener has started are not advertised >> >> This is problematic in containers, where conntracks could be created early. >> This sysctl is part of unsafe sysctl and could not be changed easily in >> some environments. >> >> Let's switch back to the legacy behavior. > > Maybe it is possible to annotate destroy events in a percpu area if > the conntrack extension is not available. This code used to follow > such approach time ago. Thanks for the feedback. I was wondering if just sending the destroy event would be possible. TBH, I'm not very familiar with this part of the code, I need to dig a bit. I won't have time for this right now, any help would be appreciated.
On Wed, Jul 03, 2024 at 09:37:59AM +0200, Nicolas Dichtel wrote: > Hi Pablo, > > Le 26/06/2024 à 13:41, Pablo Neira Ayuso a écrit : > > Hi Nicolas, > > > > On Tue, Jun 04, 2024 at 03:54:38PM +0200, Nicolas Dichtel wrote: > >> Since the below commit, there are regressions for legacy setups: > >> 1/ conntracks are created while there are no listener > >> 2/ a listener starts and dumps all conntracks to get the current state > >> 3/ conntracks deleted before the listener has started are not advertised > >> > >> This is problematic in containers, where conntracks could be created early. > >> This sysctl is part of unsafe sysctl and could not be changed easily in > >> some environments. > >> > >> Let's switch back to the legacy behavior. > > > > Maybe it is possible to annotate destroy events in a percpu area if > > the conntrack extension is not available. This code used to follow > > such approach time ago. > > Thanks for the feedback. I was wondering if just sending the destroy event would > be possible. TBH, I'm not very familiar with this part of the code, I need to > dig a bit. I won't have time for this right now, any help would be appreciated. I will take a look and get back to you.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index c383a394c665..edc04f99e1aa 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -34,13 +34,15 @@ nf_conntrack_count - INTEGER (read-only) nf_conntrack_events - BOOLEAN - 0 - disabled - - 1 - enabled - - 2 - auto (default) + - 1 - enabled (default) + - 2 - auto If this option is enabled, the connection tracking code will provide userspace with connection tracking events via ctnetlink. - The default allocates the extension if a userspace program is - listening to ctnetlink events. + The 'auto' allocates the extension if a userspace program is + listening to ctnetlink events. Note that conntracks created + before the first listener has started won't trigger any netlink + event. nf_conntrack_expect_max - INTEGER Maximum size of expectation table. Default value is diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index 69948e1d6974..4c8559529e18 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -334,7 +334,7 @@ bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp } EXPORT_SYMBOL_GPL(nf_ct_ecache_ext_add); -#define NF_CT_EVENTS_DEFAULT 2 +#define NF_CT_EVENTS_DEFAULT 1 static int nf_ct_events __read_mostly = NF_CT_EVENTS_DEFAULT; void nf_conntrack_ecache_pernet_init(struct net *net)
Since the below commit, there are regressions for legacy setups: 1/ conntracks are created while there are no listener 2/ a listener starts and dumps all conntracks to get the current state 3/ conntracks deleted before the listener has started are not advertised This is problematic in containers, where conntracks could be created early. This sysctl is part of unsafe sysctl and could not be changed easily in some environments. Let's switch back to the legacy behavior. CC: stable@vger.kernel.org Fixes: 90d1daa45849 ("netfilter: conntrack: add nf_conntrack_events autodetect mode") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- Documentation/networking/nf_conntrack-sysctl.rst | 10 ++++++---- net/netfilter/nf_conntrack_ecache.c | 2 +- 2 files changed, 7 insertions(+), 5 deletions(-)