mbox series

[RFC,net-next,v2,0/2] Mitigate the Issue of Expired Routes in Linux IPv6 Routing Tables

Message ID 20230517183337.190591-1-kuifeng@meta.com (mailing list archive)
Headers show
Series Mitigate the Issue of Expired Routes in Linux IPv6 Routing Tables | expand

Message

Kui-Feng Lee May 17, 2023, 6:33 p.m. UTC
This RFC is resent to ensure maintainers getting awared.  Also remove
some forward declarations that we don't use anymore.

The size of a Linux IPv6 routing table can become a big problem if not
managed appropriately.  Now, Linux has a garbage collector to remove
expired routes periodically.  However, this may lead to a situation in
which the routing path is blocked for a long period due to an
excessive number of routes.

For example, years ago, there is a commit about "ICMPv6 Packet too big
messages".  The root cause is that malicious ICMPv6 packets were sent
back for every small packet sent to them. These packets have to
lookup/insert a new route, putting hosts under high stress due to
contention on a spinlock while one is stuck in fib6_run_gc().

Why Route Expires
=================

Users can add IPv6 routes with an expiration time manually. However,
the Neighbor Discovery protocol may also generate routes that can
expire.  For example, Router Advertisement (RA) messages may create a
default route with an expiration time. [RFC 4861] For IPv4, it is not
possible to set an expiration time for a route, and there is no RA, so
there is no need to worry about such issues.

Create Routes with Expires
==========================

You can create routes with expires with the  command.

For example,

    ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \ 
        dev enp0s3 expires 30

The route that has been generated will be deleted automatically in 30
seconds.

GC of FIB6
==========

The function called fib6_run_gc() is responsible for performing
garbage collection (GC) for the Linux IPv6 stack. It checks for the
expiration of every route by traversing the tries of routing
tables. The time taken to traverse a routing table increases with its
size. Holding the routing table lock during traversal is particularly
undesirable. Therefore, it is preferable to keep the lock for the
shortest possible duration.

Solution
========

The cause of the issue is keeping the routing table locked during the
traversal of large tries. To address this, the patchset eliminates
garbage collection that does the tries traversal and introduces
individual timers for each route that eventually expires.  Walking
trials are no longer necessary with the timers. Additionally, the time
required to handle a timer is consistent.

If the expiration time is long, the timer becomes less precise. The
drawback is that the longer the expiration time, the less accurate the
timer.

Kui-Feng Lee (2):
  net: Remove expired routes with separated timers.
  net: Remove unused code and variables.

 include/net/ip6_fib.h    |  21 ++---
 include/net/ip6_route.h  |   2 -
 include/net/netns/ipv6.h |   6 --
 net/ipv6/addrconf.c      |   8 +-
 net/ipv6/ip6_fib.c       | 162 ++++++++++++++++++---------------------
 net/ipv6/ndisc.c         |   4 +-
 net/ipv6/route.c         | 120 ++---------------------------
 7 files changed, 95 insertions(+), 228 deletions(-)

Comments

David Ahern May 18, 2023, 3:21 a.m. UTC | #1
On 5/17/23 12:33 PM, Kui-Feng Lee wrote:
> This RFC is resent to ensure maintainers getting awared.  Also remove
> some forward declarations that we don't use anymore.
> 
> The size of a Linux IPv6 routing table can become a big problem if not
> managed appropriately.  Now, Linux has a garbage collector to remove
> expired routes periodically.  However, this may lead to a situation in
> which the routing path is blocked for a long period due to an
> excessive number of routes.

I take it this problem was seen internally to your org? Can you give
representative numbers on how many routes, stats on the blocked time,
and reason for the large time block (I am guessing the notifier)?

> 
> For example, years ago, there is a commit about "ICMPv6 Packet too big
> messages".  The root cause is that malicious ICMPv6 packets were sent
> back for every small packet sent to them. These packets have to
> lookup/insert a new route, putting hosts under high stress due to
> contention on a spinlock while one is stuck in fib6_run_gc().
> 
> Why Route Expires
> =================
> 
> Users can add IPv6 routes with an expiration time manually. However,
> the Neighbor Discovery protocol may also generate routes that can
> expire.  For example, Router Advertisement (RA) messages may create a
> default route with an expiration time. [RFC 4861] For IPv4, it is not
> possible to set an expiration time for a route, and there is no RA, so
> there is no need to worry about such issues.
> 
> Create Routes with Expires
> ==========================
> 
> You can create routes with expires with the  command.
> 
> For example,
> 
>     ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \ 
>         dev enp0s3 expires 30


> 
> The route that has been generated will be deleted automatically in 30
> seconds.
> 
> GC of FIB6
> ==========
> 
> The function called fib6_run_gc() is responsible for performing
> garbage collection (GC) for the Linux IPv6 stack. It checks for the
> expiration of every route by traversing the tries of routing
> tables. The time taken to traverse a routing table increases with its
> size. Holding the routing table lock during traversal is particularly
> undesirable. Therefore, it is preferable to keep the lock for the
> shortest possible duration.
> 
> Solution
> ========
> 
> The cause of the issue is keeping the routing table locked during the
> traversal of large tries. To address this, the patchset eliminates
> garbage collection that does the tries traversal and introduces
> individual timers for each route that eventually expires.  Walking
> trials are no longer necessary with the timers. Additionally, the time
> required to handle a timer is consistent.

And then for the number of routes mentioned above what does that mean
for having a timer per route? If this is 10's or 100's of 1000s of
expired routes what does that mean for the timer subsystem dealing with
that number of entries on top of other timers and the impact on the
timer softirq? ie., are you just moving the problem around.

did you consider other solutions? e.g., if it is the notifier, then
perhaps the entries can be deleted from the fib and then put into a list
for cleanup in a worker thread.

> 
> If the expiration time is long, the timer becomes less precise. The
> drawback is that the longer the expiration time, the less accurate the
> timer.
> 
> Kui-Feng Lee (2):
>   net: Remove expired routes with separated timers.
>   net: Remove unused code and variables.
> 
>  include/net/ip6_fib.h    |  21 ++---
>  include/net/ip6_route.h  |   2 -
>  include/net/netns/ipv6.h |   6 --
>  net/ipv6/addrconf.c      |   8 +-
>  net/ipv6/ip6_fib.c       | 162 ++++++++++++++++++---------------------
>  net/ipv6/ndisc.c         |   4 +-
>  net/ipv6/route.c         | 120 ++---------------------------
>  7 files changed, 95 insertions(+), 228 deletions(-)
>
Kui-Feng Lee May 18, 2023, 5:40 a.m. UTC | #2
On 5/17/23 20:21, David Ahern wrote:
> On 5/17/23 12:33 PM, Kui-Feng Lee wrote:
>> This RFC is resent to ensure maintainers getting awared.  Also remove
>> some forward declarations that we don't use anymore.
>>
>> The size of a Linux IPv6 routing table can become a big problem if not
>> managed appropriately.  Now, Linux has a garbage collector to remove
>> expired routes periodically.  However, this may lead to a situation in
>> which the routing path is blocked for a long period due to an
>> excessive number of routes.
> 
> I take it this problem was seen internally to your org? Can you give
> representative numbers on how many routes, stats on the blocked time,
> and reason for the large time block (I am guessing the notifier)?

We don't have existing incidents so far.  Consider it as
a potential issue.

> 
>>
>> For example, years ago, there is a commit about "ICMPv6 Packet too big
>> messages".  The root cause is that malicious ICMPv6 packets were sent
>> back for every small packet sent to them. These packets have to
>> lookup/insert a new route, putting hosts under high stress due to
>> contention on a spinlock while one is stuck in fib6_run_gc().
>>
>> Why Route Expires
>> =================
>>
>> Users can add IPv6 routes with an expiration time manually. However,
>> the Neighbor Discovery protocol may also generate routes that can
>> expire.  For example, Router Advertisement (RA) messages may create a
>> default route with an expiration time. [RFC 4861] For IPv4, it is not
>> possible to set an expiration time for a route, and there is no RA, so
>> there is no need to worry about such issues.
>>
>> Create Routes with Expires
>> ==========================
>>
>> You can create routes with expires with the  command.
>>
>> For example,
>>
>>      ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \
>>          dev enp0s3 expires 30
> 
> 
>>
>> The route that has been generated will be deleted automatically in 30
>> seconds.
>>
>> GC of FIB6
>> ==========
>>
>> The function called fib6_run_gc() is responsible for performing
>> garbage collection (GC) for the Linux IPv6 stack. It checks for the
>> expiration of every route by traversing the tries of routing
>> tables. The time taken to traverse a routing table increases with its
>> size. Holding the routing table lock during traversal is particularly
>> undesirable. Therefore, it is preferable to keep the lock for the
>> shortest possible duration.
>>
>> Solution
>> ========
>>
>> The cause of the issue is keeping the routing table locked during the
>> traversal of large tries. To address this, the patchset eliminates
>> garbage collection that does the tries traversal and introduces
>> individual timers for each route that eventually expires.  Walking
>> trials are no longer necessary with the timers. Additionally, the time
>> required to handle a timer is consistent.
> 
> And then for the number of routes mentioned above what does that mean
> for having a timer per route? If this is 10's or 100's of 1000s of
> expired routes what does that mean for the timer subsystem dealing with
> that number of entries on top of other timers and the impact on the
> timer softirq? ie., are you just moving the problem around.

Yes, each expired route has a timer.  But, not all routes have expire
times.  The timer subsystem will handle every single one. Let me
address the timer subsystem later.

> 
> did you consider other solutions? e.g., if it is the notifier, then
> perhaps the entries can be deleted from the fib and then put into a list
> for cleanup in a worker thread.

Yes, I considered to keep a separated list of routes that is expiring,
just like what neighbor tables do.  However, we need to sort them in the
order of expire times.  Other solutions can be a RB-tree or priority
queues. However, later, I went to the timers solution suggested by
Martin Lau.

If I read it correctly, the timer subsystem handles each
timer with a constant time.  It puts timers into buckets and levels.
Every level means different granularity.  For example, it has
granularity of 1ms, 8ms (level 0), 64ms, 512ms, ... up to 4 hours
(level 8) for 1000Hz.  Each level (granularity) has 64 buckets.
Every bucket represent a time slot. That means level 0 holds
timers that is expiring in 0ms~63ms in its 64 buckets, level 1 holds
timers that is expiring in 64ms~511ms, ... so on.  What the timer
subsystem does is to emit every timers in the corresponding current
buckets of every level.  In other word, it checks the current bucket
from level 0 ~ level 8, and emit timers if there is any timer
in the buckets.

In contrast, the current GC has to walk every tree even only one route
expired.  Timers is far better. It emits every timer in the
buckets associated with current time, no search needed.  The current GC
is triggered by a timer as well.  So, it should reduce the computation
of the timer softirq.

However, just like what I mentioned earlier, the drawback of timers are
its granularity can vary.  The longer expiration time means more coarse-
grained.  But, it probably is not a big issue.

> 
>>
>> If the expiration time is long, the timer becomes less precise. The
>> drawback is that the longer the expiration time, the less accurate the
>> timer.
>>
>> Kui-Feng Lee (2):
>>    net: Remove expired routes with separated timers.
>>    net: Remove unused code and variables.
>>
>>   include/net/ip6_fib.h    |  21 ++---
>>   include/net/ip6_route.h  |   2 -
>>   include/net/netns/ipv6.h |   6 --
>>   net/ipv6/addrconf.c      |   8 +-
>>   net/ipv6/ip6_fib.c       | 162 ++++++++++++++++++---------------------
>>   net/ipv6/ndisc.c         |   4 +-
>>   net/ipv6/route.c         | 120 ++---------------------------
>>   7 files changed, 95 insertions(+), 228 deletions(-)
>>
> 
>
David Ahern May 18, 2023, 3:28 p.m. UTC | #3
On 5/17/23 11:40 PM, Kui-Feng Lee wrote:
> 
> 
> On 5/17/23 20:21, David Ahern wrote:
>> On 5/17/23 12:33 PM, Kui-Feng Lee wrote:
>>> This RFC is resent to ensure maintainers getting awared.  Also remove
>>> some forward declarations that we don't use anymore.
>>>
>>> The size of a Linux IPv6 routing table can become a big problem if not
>>> managed appropriately.  Now, Linux has a garbage collector to remove
>>> expired routes periodically.  However, this may lead to a situation in
>>> which the routing path is blocked for a long period due to an
>>> excessive number of routes.
>>
>> I take it this problem was seen internally to your org? Can you give
>> representative numbers on how many routes, stats on the blocked time,
>> and reason for the large time block (I am guessing the notifier)?
> 
> We don't have existing incidents so far.  Consider it as
> a potential issue.

So no data to compare how the system was operating before and after.

...

> 
> In contrast, the current GC has to walk every tree even only one route
> expired.  

As I recall the largest overhead is systems (e.g., switchdev) handling
the notifier. The tree walk scaling problem can be addressed with a much
simpler change -- e.g., add a list_head per fib6_table for fib6_info
entries that can expire and make the list time sorted. Then the gc only
needs to walk those lists up to the expired point.
Stephen Hemminger May 18, 2023, 3:43 p.m. UTC | #4
On Wed, 17 May 2023 22:40:08 -0700
Kui-Feng Lee <sinquersw@gmail.com> wrote:

> >> Solution
> >> ========
> >>
> >> The cause of the issue is keeping the routing table locked during the
> >> traversal of large tries. To address this, the patchset eliminates
> >> garbage collection that does the tries traversal and introduces
> >> individual timers for each route that eventually expires.  Walking
> >> trials are no longer necessary with the timers. Additionally, the time
> >> required to handle a timer is consistent.  
> > 
> > And then for the number of routes mentioned above what does that mean
> > for having a timer per route? If this is 10's or 100's of 1000s of
> > expired routes what does that mean for the timer subsystem dealing with
> > that number of entries on top of other timers and the impact on the
> > timer softirq? ie., are you just moving the problem around.  
> 
> Yes, each expired route has a timer.  But, not all routes have expire
> times.  The timer subsystem will handle every single one. Let me
> address the timer subsystem later.
> 
> > 
> > did you consider other solutions? e.g., if it is the notifier, then
> > perhaps the entries can be deleted from the fib and then put into a list
> > for cleanup in a worker thread.  
> 
> Yes, I considered to keep a separated list of routes that is expiring,
> just like what neighbor tables do.  However, we need to sort them in the
> order of expire times.  Other solutions can be a RB-tree or priority
> queues. However, later, I went to the timers solution suggested by
> Martin Lau.
> 
> If I read it correctly, the timer subsystem handles each
> timer with a constant time.  It puts timers into buckets and levels.
> Every level means different granularity.  For example, it has
> granularity of 1ms, 8ms (level 0), 64ms, 512ms, ... up to 4 hours
> (level 8) for 1000Hz.  Each level (granularity) has 64 buckets.
> Every bucket represent a time slot. That means level 0 holds
> timers that is expiring in 0ms~63ms in its 64 buckets, level 1 holds
> timers that is expiring in 64ms~511ms, ... so on.  What the timer
> subsystem does is to emit every timers in the corresponding current
> buckets of every level.  In other word, it checks the current bucket
> from level 0 ~ level 8, and emit timers if there is any timer
> in the buckets.
> 
> In contrast, the current GC has to walk every tree even only one route
> expired.  Timers is far better. It emits every timer in the
> buckets associated with current time, no search needed.  The current GC
> is triggered by a timer as well.  So, it should reduce the computation
> of the timer softirq.
> 
> However, just like what I mentioned earlier, the drawback of timers are
> its granularity can vary.  The longer expiration time means more coarse-
> grained.  But, it probably is not a big issue.

If Linux is used on backbone router it can easily have 3 million routes
to deal with. That won't make timer subsystem happy.
Kui-Feng Lee May 18, 2023, 4:59 p.m. UTC | #5
On 5/18/23 08:43, Stephen Hemminger wrote:
> On Wed, 17 May 2023 22:40:08 -0700
> Kui-Feng Lee <sinquersw@gmail.com> wrote:
> 
>>>> Solution
>>>> ========
>>>>
>>>> The cause of the issue is keeping the routing table locked during the
>>>> traversal of large tries. To address this, the patchset eliminates
>>>> garbage collection that does the tries traversal and introduces
>>>> individual timers for each route that eventually expires.  Walking
>>>> trials are no longer necessary with the timers. Additionally, the time
>>>> required to handle a timer is consistent.
>>>
>>> And then for the number of routes mentioned above what does that mean
>>> for having a timer per route? If this is 10's or 100's of 1000s of
>>> expired routes what does that mean for the timer subsystem dealing with
>>> that number of entries on top of other timers and the impact on the
>>> timer softirq? ie., are you just moving the problem around.
>>
>> Yes, each expired route has a timer.  But, not all routes have expire
>> times.  The timer subsystem will handle every single one. Let me
>> address the timer subsystem later.
>>
>>>
>>> did you consider other solutions? e.g., if it is the notifier, then
>>> perhaps the entries can be deleted from the fib and then put into a list
>>> for cleanup in a worker thread.
>>
>> Yes, I considered to keep a separated list of routes that is expiring,
>> just like what neighbor tables do.  However, we need to sort them in the
>> order of expire times.  Other solutions can be a RB-tree or priority
>> queues. However, later, I went to the timers solution suggested by
>> Martin Lau.
>>
>> If I read it correctly, the timer subsystem handles each
>> timer with a constant time.  It puts timers into buckets and levels.
>> Every level means different granularity.  For example, it has
>> granularity of 1ms, 8ms (level 0), 64ms, 512ms, ... up to 4 hours
>> (level 8) for 1000Hz.  Each level (granularity) has 64 buckets.
>> Every bucket represent a time slot. That means level 0 holds
>> timers that is expiring in 0ms~63ms in its 64 buckets, level 1 holds
>> timers that is expiring in 64ms~511ms, ... so on.  What the timer
>> subsystem does is to emit every timers in the corresponding current
>> buckets of every level.  In other word, it checks the current bucket
>> from level 0 ~ level 8, and emit timers if there is any timer
>> in the buckets.
>>
>> In contrast, the current GC has to walk every tree even only one route
>> expired.  Timers is far better. It emits every timer in the
>> buckets associated with current time, no search needed.  The current GC
>> is triggered by a timer as well.  So, it should reduce the computation
>> of the timer softirq.
>>
>> However, just like what I mentioned earlier, the drawback of timers are
>> its granularity can vary.  The longer expiration time means more coarse-
>> grained.  But, it probably is not a big issue.
> 
> If Linux is used on backbone router it can easily have 3 million routes
> to deal with. That won't make timer subsystem happy.

I will run experiments to get numbers to see how the compact
actually is.
Kui-Feng Lee May 18, 2023, 6:51 p.m. UTC | #6
On 5/18/23 08:28, David Ahern wrote:
> On 5/17/23 11:40 PM, Kui-Feng Lee wrote:
>>
>>
>> On 5/17/23 20:21, David Ahern wrote:
>>> On 5/17/23 12:33 PM, Kui-Feng Lee wrote:
>>>> This RFC is resent to ensure maintainers getting awared.  Also remove
>>>> some forward declarations that we don't use anymore.
>>>>
>>>> The size of a Linux IPv6 routing table can become a big problem if not
>>>> managed appropriately.  Now, Linux has a garbage collector to remove
>>>> expired routes periodically.  However, this may lead to a situation in
>>>> which the routing path is blocked for a long period due to an
>>>> excessive number of routes.
>>>
>>> I take it this problem was seen internally to your org? Can you give
>>> representative numbers on how many routes, stats on the blocked time,
>>> and reason for the large time block (I am guessing the notifier)?
>>
>> We don't have existing incidents so far.  Consider it as
>> a potential issue.
> 
> So no data to compare how the system was operating before and after.

I can generate traffic to test it.

> 
> ...
> 
>>
>> In contrast, the current GC has to walk every tree even only one route
>> expired.
> 
> As I recall the largest overhead is systems (e.g., switchdev) handling
> the notifier. The tree walk scaling problem can be addressed with a much
> simpler change -- e.g., add a list_head per fib6_table for fib6_info
> entries that can expire and make the list time sorted. Then the gc only
> needs to walk those lists up to the expired point.

This is one of solutions I considered at beginning.
With this approach, we can have a maximum
number of entries like what neighbor tables do.
Remove entries only if the list reach the maximum without running
a GC timer.  However, it can be very inefficient to insert a new entry
ordered. Stephen mentioned 3 million routes on backbone router
in another message.  We may need something more complicated
like RB-tree or HEAP to reduce the overhead.
David Ahern May 19, 2023, 2:35 a.m. UTC | #7
On 5/18/23 12:51 PM, Kui-Feng Lee wrote:
> This is one of solutions I considered at beginning.
> With this approach, we can have a maximum
> number of entries like what neighbor tables do.
> Remove entries only if the list reach the maximum without running
> a GC timer.  However, it can be very inefficient to insert a new entry
> ordered. Stephen mentioned 3 million routes on backbone router
> in another message.  We may need something more complicated
> like RB-tree or HEAP to reduce the overhead.

I do not believe so. Not every route will have an expiration timer. Only
those are added to the new, time ordered list_head.

And you can not cap the number of entries in a FIB.