diff mbox series

[net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message ID e6c4681dd9205d702ae2e6124e20c6210520e76e.1683326865.git.peilin.ye@bytedance.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series net/sched: Fixes for sch_ingress and sch_clsact | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1390 this patch: 1390
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 148 this patch: 148
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1410 this patch: 1410
netdev/checkpatch warning WARNING: From:/Signed-off-by: email address mismatch: 'From: Peilin Ye <yepeilin.cs@gmail.com>' != 'Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>' WARNING: line length of 84 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Peilin Ye May 6, 2023, 12:16 a.m. UTC
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
access the same miniq_{in,e}gress pointer(s) concurrently with the new
Qdisc, causing race conditions [1] including a use-after-free in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
finished.

To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
qdisc_destroy() for "old" before grafting "new".

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-lockless filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-lockless)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
           RCU                   2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is just one of the possible consequences of concurrently accessing
net_device::miniq_{in,e}gress pointers.  The point is clear, however:
B's first call to mini_qdisc_pair_swap() should take place after A's
last call, in qdisc_destroy().

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 26 +++++++++++++++++++++-----
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 40 insertions(+), 8 deletions(-)

Comments

Jamal Hadi Salim May 8, 2023, 11:32 a.m. UTC | #1
On Fri, May 5, 2023 at 8:16 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
>
>  BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>  Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
>  Call Trace:
>   <TASK>
>   __dump_stack lib/dump_stack.c:88 [inline]
>   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>   print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>   print_report mm/kasan/report.c:430 [inline]
>   kasan_report+0x11c/0x130 mm/kasan/report.c:536
>   mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>   tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>   tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>   tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>   tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
>
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests.  Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
>
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
>
>  Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
>
>   RTM_NEWTFILTER (X, RTNL-lockless)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>            RCU                   2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers.  The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
> Cc: Hillf Danton <hdanton@sina.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>



Thanks for the excellent analysis Peilin and for chasing this to the
end. I have no doubt it was a lot of fun! ;->

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

> ---
>  include/net/sch_generic.h |  8 ++++++++
>  net/sched/sch_api.c       | 26 +++++++++++++++++++++-----
>  net/sched/sch_generic.c   | 14 +++++++++++---
>  3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>         refcount_inc(&qdisc->refcnt);
>  }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +       if (qdisc->flags & TCQ_F_BUILTIN)
> +               return true;
> +       return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>  /* Intended to be used by unlocked users, when concurrent qdisc release is
>   * possible.
>   */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>  struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>                               struct Qdisc *qdisc);
>  void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>  void qdisc_put(struct Qdisc *qdisc);
>  void qdisc_put_unlocked(struct Qdisc *qdisc);
>  void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                 if ((q && q->flags & TCQ_F_INGRESS) ||
>                     (new && new->flags & TCQ_F_INGRESS)) {
>                         ingress = 1;
> -                       if (!dev_ingress_queue(dev)) {
> +                       dev_queue = dev_ingress_queue(dev);
> +                       if (!dev_queue) {
>                                 NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>                                 return -ENOENT;
>                         }
> +
> +                       /* This is the counterpart of that qdisc_refcount_inc_nz() call in
> +                        * __tcf_qdisc_find() for RTNL-lockless filter requests.
> +                        */
> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +                               NL_SET_ERR_MSG(extack,
> +                                              "Current ingress or clsact Qdisc has ongoing filter request(s)");
> +                               return -EBUSY;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                 qdisc_put(old);
>                         }
>                 } else {
> -                       dev_queue = dev_ingress_queue(dev);
> -                       old = dev_graft_qdisc(dev_queue, new);
> +                       old = dev_graft_qdisc(dev_queue, NULL);
> +
> +                       /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> +                        * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +                        * pointer(s) in mini_qdisc_pair_swap().
> +                        */
> +                       qdisc_notify(net, skb, n, classid, old, new, extack);
> +                       qdisc_destroy(old);
> +
> +                       dev_graft_qdisc(dev_queue, new);
>                 }
>
>  skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
>                         if (new && new->ops->attach)
>                                 new->ops->attach(new);
> -               } else {
> -                       notify_and_destroy(net, skb, n, classid, old, new, extack);
>                 }
>
>                 if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>         qdisc_free(q);
>  }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>         const struct Qdisc_ops  *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>         call_rcu(&qdisc->rcu, qdisc_free_cb);
>  }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +       if (qdisc->flags & TCQ_F_BUILTIN)
> +               return;
> +
> +       __qdisc_destroy(qdisc);
> +}
> +
>  void qdisc_put(struct Qdisc *qdisc)
>  {
>         if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>             !refcount_dec_and_test(&qdisc->refcnt))
>                 return;
>
> -       qdisc_destroy(qdisc);
> +       __qdisc_destroy(qdisc);
>  }
>  EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>             !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>                 return;
>
> -       qdisc_destroy(qdisc);
> +       __qdisc_destroy(qdisc);
>         rtnl_unlock();
>  }
>  EXPORT_SYMBOL(qdisc_put_unlocked);
> --
> 2.20.1
>
Pedro Tammela May 8, 2023, 2:12 p.m. UTC | #2
On 05/05/2023 21:16, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> clsact Qdiscs and miniq_egress.
> 
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
> 
>   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
>   Call Trace:
>    <TASK>
>    __dump_stack lib/dump_stack.c:88 [inline]
>    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>    print_report mm/kasan/report.c:430 [inline]
>    kasan_report+0x11c/0x130 mm/kasan/report.c:536
>    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
> 
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
> 
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
> 
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests.  Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
> 
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
> 
>    Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>    adds a flower filter X to A.
> 
>    Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>    b2) to replace A, then adds a flower filter Y to B.
> 
>   Thread 1               A's refcnt   Thread 2
>    RTM_NEWQDISC (A, RTNL-locked)
>     qdisc_create(A)               1
>     qdisc_graft(A)                9
> 
>    RTM_NEWTFILTER (X, RTNL-lockless)
>     __tcf_qdisc_find(A)          10
>     tcf_chain0_head_change(A)
>     mini_qdisc_pair_swap(A) (1st)
>              |
>              |                         RTM_NEWQDISC (B, RTNL-locked)
>             RCU                   2     qdisc_graft(B)
>              |                    1     notify_and_destroy(A)
>              |
>     tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>     qdisc_destroy(A)                    tcf_chain0_head_change(B)
>     tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>     mini_qdisc_pair_swap(A) (3rd)                |
>             ...                                 ...
> 
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
> 
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers.  The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
> 
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
> Cc: Hillf Danton <hdanton@sina.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Thanks for chasing this!

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
>   include/net/sch_generic.h |  8 ++++++++
>   net/sched/sch_api.c       | 26 +++++++++++++++++++++-----
>   net/sched/sch_generic.c   | 14 +++++++++++---
>   3 files changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>   	refcount_inc(&qdisc->refcnt);
>   }
>   
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>   /* Intended to be used by unlocked users, when concurrent qdisc release is
>    * possible.
>    */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>   struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>   			      struct Qdisc *qdisc);
>   void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>   void qdisc_put(struct Qdisc *qdisc);
>   void qdisc_put_unlocked(struct Qdisc *qdisc);
>   void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   		if ((q && q->flags & TCQ_F_INGRESS) ||
>   		    (new && new->flags & TCQ_F_INGRESS)) {
>   			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>   				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>   				return -ENOENT;
>   			}
> +
> +			/* This is the counterpart of that qdisc_refcount_inc_nz() call in
> +			 * __tcf_qdisc_find() for RTNL-lockless filter requests.
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +				NL_SET_ERR_MSG(extack,
> +					       "Current ingress or clsact Qdisc has ongoing filter request(s)");
> +				return -EBUSY;
> +			}
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   				qdisc_put(old);
>   			}
>   		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>   		}
>   
>   skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   
>   			if (new && new->ops->attach)
>   				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>   		}
>   
>   		if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>   	qdisc_free(q);
>   }
>   
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>   {
>   	const struct Qdisc_ops  *ops = qdisc->ops;
>   
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>   	call_rcu(&qdisc->rcu, qdisc_free_cb);
>   }
>   
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>   void qdisc_put(struct Qdisc *qdisc)
>   {
>   	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>   	    !refcount_dec_and_test(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   }
>   EXPORT_SYMBOL(qdisc_put);
>   
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>   	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   	rtnl_unlock();
>   }
>   EXPORT_SYMBOL(qdisc_put_unlocked);
Peilin Ye May 8, 2023, 9:58 p.m. UTC | #3
On Mon, May 08, 2023 at 07:32:01AM -0400, Jamal Hadi Salim wrote:
> Thanks for the excellent analysis Peilin and for chasing this to the
> end. I have no doubt it was a lot of fun! ;->

Of course ;-)

> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

Thanks for reviewing this!

Peilin Ye
Peilin Ye May 8, 2023, 10:01 p.m. UTC | #4
On Mon, May 08, 2023 at 11:12:24AM -0300, Pedro Tammela wrote:
> Thanks for chasing this!
> 
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>

Thanks for testing, Pedro!

Thanks,
Peilin Ye
Jakub Kicinski May 9, 2023, 1:33 a.m. UTC | #5
On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
> 
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
> 
>  Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
> 
>   RTM_NEWTFILTER (X, RTNL-lockless)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>            RCU                   2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...
> 
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
> 
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers.  The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().

Great analysis, thanks for squashing this bug.

Have you considered creating a fix more localized to the miniq
implementation? It seems that having per-device miniq pointers is
incompatible with using reference counted objects. So miniq is 
a more natural place to solve the problem. Otherwise workarounds 
in the core keep piling up (here qdisc_graft()).

Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
If active qdisc is neither a1 nor a2 we should leave the dev state
alone.
Peilin Ye May 10, 2023, 8:11 p.m. UTC | #6
On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.

Thanks, happy to help!

> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.

Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:

(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL.  It will need a logic like:

  I am A.  Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
  and do it atomically.

We need more than a cmpxchg() to implement this "set NULL iff a1 or a2".
Additionally:

On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
> 
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
> 
>  Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
> 
>   RTM_NEWTFILTER (X, RTNL-lockless)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>            RCU                   2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...

Looking at the code, I think there is no guarantee that (1st) cannot
happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
handlers get preempted?

If (1st) happens later than (2nd), we will need to make (1st) no-op, by
detecting that we are the "old" Qdisc.  I am not sure there is any
(clean) way to do it.  I even thought about:

  (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
  (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping".  If false, it
      means we are the "old" Qdisc (have been replaced), and should do
      nothing.

However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...

Eventually I created [5,6/6].  It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue.  However I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().

Thanks,
Peilin Ye
Jakub Kicinski May 10, 2023, 11:15 p.m. UTC | #7
On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> >   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> >   adds a flower filter X to A.
> > 
> >   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> >   b2) to replace A, then adds a flower filter Y to B.
> > 
> >  Thread 1               A's refcnt   Thread 2
> >   RTM_NEWQDISC (A, RTNL-locked)
> >    qdisc_create(A)               1
> >    qdisc_graft(A)                9
> > 
> >   RTM_NEWTFILTER (X, RTNL-lockless)
> >    __tcf_qdisc_find(A)          10
> >    tcf_chain0_head_change(A)
> >    mini_qdisc_pair_swap(A) (1st)
> >             |
> >             |                         RTM_NEWQDISC (B, RTNL-locked)
> >            RCU                   2     qdisc_graft(B)
> >             |                    1     notify_and_destroy(A)
> >             |
> >    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
> >    qdisc_destroy(A)                    tcf_chain0_head_change(B)
> >    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
> >    mini_qdisc_pair_swap(A) (3rd)                |
> >            ...                                 ...  
> 
> Looking at the code, I think there is no guarantee that (1st) cannot
> happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> handlers get preempted?

Right, we need qdisc_graft(B) to update the appropriate dev pointer 
to point to b1. With that the ordering should not matter. Probably
using the ->attach() callback?

> If (1st) happens later than (2nd), we will need to make (1st) no-op, by
> detecting that we are the "old" Qdisc.  I am not sure there is any
> (clean) way to do it.  I even thought about:
> 
>   (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
>   (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping".  If false, it
>       means we are the "old" Qdisc (have been replaced), and should do
>       nothing.
> 
> However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
> egress one, so I can't container_of() during step (1) ...

And we can't be using multiple pieces of information to make 
the decision since AFAIU mini_qdisc_pair_swap() can race with
qdisc_graft().

My thinking was to make sure that dev->miniq_* pointers always point
to one of the miniqs of the currently attached qdisc. Right now, on 
a quick look, those pointers are not initialized during initial graft,
only when first filter is added, and may be cleared when filters are
removed. But I don't think that's strictly required, miniq with no
filters should be fine.

> Eventually I created [5,6/6].  It is a workaround indeed, in the sense
> that it changes sch_api.c to avoid a mini Qdisc issue.  However I think it
> makes the code correct in a relatively understandable way,

What's your benchmark for being understandable?

> without slowing down mini_qdisc_pair_swap() or sch_handle_*gress().
Peilin Ye May 11, 2023, 8:40 p.m. UTC | #8
On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> My thinking was to make sure that dev->miniq_* pointers always point
> to one of the miniqs of the currently attached qdisc. Right now, on
> a quick look, those pointers are not initialized during initial graft,
> only when first filter is added, and may be cleared when filters are
> removed. But I don't think that's strictly required, miniq with no
> filters should be fine.

Ah, I see, thanks for explaining, I didn't think of that.  Looking at
sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
(mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
Should I keep this behavior by:

diff --git a/net/core/dev.c b/net/core/dev.c
index 735096d42c1d..9016481377e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
         * that are not configured with an ingress qdisc will bail
         * out here.
         */
-       if (!miniq)
+       if (!miniq || !miniq->filter_list)
                return skb;

        if (*pt_prev) {

On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > >  Thread 1               A's refcnt   Thread 2
> > >   RTM_NEWQDISC (A, RTNL-locked)
> > >    qdisc_create(A)               1
> > >    qdisc_graft(A)                9
> > >
> > >   RTM_NEWTFILTER (X, RTNL-lockless)
> > >    __tcf_qdisc_find(A)          10
> > >    tcf_chain0_head_change(A)
> > >    mini_qdisc_pair_swap(A) (1st)
> > >             |
> > >             |                         RTM_NEWQDISC (B, RTNL-locked)
> > >            RCU                   2     qdisc_graft(B)
> > >             |                    1     notify_and_destroy(A)
> > >             |
> > >    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
> > >    qdisc_destroy(A)                    tcf_chain0_head_change(B)
> > >    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
> > >    mini_qdisc_pair_swap(A) (3rd)                |
> > >            ...                                 ...
> >
> > Looking at the code, I think there is no guarantee that (1st) cannot
> > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > handlers get preempted?
>
> Right, we need qdisc_graft(B) to update the appropriate dev pointer
> to point to b1. With that the ordering should not matter. Probably
> using the ->attach() callback?

->attach() is later than dev_graft_qdisc().  We should get ready for
concurrent filter requests (i.e. have dev pointer pointing to b1) before
grafting (publishing) B.  Also currently qdisc_graft() doesn't call
->attach() for ingress and clsact Qdiscs.

But I see your point, thanks for the suggestion!  I'll try ->init() and
create v2.

Thanks,
Peilin Ye
Jakub Kicinski May 11, 2023, 11:20 p.m. UTC | #9
On Thu, 11 May 2023 13:40:10 -0700 Peilin Ye wrote:
> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > My thinking was to make sure that dev->miniq_* pointers always point
> > to one of the miniqs of the currently attached qdisc. Right now, on
> > a quick look, those pointers are not initialized during initial graft,
> > only when first filter is added, and may be cleared when filters are
> > removed. But I don't think that's strictly required, miniq with no
> > filters should be fine.  
> 
> Ah, I see, thanks for explaining, I didn't think of that.  Looking at
> sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
> (mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
> Should I keep this behavior by:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 735096d42c1d..9016481377e0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>          * that are not configured with an ingress qdisc will bail
>          * out here.
>          */
> -       if (!miniq)
> +       if (!miniq || !miniq->filter_list)
>                 return skb;
> 
>         if (*pt_prev) {

Good question, maybe Jiri or Daniel can answer?

> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:  
> > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > > handlers get preempted?  
> >
> > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > to point to b1. With that the ordering should not matter. Probably
> > using the ->attach() callback?  
> 
> ->attach() is later than dev_graft_qdisc().  We should get ready for  
> concurrent filter requests (i.e. have dev pointer pointing to b1) before
> grafting (publishing) B. 

I thought even for "unlocked" filter operations the start of it is
under the lock, but the lock gets dropped after qdisc/block are found.
I could be misremembering, I haven't looked at the code.

> Also currently qdisc_graft() doesn't call
> ->attach() for ingress and clsact Qdiscs.  
> 
> But I see your point, thanks for the suggestion!  I'll try ->init() and
> create v2.

->init() may be too early, aren't there any error points which could
prevent the Qdisc from binding after ->init() was called?
Peilin Ye May 11, 2023, 11:46 p.m. UTC | #10
On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > But I see your point, thanks for the suggestion!  I'll try ->init() and
> > create v2.
>
> ->init() may be too early, aren't there any error points which could
> prevent the Qdisc from binding after ->init() was called?

You're right, it's in qdisc_create(), argh...

On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > > > handlers get preempted?
> > >
> > > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > > to point to b1. With that the ordering should not matter. Probably
> > > using the ->attach() callback?
> >
> > ->attach() is later than dev_graft_qdisc().  We should get ready for
> > concurrent filter requests (i.e. have dev pointer pointing to b1) before
> > grafting (publishing) B.
>
> I thought even for "unlocked" filter operations the start of it is
> under the lock, but the lock gets dropped after qdisc/block are found.
> I could be misremembering, I haven't looked at the code.

No, f.e. RTM_NEWTFILTER is registered as RTNL_FLAG_DOIT_UNLOCKED, so
tc_new_tfilter() starts and calls __tcf_qdisc_find() without RTNL mutex,
at least in latest code.

Thinking,
Peilin Ye
Peilin Ye May 12, 2023, 12:11 a.m. UTC | #11
On Thu, May 11, 2023 at 04:46:33PM -0700, Peilin Ye wrote:
> On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > But I see your point, thanks for the suggestion!  I'll try ->init() and
> > > create v2.
> >
> > ->init() may be too early, aren't there any error points which could
> > prevent the Qdisc from binding after ->init() was called?
>
> You're right, it's in qdisc_create(), argh...

->destroy() is called for all error points between ->init() and
dev_graft_qdisc().  I'll try handling it in ->destroy().

Thanks,
Peilin Ye
Peilin Ye May 15, 2023, 10:45 p.m. UTC | #12
On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > ->init() may be too early, aren't there any error points which could
> > > prevent the Qdisc from binding after ->init() was called?
> >
> > You're right, it's in qdisc_create(), argh...
>
> ->destroy() is called for all error points between ->init() and
> dev_graft_qdisc().  I'll try handling it in ->destroy().

Sorry for any confusion: there is no point at all undoing "setting dev
pointer to b1" in ->destroy() because datapath has already been affected.

To summarize, grafting B mustn't fail after setting dev pointer to b1, so
->init() is too early, because e.g. if user requested [1] to create a rate
estimator, gen_new_estimator() could fail after ->init() in
qdisc_create().

On the other hand, ->attach() is too late because it's later than
dev_graft_qdisc(), so concurrent filter requests might see uninitialized
dev pointer in theory.

Please suggest; is adding another callback (or calling ->attach()) right
before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
fix?

[1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Thanks,
Peilin Ye
Jakub Kicinski May 16, 2023, 7:22 p.m. UTC | #13
On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > You're right, it's in qdisc_create(), argh...  
> >  
> > ->destroy() is called for all error points between ->init() and  
> > dev_graft_qdisc().  I'll try handling it in ->destroy().  
> 
> Sorry for any confusion: there is no point at all undoing "setting dev
> pointer to b1" in ->destroy() because datapath has already been affected.
> 
> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
> ->init() is too early, because e.g. if user requested [1] to create a rate  
> estimator, gen_new_estimator() could fail after ->init() in
> qdisc_create().
> 
> On the other hand, ->attach() is too late because it's later than
> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
> dev pointer in theory.
> 
> Please suggest; is adding another callback (or calling ->attach()) right
> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
> fix?
> 
> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Vlad, could you please clarify how you expect the unlocked filter
operations to work when the qdisc has global state?
Vlad Buslov May 16, 2023, 7:35 p.m. UTC | #14
On Tue 16 May 2023 at 12:22, Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
>> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
>> > > You're right, it's in qdisc_create(), argh...  
>> >  
>> > ->destroy() is called for all error points between ->init() and  
>> > dev_graft_qdisc().  I'll try handling it in ->destroy().  
>> 
>> Sorry for any confusion: there is no point at all undoing "setting dev
>> pointer to b1" in ->destroy() because datapath has already been affected.
>> 
>> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
>> ->init() is too early, because e.g. if user requested [1] to create a rate  
>> estimator, gen_new_estimator() could fail after ->init() in
>> qdisc_create().
>> 
>> On the other hand, ->attach() is too late because it's later than
>> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
>> dev pointer in theory.
>> 
>> Please suggest; is adding another callback (or calling ->attach()) right
>> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
>> fix?
>> 
>> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
>
> Vlad, could you please clarify how you expect the unlocked filter
> operations to work when the qdisc has global state?

Jakub, I didn't account for per-net_device pointer usage by miniqp code
hence this bug. I didn't comment on the fix because I was away from my
PC last week but Peilin's approach seems reasonable to me. When Peilin
brought up the issue initially I also tried to come up with some trick
to contain the changes to miniqp code instead of changing core but
couldn't think of anything workable due to the limitations already
discussed in this thread. I'm open to explore alternative approaches to
solving this issue, if that is what you suggest.
Jakub Kicinski May 16, 2023, 9:50 p.m. UTC | #15
On Tue, 16 May 2023 22:35:51 +0300 Vlad Buslov wrote:
> > Vlad, could you please clarify how you expect the unlocked filter
> > operations to work when the qdisc has global state?  
> 
> Jakub, I didn't account for per-net_device pointer usage by miniqp code
> hence this bug. I didn't comment on the fix because I was away from my
> PC last week but Peilin's approach seems reasonable to me. When Peilin
> brought up the issue initially I also tried to come up with some trick
> to contain the changes to miniqp code instead of changing core but
> couldn't think of anything workable due to the limitations already
> discussed in this thread. I'm open to explore alternative approaches to
> solving this issue, if that is what you suggest.

Given Peilin's investigation I think fix without changing core may
indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
is elevated will be appreciated by the users, do we already have
similar behavior in other parts of TC?
Peilin Ye May 16, 2023, 10:58 p.m. UTC | #16
On Tue, May 16, 2023 at 02:50:10PM -0700, Jakub Kicinski wrote:
> > > Vlad, could you please clarify how you expect the unlocked filter
> > > operations to work when the qdisc has global state?  
> > 
> > Jakub, I didn't account for per-net_device pointer usage by miniqp code
> > hence this bug. I didn't comment on the fix because I was away from my
> > PC last week but Peilin's approach seems reasonable to me. When Peilin
> > brought up the issue initially I also tried to come up with some trick
> > to contain the changes to miniqp code instead of changing core but
> > couldn't think of anything workable due to the limitations already
> > discussed in this thread. I'm open to explore alternative approaches to
> > solving this issue, if that is what you suggest.
> 
> Given Peilin's investigation I think fix without changing core may
> indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> is elevated will be appreciated by the users, do we already have
> similar behavior in other parts of TC?

Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY:

net/sched/cls_u32.c:

	if (ht->refcnt == 1) {
		u32_destroy_hnode(tp, ht, extack);
	} else {
		NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
		return -EBUSY;
	}

Thanks,
Peilin Ye
Jakub Kicinski May 17, 2023, 12:39 a.m. UTC | #17
On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> > Given Peilin's investigation I think fix without changing core may
> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> > is elevated will be appreciated by the users, do we already have
> > similar behavior in other parts of TC?  
> 
> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY

I meant -EBUSY due to a race (another operation being in flight).
I think that's different.
Vlad Buslov May 17, 2023, 8:49 a.m. UTC | #18
On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
>> > Given Peilin's investigation I think fix without changing core may
>> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
>> > is elevated will be appreciated by the users, do we already have
>> > similar behavior in other parts of TC?  
>> 
>> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
>
> I meant -EBUSY due to a race (another operation being in flight).
> I think that's different.

I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
functionality instead of returning error to the user would be a better
approach? Currently the function is replayed when qdisc_create() returns
EAGAIN. It should be trivial to do the same for qdisc_graft() result.
Jakub Kicinski May 17, 2023, 6:48 p.m. UTC | #19
On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:  
> >> 
> >> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY  
> >
> > I meant -EBUSY due to a race (another operation being in flight).
> > I think that's different.  
> 
> I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> functionality instead of returning error to the user would be a better
> approach? Currently the function is replayed when qdisc_create() returns
> EAGAIN. It should be trivial to do the same for qdisc_graft() result.

Sounds better than returning -EBUSY to the user and expecting them 
to retry, yes.
Jakub Kicinski May 17, 2023, 6:48 p.m. UTC | #20
On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
>  		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);

BTW can't @old be NULL here?
Peilin Ye May 17, 2023, 9:18 p.m. UTC | #21
On Wed, May 17, 2023 at 11:48:25AM -0700, Jakub Kicinski wrote:
> >             } else {
> > -                   dev_queue = dev_ingress_queue(dev);
> > -                   old = dev_graft_qdisc(dev_queue, new);
> > +                   old = dev_graft_qdisc(dev_queue, NULL);
> > +
> > +                   /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> > +                    * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> > +                    * pointer(s) in mini_qdisc_pair_swap().
> > +                    */
> > +                   qdisc_notify(net, skb, n, classid, old, new, extack);
> > +                   qdisc_destroy(old);
> > +
> > +                   dev_graft_qdisc(dev_queue, new);
>
> BTW can't @old be NULL here?

ingress_queue->qdisc_sleeping is initialized to &noop_qdisc (placeholder)
in dev_ingress_queue_create(), and dev_graft_qdisc() also grafts
&noop_qdisc to represent "there's no Qdisc":

	/* ... and graft new one */
	if (qdisc == NULL)
		qdisc = &noop_qdisc;
	dev_queue->qdisc_sleeping = qdisc;

So @old can't be NULL here.

Thanks,
Peilin Ye
Peilin Ye May 17, 2023, 10:20 p.m. UTC | #22
On Wed, May 17, 2023 at 11:48:05AM -0700, Jakub Kicinski wrote:
> On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> > I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> > functionality instead of returning error to the user would be a better
> > approach? Currently the function is replayed when qdisc_create() returns
> > EAGAIN. It should be trivial to do the same for qdisc_graft() result.
>
> Sounds better than returning -EBUSY to the user and expecting them
> to retry, yes.

Thanks for the suggestion, Vlad!  I'll try this in tc_modify_qdisc() and
tc_get_qdisc() (for Qdisc deletion) in v2.

Thanks,
Peilin Ye
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@  static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@  void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..a2d07bc8ded6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,20 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* This is the counterpart of that qdisc_refcount_inc_nz() call in
+			 * __tcf_qdisc_find() for RTNL-lockless filter requests.
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+				NL_SET_ERR_MSG(extack,
+					       "Current ingress or clsact Qdisc has ongoing filter request(s)");
+				return -EBUSY;
+			}
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1114,16 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1137,6 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@  static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@  static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@  void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@  void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);