[v2,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  The same
applies to clsact Qdiscs and miniq_egress.
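
As a rough illustration of the pointer relationship described above, here is
a simplified userspace model in C.  It is only a sketch: all *_model names
and the pair_init()/pair_swap() helpers are hypothetical, RCU and locking
are left out entirely, and the swap logic is an assumption about the general
shape of mini_qdisc_pair_swap(), not a copy of the kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mini_Qdisc. */
struct mini_qdisc_model {
	const void *filter_list;	/* filters attached to this mini Qdisc */
};

/* Hypothetical stand-in for the relevant part of struct net_device. */
struct net_device_model {
	struct mini_qdisc_model *miniq_ingress;
	struct mini_qdisc_model *miniq_egress;
};

/* Hypothetical stand-in for struct mini_Qdisc_pair. */
struct mini_qdisc_pair_model {
	struct mini_qdisc_model miniq1;
	struct mini_qdisc_model miniq2;
	struct mini_qdisc_model **p_miniq;	/* points into the device */
};

/* Models what ingress_init() does: remember where the device pointer lives. */
static void pair_init(struct mini_qdisc_pair_model *pair,
		      struct mini_qdisc_model **p_miniq)
{
	pair->miniq1.filter_list = NULL;
	pair->miniq2.filter_list = NULL;
	pair->p_miniq = p_miniq;
}

/*
 * Models the swap: with no filters the device pointer is cleared,
 * otherwise the currently inactive mini Qdisc of the pair is published.
 */
static void pair_swap(struct mini_qdisc_pair_model *pair, const void *filters)
{
	struct mini_qdisc_model *active = *pair->p_miniq;
	struct mini_qdisc_model *next;

	if (!filters) {
		*pair->p_miniq = NULL;
		return;
	}

	next = (active != &pair->miniq1) ? &pair->miniq1 : &pair->miniq2;
	next->filter_list = filters;
	*pair->p_miniq = next;
}
```

Because every pair only stores the address of the device pointer, two pairs
initialized against the same net_device end up writing to the same location,
which is what makes concurrent access by two Qdiscs possible.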

Unfortunately, since the introduction of RTNL-unlocked
RTM_{NEW,DEL,GET}TFILTER requests (thanks to Hillf Danton for the hint),
the old Qdisc ("@old") can access the same miniq_{in,e}gress pointer(s)
concurrently with the new Qdisc ("@new"), for example when an ingress or
clsact Qdisc is being replaced.  This causes race conditions [1],
including a use-after-free bug in mini_qdisc_pair_swap() reported by
syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.  Fixing this without changing sch_api.c turned out to
be difficult (please refer to Closes: for discussions).  Instead, make
sure @new's first call to mini_qdisc_pair_swap() always happens after
@old's last call (in qdisc_destroy()) has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
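
The intended semantics of the two refcount helpers, and of the -EAGAIN
replay decision, can be sketched in userspace with C11 atomics.  This is a
model under stated assumptions, not the patch's code: struct qdisc_model and
the model_*() functions are hypothetical, the TCQ_F_BUILTIN check is
omitted, and the kernel uses refcount_t rather than atomic_int.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the refcount part of struct Qdisc. */
struct qdisc_model {
	atomic_int refcnt;
};

/* Take a reference unless the count already dropped to zero. */
static bool model_refcount_inc_nz(struct qdisc_model *q)
{
	int old = atomic_load(&q->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&q->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop the last reference only if the caller is the sole holder. */
static bool model_refcount_dec_if_one(struct qdisc_model *q)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&q->refcnt, &expected, 0);
}

/*
 * Models the decision described above: if anyone else still holds a
 * reference to the old Qdisc, ask the caller to replay the request.
 */
static int model_graft(struct qdisc_model *old)
{
	if (!model_refcount_dec_if_one(old))
		return -EAGAIN;

	/* The real code would destroy @old here, then graft @new. */
	return 0;
}
```

In this model, a filter request that took a reference with
model_refcount_inc_nz() makes model_graft() return -EAGAIN until that
reference is dropped, after which the graft proceeds and later
model_refcount_inc_nz() calls on the old Qdisc fail.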

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdiscs a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdiscs b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().
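
The outcome described here can be replayed sequentially in a small
userspace model.  The sketch below is hypothetical throughout (struct miniq,
struct pair, swap_model() and ingress_sees_filter() are illustrative names),
it ignores RCU, and it only reproduces the ordering of the three swaps from
the diagram, not real concurrency.

```c
#include <stddef.h>

struct miniq {
	const char *filter;		/* name of the attached filter */
};

struct pair {
	struct miniq m1, m2;
	struct miniq **p_miniq;		/* shared per-device pointer */
};

/* Simplified swap: clear on no filter, else publish the inactive miniq. */
static void swap_model(struct pair *p, const char *filter)
{
	struct miniq *active = *p->p_miniq;
	struct miniq *next;

	if (!filter) {
		*p->p_miniq = NULL;
		return;
	}
	next = (active != &p->m1) ? &p->m1 : &p->m2;
	next->filter = filter;
	*p->p_miniq = next;
}

/*
 * Returns 1 if the ingress path would still find a filter afterwards.
 * destroy_old_first == 0 replays the buggy order from the diagram;
 * destroy_old_first == 1 replays the order this patch enforces.
 */
static int ingress_sees_filter(int destroy_old_first)
{
	struct miniq *miniq_ingress = NULL;	/* models eth0->miniq_ingress */
	struct pair a = { .p_miniq = &miniq_ingress };
	struct pair b = { .p_miniq = &miniq_ingress };

	swap_model(&a, "X");			/* 1st: filter X added to A */
	if (destroy_old_first) {
		swap_model(&a, NULL);		/* A destroyed before B is used */
		swap_model(&b, "Y");
	} else {
		swap_model(&b, "Y");		/* 2nd: filter Y added to B */
		swap_model(&a, NULL);		/* 3rd: A's destroy clears pointer */
	}
	return miniq_ingress != NULL;
}
```

With the buggy order the shared pointer ends up NULL even though B has a
filter; with A's teardown finished first, B's mini Qdisc stays published.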

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 45 insertions(+), 9 deletions(-)

Message ID	8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	From: Peilin Ye <yepeilin.cs@gmail.com>
	X-Google-Original-From: Peilin Ye <peilin.ye@bytedance.com>
	To: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, Jiri Pirko <jiri@resnulli.us>
	Cc: Peilin Ye <peilin.ye@bytedance.com>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com>, Vlad Buslov <vladbu@mellanox.com>, Pedro Tammela <pctammela@mojatatu.com>, Hillf Danton <hdanton@sina.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Cong Wang <cong.wang@bytedance.com>, Peilin Ye <yepeilin.cs@gmail.com>
	Subject: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
	Date: Mon, 22 May 2023 16:55:36 -0700
	Message-Id: <8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com>
	In-Reply-To: <cover.1684796705.git.peilin.ye@bytedance.com>
	References: <cover.1684796705.git.peilin.ye@bytedance.com>
	X-Mailing-List: netdev@vger.kernel.org
	X-Patchwork-Delegate: kuba@kernel.org
Series	net/sched: Fixes for sch_ingress and sch_clsact
	[v2,net,0/6] net/sched: Fixes for sch_ingress and sch_clsact
	[v2,net,1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
	[v2,net,2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
	[v2,net,3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
	[v2,net,4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs
	[v2,net,5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
	[v2,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Context	Check	Description
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for net
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1391 this patch: 1391
netdev/cc_maintainers	success	CCed 9 of 9 maintainers
netdev/build_clang	success	Errors and warnings before: 149 this patch: 149
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1412 this patch: 1412
netdev/checkpatch	warning	WARNING: From:/Signed-off-by: email address mismatch: 'From: Peilin Ye <yepeilin.cs@gmail.com>' != 'Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>' WARNING: line length of 82 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
