diff mbox series

[net,v2] s390/qeth: handle deferred cc1

Message ID 20240307093827.2307279-1-wintera@linux.ibm.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] s390/qeth: handle deferred cc1 | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: oberpar@linux.ibm.com; 1 maintainers not CCed: oberpar@linux.ibm.com
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 57 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-07--15-00 (tests: 892)

Commit Message

Alexandra Winter March 7, 2024, 9:38 a.m. UTC
The IO subsystem expects a driver to retry a ccw_device_start, when the
subsequent interrupt response block (irb) contains a deferred
condition code 1.

Symptoms before this commit:
On the read channel we always trigger the next read anyhow, so no
different behaviour here.
On the write channel we may experience timeout errors, because the
expected reply will never be received without the retry.
Other callers of qeth_send_control_data() may wrongly assume that the ccw
was successful, which may cause problems later.

Note that since
commit 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
and
commit 5ef1dc40ffa6 ("s390/cio: fix invalid -EBUSY on ccw_device_start")
deferred CC1s are more likely to occur. See the commit message of the
latter for more background information.

Fixes: 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
Reference-ID: LTC205042
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
v1->v2: correct patch format

 drivers/s390/net/qeth_core_main.c | 36 +++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

Comments

Peter Oberparleiter March 7, 2024, 12:09 p.m. UTC | #1
On 07.03.2024 10:38, Alexandra Winter wrote:
> The IO subsystem expects a driver to retry a ccw_device_start, when the
> subsequent interrupt response block (irb) contains a deferred
> condition code 1.
> 
> Symptoms before this commit:
> On the read channel we always trigger the next read anyhow, so no
> different behaviour here.
> On the write channel we may experience timeout errors, because the
> expected reply will never be received without the retry.
> Other callers of qeth_send_control_data() may wrongly assume that the ccw
> was successful, which may cause problems later.
> 
> Note that since
> commit 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
> and
> commit 5ef1dc40ffa6 ("s390/cio: fix invalid -EBUSY on ccw_device_start")
> deferred CC1s are more likely to occur. See the commit message of the
> latter for more background information.
> 
> Fixes: 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
> Reference-ID: LTC205042
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>

Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Heiko Carstens March 8, 2024, 11:09 a.m. UTC | #2
On Thu, Mar 07, 2024 at 10:38:27AM +0100, Alexandra Winter wrote:
> The IO subsystem expects a driver to retry a ccw_device_start, when the
> subsequent interrupt response block (irb) contains a deferred
> condition code 1.
> 
> Symptoms before this commit:
> On the read channel we always trigger the next read anyhow, so no
> different behaviour here.
> On the write channel we may experience timeout errors, because the
> expected reply will never be received without the retry.
> Other callers of qeth_send_control_data() may wrongly assume that the ccw
> was successful, which may cause problems later.
> 
> Note that since
> commit 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
> and
> commit 5ef1dc40ffa6 ("s390/cio: fix invalid -EBUSY on ccw_device_start")
> deferred CC1s are more likely to occur. See the commit message of the
> latter for more background information.
> 
> Fixes: 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
> Reference-ID: LTC205042
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> ---
> v1->v2: correct patch format
> 
>  drivers/s390/net/qeth_core_main.c | 36 +++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)

This patch seems to introduce a reference count problem, which results
in use-after-free reports like the one below. So, please ignore the
patch - this needs to be sorted out first.

[  553.017253] ------------[ cut here ]------------
[  553.017255] refcount_t: addition on 0; use-after-free.
[  553.017269] WARNING: CPU: 12 PID: 115746 at lib/refcount.c:25 refcount_warn_saturate+0x10e/0x130
[  553.017275] Modules linked in: kvm algif_hash af_alg nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink dm_service_time mlx5_ib ib_uverbs ib_core uvdevice
+s390_trng eadm_sch vfio_ccw mdev vfio_iommu_type1 vfio sch_fq_codel loop configfs lcs ctcm fsm zfcp scsi_transport_fc mlx5_core ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey
+zcrypt rng_core dm_multipath autofs4
[  553.017331] Unloaded tainted modules: test_klp_state3(K):1 test_klp_state2(K):4 test_klp_state(K):3 test_klp_callbacks_demo2(K):2 test_klp_callbacks_demo(K):12 test_klp_atomic_replace(K):2 test_klp_livepatch(K):6 [last unloaded: kvm]
[  553.017405] CPU: 12 PID: 115746 Comm: qeth_recover Tainted: G        W     K    6.8.0-20240305.rc7.git1.570c73d6a3df.300.fc39.s390x #1
[  553.017408] Hardware name: IBM 8561 T01 701 (LPAR)
[  553.017409] Krnl PSW : 0404c00180000000 00000001a53cad02 (refcount_warn_saturate+0x112/0x130)
[  553.017413]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  553.017415] Krnl GPRS: c000000000000027 0000000000000023 000000000000002a 0000000100000000
[  553.017417]            00000004b38e0900 0000000000000000 0000000095dfc2c0 0000000000001000
[  553.017419]            00000001a563eb58 000000009938d818 0000000000001007 000000009938d800
[  553.017421]            0000000095dfac00 0000000000000000 00000001a53cacfe 000003800f24bc10
[  553.017426] Krnl Code: 00000001a53cacf2: c020004f67d8        larl    %r2,00000001a5db7ca2
                          00000001a53cacf8: c0e5ffc262b4        brasl   %r14,00000001a4c17260
                         #00000001a53cacfe: af000000            mc      0,0
                         >00000001a53cad02: a7f4ff9a            brc     15,00000001a53cac36
                          00000001a53cad06: 92014008            mvi     8(%r4),1
                          00000001a53cad0a: c020004f67f6        larl    %r2,00000001a5db7cf6
                          00000001a53cad10: c0e5ffc262a8        brasl   %r14,00000001a4c17260
                          00000001a53cad16: af000000            mc      0,0
[  553.017439] Call Trace:
[  553.017440]  [<00000001a53cad02>] refcount_warn_saturate+0x112/0x130
[  553.017443] ([<00000001a53cacfe>] refcount_warn_saturate+0x10e/0x130)
[  553.017445]  [<00000001a563db88>] __qeth_issue_next_read+0x270/0x288
[  553.017448]  [<00000001a563dc38>] qeth_mpc_initialize.constprop.0+0x98/0x738
[  553.017449]  [<00000001a564007e>] qeth_hardsetup_card+0x36e/0xb38
[  553.017451]  [<00000001a56408d8>] qeth_set_online+0x90/0x3a0
[  553.017453]  [<00000001a5640d04>] qeth_do_reset+0x11c/0x1f8
[  553.017455]  [<00000001a4c47f30>] kthread+0x120/0x128
[  553.017458]  [<00000001a4bc3014>] __ret_from_fork+0x3c/0x58
[  553.017460]  [<00000001a58d717a>] ret_from_fork+0xa/0x30
[  553.017463] Last Breaking-Event-Address:
[  553.017464]  [<00000001a4c172de>] __warn_printk+0x7e/0xf0
[  553.017467] ---[ end trace 0000000000000000 ]---
[  553.017474] qeth 0.0.bd00: The qeth device driver failed to recover an error on the device
[  553.018615] qeth 0.0.bd00: The qeth device driver failed to recover an error on the device
diff mbox series

Patch

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index cf8506d0f185..bc6f3f1f25ba 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -1179,6 +1179,19 @@  static int qeth_check_irb_error(struct qeth_card *card, struct ccw_device *cdev,
 	}
 }
 
+/**
+ *	qeth_irq() - qeth interrupt handler
+ *	@cdev: ccw device
+ *	@intparm: expect pointer to iob
+ *	@irb: Interruption Response Block
+ *
+ *	In the good path:
+ *	corresponding qeth channel is locked with last used iob as active_cmd.
+ *	But this function is also called for error interrupts.
+ *
+ *	Caller ensures that:
+ *	Interrupts are disabled; ccw device lock is held;
+ */
 static void qeth_irq(struct ccw_device *cdev, unsigned long intparm,
 		struct irb *irb)
 {
@@ -1220,11 +1233,10 @@  static void qeth_irq(struct ccw_device *cdev, unsigned long intparm,
 		iob = (struct qeth_cmd_buffer *) (addr_t)intparm;
 	}
 
-	qeth_unlock_channel(card, channel);
-
 	rc = qeth_check_irb_error(card, cdev, irb);
 	if (rc) {
 		/* IO was terminated, free its resources. */
+		qeth_unlock_channel(card, channel);
 		if (iob)
 			qeth_cancel_cmd(iob, rc);
 		return;
@@ -1276,6 +1288,26 @@  static void qeth_irq(struct ccw_device *cdev, unsigned long intparm,
 		}
 	}
 
+	if (scsw_cmd_is_valid_cc(&irb->scsw) && irb->scsw.cmd.cc == 1 && iob) {
+		/* channel command hasn't started: retry.
+		 * active_cmd is still set to last iob
+		 */
+		QETH_CARD_TEXT(card, 2, "irqcc1");
+		rc = ccw_device_start_timeout(cdev, __ccw_from_cmd(iob),
+					      (addr_t)iob, 0, 0, iob->timeout);
+		if (rc) {
+			QETH_DBF_MESSAGE(2,
+					 "ccw retry on %x failed, rc = %i\n",
+					 CARD_DEVID(card), rc);
+			QETH_CARD_TEXT_(card, 2, " err%d", rc);
+			qeth_unlock_channel(card, channel);
+			qeth_cancel_cmd(iob, rc);
+		}
+		return;
+	}
+
+	qeth_unlock_channel(card, channel);
+
 	if (iob) {
 		/* sanity check: */
 		if (irb->scsw.cmd.count > iob->length) {