[v1,0/2] Fix failover to non integrity NVMe path

Message ID 20230423141330.40437-1-mgurtovoy@nvidia.com (mailing list archive)

Message

Max Gurtovoy April 23, 2023, 2:13 p.m. UTC
Hi Christoph/Sagi/Martin,

We encountered a crash while testing failover between NVMe-oF/RDMA
paths to a target that exposes a namespace with metadata. The scenario is
the following:
configure one initiator/host path on a PI offload capable port (e.g. a ConnectX-5
device) and a second initiator/host path on a non PI offload capable
port (e.g. a ConnectX-3).

In case of failover, the original rq/bio is still protected with an integrity
context, but the failover port is not integrity capable and did not allocate
the metadata resources for the request. Thus we get a NULL dereference when
blk_integrity_rq(rq) returns true while req->metadata_sgl is NULL.
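
For illustration, the failing condition has roughly this shape (a simplified
sketch, not the literal nvme_rdma mapping code; the field names follow the
nvme_rdma request layout):

        /*
         * Simplified sketch only: the requeued bio still carries its
         * integrity payload, so blk_integrity_rq() returns true, but the
         * non-PI queue never allocated req->metadata_sgl for this request.
         */
        if (blk_integrity_rq(rq)) {
                /* NULL dereference: metadata_sgl is NULL on the non-PI path */
                sg_init_table(req->metadata_sgl->sg_table.sgl,
                              blk_rq_count_integrity_sg(rq->q, rq->bio));
        }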

Below is a snippet of the trace:

[Tue Feb 14 18:48:25 2023] mlx5_core 0000:02:00.0 ens2f0np0: Link down
[Tue Feb 14 18:48:32 2023] nvme nvme0: starting error recovery
[Tue Feb 14 18:48:32 2023] BUG: kernel NULL pointer dereference, address: 0000000000000010
[Tue Feb 14 18:48:32 2023] #PF: supervisor read access in kernel mode
[Tue Feb 14 18:48:32 2023] #PF: error_code(0x0000) - not-present page
[Tue Feb 14 18:48:32 2023] PGD 0 P4D 0 
[Tue Feb 14 18:48:32 2023] Oops: 0000 [#1] PREEMPT SMP PTI
[Tue Feb 14 18:48:32 2023] CPU: 17 PID: 518 Comm: kworker/17:1H Tainted: G S          E      6.2.0-rc4+ #224
[Tue Feb 14 18:48:32 2023] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 2.0 12/17/2015
[Tue Feb 14 18:48:32 2023] Workqueue: kblockd nvme_requeue_work [nvme_core]
...
...

[Tue Feb 14 18:48:32 2023] Call Trace:
[Tue Feb 14 18:48:32 2023]  <TASK>
[Tue Feb 14 18:48:32 2023]  nvme_rdma_queue_rq+0x194/0xa20 [nvme_rdma]
[Tue Feb 14 18:48:32 2023]  ? __blk_mq_try_issue_directly+0x13f/0x1a0
[Tue Feb 14 18:48:32 2023]  __blk_mq_try_issue_directly+0x13f/0x1a0
[Tue Feb 14 18:48:32 2023]  blk_mq_try_issue_directly+0x15/0x50
[Tue Feb 14 18:48:32 2023]  blk_mq_submit_bio+0x539/0x580
[Tue Feb 14 18:48:32 2023]  __submit_bio+0xfa/0x170
[Tue Feb 14 18:48:32 2023]  submit_bio_noacct_nocheck+0xe1/0x2a0
[Tue Feb 14 18:48:32 2023]  nvme_requeue_work+0x4e/0x60 [nvme_core]

To solve this, we expose an API to release the integrity context from a
bio and call it for each bio during failover.
Another way to solve it would be to free the context during
bio_integrity_prep().

I chose the first option because I think it is better to avoid this
check in the fast path, but the price is exporting a new API from the block
layer.
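
As a rough sketch of the direction (not the exact diff), patch 2/2 would do
something like this for the bios being moved to the requeue list, relying on
the bio_integrity_free() export from patch 1/2:

        /*
         * Rough sketch only: before requeueing bios on a path that may not
         * be integrity capable, drop the integrity context that was
         * prepared for the failed path.  bio_integrity() and
         * bio_integrity_free() are existing block layer helpers; the latter
         * is what patch 1/2 exports.
         */
        for (bio = req->bio; bio; bio = bio->bi_next) {
                if (bio_integrity(bio))
                        bio_integrity_free(bio);
        }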

Max Gurtovoy (2):
  block: bio-integrity: export bio_integrity_free func
  nvme-multipath: fix path failover for integrity ns

 block/bio-integrity.c         | 1 +
 drivers/nvme/host/multipath.c | 9 +++++++++
 include/linux/bio.h           | 5 +++++
 3 files changed, 15 insertions(+)

Comments

Christoph Hellwig April 24, 2023, 5:11 a.m. UTC | #1
On Sun, Apr 23, 2023 at 05:13:28PM +0300, Max Gurtovoy wrote:
> Hi Christoph/Sagi/Martin,
> 
> We encountered a crash while testing failover between NVMe-oF/RDMA
> paths to a target that exposes a namespace with metadata. The scenario is
> the following:
> configure one initiator/host path on a PI offload capable port (e.g. a ConnectX-5
> device) and a second initiator/host path on a non PI offload capable
> port (e.g. a ConnectX-3).

Hmm.  I suspect the right thing to do here is to just fail the second
connect.
Hannes Reinecke April 24, 2023, 6:10 a.m. UTC | #2
On 4/24/23 07:11, Christoph Hellwig wrote:
> On Sun, Apr 23, 2023 at 05:13:28PM +0300, Max Gurtovoy wrote:
>> Hi Christoph/Sagi/Martin,
>>
>> We encountered a crash while testing failover between NVMe-oF/RDMA
>> paths to a target that exposes a namespace with metadata. The scenario is
>> the following:
>> configure one initiator/host path on a PI offload capable port (e.g. a ConnectX-5
>> device) and a second initiator/host path on a non PI offload capable
>> port (e.g. a ConnectX-3).
> 
> Hmm.  I suspect the right thing to do here is to just fail the second
> connect.

Yeah, I'm slightly unhappy with this whole setup.
If we were just doing DIF I guess the setup could work, but then we have 
to disable DIX (as we cannot support integrity data on the non-PI path).
But we would need an additional patch to disable DIX functionality in 
those cases.

Cheers,

Hannes
Christoph Hellwig April 24, 2023, 6:20 a.m. UTC | #3
On Mon, Apr 24, 2023 at 08:10:59AM +0200, Hannes Reinecke wrote:
> Yeah, I'm slightly unhappy with this whole setup.
> If we were just doing DIF I guess the setup could work, but then we have to 
> disable DIX (as we cannot support integrity data on the non-PI path).
> But we would need an additional patch to disable DIX functionality in those 
> cases.

NVMeoF only supports inline integrity data, the remapping from out of
line integrity data is always done by the HCA for NVMe over RDMA,
and integrity data is not supported without that.

Because of that I can't see how we could sensibly support one path with
integrity offload and one without.  And yes, it might make sense to
offer a way to explicitly disable integrity support to allow forming such
a multipath setup.
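
For what it's worth, a minimal sketch of such an opt-out could look like the
following; the "disable_integrity" option name is invented here and does not
exist in the fabrics code today:

        /*
         * Hypothetical sketch: if the user asked to disable host-side
         * integrity for this controller, skip registering a block
         * integrity profile for the namespace, so no bio ever carries an
         * integrity payload and mixed PI / non-PI paths stop being a
         * problem for the host.  "disable_integrity" is an invented
         * option, not an existing nvmf parameter.
         */
        static bool nvme_ns_use_host_integrity(struct nvme_ctrl *ctrl)
        {
                return !(ctrl->opts && ctrl->opts->disable_integrity);
        }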
Sagi Grimberg April 24, 2023, 8:53 a.m. UTC | #4
>> Yeah, I'm slightly unhappy with this whole setup.
>> If we were just doing DIF I guess the setup could work, but then we have to
>> disable DIX (as we cannot support integrity data on the non-PI path).
>> But we would need an additional patch to disable DIX functionality in those
>> cases.
> 
> NVMeoF only supports inline integrity data, the remapping from out of
> line integrity data is always done by the HCA for NVMe over RDMA,
> and integrity data is not supported without that.
> 
> Because of that I can't see how we could sensibly support one path with
> integrity offload and one without.  And yes, it might make sense to
> offer a way to explicitly disable integrity support to allow forming such
> a multipath setup.

I agree. I didn't read through the change log well enough, I thought
that one path is DIF and the other is DIX.

I agree that we should not permit such a configuration.
Max Gurtovoy April 24, 2023, 9:17 a.m. UTC | #5
On 24/04/2023 11:53, Sagi Grimberg wrote:
> 
>>> Yeah, I'm slightly unhappy with this whole setup.
>>> If we were just doing DIF I guess the setup could work, but then we 
>>> have to
>>> disable DIX (as we cannot support integrity data on the non-PI path).
>>> But we would need an additional patch to disable DIX functionality in 
>>> those
>>> cases.
>>
>> NVMeoF only supports inline integrity data, the remapping from out of
>> line integrity data is always done by the HCA for NVMe over RDMA,
>> and integrity data is not supported without that.
>>
>> Because of that I can't see how we could sensibly support one path with
>> integrity offload and one without.  And yes, it might make sense to
>> offer a way to explicitly disable integrity support to allow forming such
>> a multipath setup.
> 
> I agree. I didn't read through the change log well enough, I thought
> that one path is DIF and the other is DIX.
> 
> I agree that we should not permit such a configuration.

I'm not yet convinced that we should not permit it.
The spec allows this, and I can think of scenarios where users would want
this kind of configuration.

We also support it today (with the exception of this bug).
There is a special PRACT bit in the spec that asks the controller to
generate/strip the protection information itself on each R/W IO request.

Multipath is not related to metadata handling IMO. One path can
generate/verify the metadata and the other can set the PRACT bit.
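
For reference, this is roughly how the host already uses PRACT when a request
has no integrity payload attached (simplified from the nvme_setup_rw() logic,
not a verbatim copy):

        /*
         * Simplified sketch: if the namespace is formatted with protection
         * information but this request carries no integrity payload, set
         * PRACT so the controller generates the PI on writes and
         * checks/strips it on reads by itself.
         */
        if (ns->ms && !blk_integrity_rq(req))
                control |= NVME_RW_PRINFO_PRACT;
        cmnd->rw.control = cpu_to_le16(control);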

You can also create two paths from different hosts to the same target, one
with a ConnectX-5 and the other with a ConnectX-3. The fact that these paths
are from the same host is not that important IMO.