Message ID | 20180615182342.6239-1-lszhu@suse.com (mailing list archive) |
---|---|
State | New, archived |
On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
> These commits and the following intend to implement Persistent
> Reservation operations for TCMU devices.

Err, hell no.

If you are that tightly integrated with the target code that you can
implement persistent reservation you need to use kernel code.
Everything else just creates a way too complex interface.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Christoph,

Thanks for your comment. Actually, we already have pure kernel code that can
handle PGR for a single target hosting a TCMU device; it is commit
4ec5bf0ea83930b96addf6b78225bf0355459d7f. But its commit message notes that
it does not handle multiple-target use cases. IMHO, users may set up
multiple target servers hosting the same TCMU device to avoid a single-point
performance bottleneck. For example: if they have two target servers (let's
call them Target A and Target B) hosting the same Ceph RBD device, all PR
requests against this RBD device must get consistent responses. If
Initiator A registered a key via Target A, Initiator B must be able to see
it via Target B. If Initiator A reserved the device via Target A, then when
Initiator B tries to reserve the same RBD device, it must get a
RESERVATION_CONFLICT.

        User A              User B
          |                   |
     Initiator A         Initiator B
          |                   |
       Target A            Target B
           \                 /
            \               /
       The same TCMU device as a LUN

I have tried pure kernel code before; it requires a communication mechanism
between the target servers' kernels. Being able to send messages is not
enough: they must be able to synchronize information automatically, because
when a PR request comes in we cannot query every target server and then
judge whose PR information is newer; there are more problems like network
delay, which make it even more puzzling.

Then a DLM solution came to my mind, and Bart also kindly offered his SCST
solution (thanks, Bart!). The reasons why I did not use DLM are:
(1) If we use DLM, we need corosync and pacemaker, a whole HA stack. That is
a little overkill; users may set up multiple targets just to avoid a
single-point performance bottleneck.
(2) Users may set up a target server on an OSD server. If we use DLM, this
means two clusters controlling the same nodes (Ceph itself is a cluster).
This may lead to conflicts; for example, our HA cluster may want to fence a
node that is actually working well for Ceph.
So this solution came to my mind: we use the TCMU device (like RBD) itself
as the mutual, single point that helps respond to PR requests. Yes, the code
is a bit complex, but the logic is easy: just exchange information with
tcmu-runner via netlink, then tcmu-runner handles reading/writing the
metadata.

Thanks a lot for your help!

Thanks,
BR
Zhu Lingshan

On 2018/6/16 13:22, Christoph Hellwig wrote:
> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>> These commits and the following intend to implement Persistent
>> Reservation operations for TCMU devices.
> Err, hell no.
>
> If you are that tightly integrated with the target code that you can
> implement persistent reservation you need to use kernel code.
> Everything else just creates a way too complex interface.
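The two-target scenario above boils down to every target consulting one shared reservation state before answering a RESERVE. A minimal sketch of that decision, with purely hypothetical names (none of these structs or functions are from the patch set):

```c
#include <assert.h>
#include <string.h>

struct pr_state {
    unsigned long long rsv_key;  /* reservation key; 0 = no reservation held */
    char holder[64];             /* I_T nexus that holds the reservation */
};

enum pr_status { PR_GOOD = 0, PR_RESERVATION_CONFLICT = 1 };

/* RESERVE succeeds if no one holds the reservation yet, or if the
 * requester is already the holder; any other initiator conflicts. */
static enum pr_status pr_try_reserve(struct pr_state *s,
                                     unsigned long long key,
                                     const char *nexus)
{
    if (s->rsv_key == 0) {
        /* no holder yet: take the reservation */
        s->rsv_key = key;
        strncpy(s->holder, nexus, sizeof(s->holder) - 1);
        s->holder[sizeof(s->holder) - 1] = '\0';
        return PR_GOOD;
    }
    /* a repeated RESERVE by the current holder is allowed */
    if (s->rsv_key == key && strcmp(s->holder, nexus) == 0)
        return PR_GOOD;
    return PR_RESERVATION_CONFLICT;
}
```

Because both targets would read and update the same `pr_state` (stored on the device itself), Initiator B's RESERVE via Target B sees Initiator A's reservation and conflicts, exactly as required.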
Hi Zhu,

These are my comments for the patch set:

1) patches 18-24, 28, 30, 32 (+ the yet to be implemented pr functions):
instead of bypassing target_core_pr.c and re-writing all the existing pr
logic again, can we use existing code but just have get/set functions to
read/write the pr data from/to the backend. Example of
target_scsi3_emulate_pr_out:

sense_reason_t target_scsi3_emulate_pr_out(struct se_cmd *cmd)
{
        get pr data callback /* new code */
        switch (sa) {
        case PRO_REGISTER:
                ret = core_scsi3_emulate_pro_register(..);
                break;
        case PRO_RESERVE:
                ret = core_scsi3_emulate_pro_reserve(..);
        ..........
        }
        set pr data callback /* new code */
}

This is over-simplified, as there could be extra locking, but that should be
relatively minor. target_core_pr.c can have a default implementation to
get/set from a static variable. Backends can override this to support
clustered distribution, but they would not need to re-implement pr logic.

2) patch 4: existing set pr function

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)

better to use char * than void *, since it is a null-terminated string
rather than a buffer with a length.

3) patch 4: existing set pr function

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)

better is

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, char *val_old,
                                char *val_new)

so in addition to the new pr we also include the old pr. This will allow
backends that support atomic operations, like Ceph, to implement this via
an atomic compare-and-set. This is the method used in the existing SUSE
target_core_rbd and is more robust in a clustered environment.

4) patch 20, tcmu_execute_pr_register_existing: the update key should also
update any reservation by the old key (if any) in addition to the
registration. Point 1) would avoid this.

5) patch 24, tcmu_execute_pr_read_keys: the add_len should reflect the
total needed length rather than the truncated length. This was fixed
recently in target_core_pr.c by David. Again, point 1) would avoid this.
6) patch 15:

#define TCMU_PR_INFO_XATTR_MAX_SIZE 8192

        buf_remain = TCMU_PR_INFO_XATTR_ENCODED_MAXLEN(pr_info->num_regs);
        if (buf_remain > TCMU_PR_INFO_XATTR_MAX_SIZE) {
                pr_err("PR info too large for encoding: %zd\n", buf_remain);
                return -EINVAL;
        }

should probably increase TCMU_PR_INFO_XATTR_MAX_SIZE; 8k limits
registrations to 16 nexuses.

7) Minor observation: some patches could be combined; for example, patches
5-17, which deal with string decode/encode, could be made into fewer
patches.

Maged
Hello Maged,

Thanks for your comment and help, please give me some time to process this.

Thanks,
BR
Zhu Lingshan

On 2018/6/16 19:04, Maged Mokhtar wrote:
> Hi Zhu,
>
> These are my comments for the patch set:
>
> 1) patches 18-24, 28, 30,32 (+ the yet to be implemented pr functions) :
> instead of bypassing target_core_pr.c and re-writing all the existing pr
> logic again, can we use existing code but just have get/set functions to
> read/write the pr data from/to the backend. Example of
> target_scsi3_emulate_pr_out:
>
> sense_reason_t target_scsi3_emulate_pr_out(struct se_cmd *cmd) {
>         get pr data callback /* new code */
>         switch (sa) {
>         case PRO_REGISTER:
>                 ret = core_scsi3_emulate_pro_register(..)
>                 break;
>         case PRO_RESERVE:
>                 ret = core_scsi3_emulate_pro_reserve(..);
>         ..........
>         }
>         set pr data callback /* new code */
> }
>
> This is over-simplified as there could be extra locking but should be
> relatively minor. target_core_pr.c can have default implementation to
> get/set from static variable. Backends can over-ride this to support
> clustered distribution but they would not need to re-implement pr logic.
>
> 2) patch 4: existing set pr function
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)
> better use char* than void* since it is a null terminated string rather
> than a buffer with length
>
> 3) patch 4: existing set pr function
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)
> better is
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, char *val_old, char
> *val_new)
> so in addition to the new pr we also include the old pr, this will allow
> backends that support atomic operations like ceph to implement this via an
> atomic compare and set. This is the method used in existing SUSE
> target_core_rbd and is more robust in a clustered environment.
>
> 4) patch 20 tcmu_execute_pr_register_existing: the update key should also
> update any reservation by the old key (if any) in addition to the
> registration. Point 1) would avoid this.
>
> 5) patch 24 tcmu_execute_pr_read_keys: the add_len should reflect the
> total needed length rather than the truncated length. This was fixed
> recently in target_core_pr.c by David. Again Point 1) would avoid this.
>
> 6) patch 15:
> #define TCMU_PR_INFO_XATTR_MAX_SIZE 8192
> buf_remain = TCMU_PR_INFO_XATTR_ENCODED_MAXLEN(pr_info->num_regs);
> if (buf_remain > TCMU_PR_INFO_XATTR_MAX_SIZE) {
> pr_err("PR info too large for encoding: %zd\n", buf_remain);
> return -EINVAL;
> }
> should probably increase TCMU_PR_INFO_XATTR_MAX_SIZE, 8k limits
> registrations to 16 nexus
>
> 7) Minor observation, some patches could be combined, for example patches
> 5-17 deal with string decode/encode could be made in fewer patches.
>
> Maged
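Point 1) of the quoted review (wrap the existing PR logic with backend get/set callbacks instead of re-implementing it) can be sketched as follows. All names here are illustrative stand-ins for the kernel code, and the default "backend" is just a static variable, as the review suggests:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct pr_backend {
    int (*get_pr)(char *buf, size_t len);  /* read PR data from backend */
    int (*set_pr)(const char *buf);        /* write PR data to backend */
};

static char fake_store[64] = "key=0";      /* default backing store */

static int default_get(char *buf, size_t len)
{
    snprintf(buf, len, "%s", fake_store);
    return 0;
}

static int default_set(const char *buf)
{
    snprintf(fake_store, sizeof(fake_store), "%s", buf);
    return 0;
}

enum pro_sa { PRO_REGISTER, PRO_RESERVE };

/* The emulate function stays generic: fetch state, run the existing
 * service-action logic, then persist state through the callback. */
static int emulate_pr_out(struct pr_backend *be, enum pro_sa sa)
{
    char state[64];

    if (be->get_pr(state, sizeof(state)))
        return -1;
    switch (sa) {
    case PRO_REGISTER:
        /* existing core_scsi3_emulate_pro_register() logic would run here;
         * the hard-coded key below is purely for illustration */
        snprintf(state, sizeof(state), "key=0xabc");
        break;
    case PRO_RESERVE:
        /* existing core_scsi3_emulate_pro_reserve() logic would run here */
        break;
    }
    return be->set_pr(state);
}
```

A clustered backend would only replace `default_get`/`default_set`; the service-action dispatch and PR semantics stay in one place.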
Adding Bodo, who is working on an alternative approach.

On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>> These commits and the following intend to implement Persistent
>> Reservation operations for TCMU devices.
>
> Err, hell no.
>
> If you are that tightly integrated with the target code that you can
> implement persistent reservation you need to use kernel code.
> Everything else just creates a way too complex interface.

Hey Christoph,

Just wanted to make sure I understood this comment. In Lingshan's patches I
think he was going to end up calling out to userspace/tcmu-runner, and
there he was going to make Ceph calls which basically translate PGR
operations to Ceph requests. Are you saying we should just add some kernel
module that makes the Ceph calls? This would then avoid the mess of having
the split PGR processing design in this patchset?

Also, just to make it a little more fun :) there is another person working
on a completely different design.

Bodo's design is for tcmu only and allows userspace to just handle
everything. PGR commands and related checks for conflicts are all handled
in the userspace daemon that is receiving commands from the
target_core_user kernel module.

The one hiccup is that this design requires a change where the I_T Nexus
info is passed to userspace, so it can handle certain PGR registration
commands correctly and also, for each command, check if a nexus has been
registered and what reservation there is for it. We could:

1. Just pass that info to userspace in each tcmu request.

2. Do something like the old kernel tgt code did, where when an I_T nexus
was established in the kernel it sent an event to userspace. Each SCSI
command passed to userspace had some tag that allowed userspace to match
the tag with the I_T Nexus.
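Option 2 above (announce each I_T nexus once and let later commands carry only a small tag) amounts to userspace keeping a tag-to-nexus table. A minimal sketch, with all names being hypothetical rather than from any of the designs discussed:

```c
#include <assert.h>
#include <string.h>

#define MAX_NEXUS 64

struct nexus_entry {
    int used;
    unsigned int tag;       /* tag the kernel puts in each tcmu request */
    char nexus[64];         /* I_T nexus string announced at login time */
};

static struct nexus_entry nexus_table[MAX_NEXUS];

/* Called when the kernel announces a newly established I_T nexus. */
static int nexus_register(unsigned int tag, const char *nexus)
{
    for (int i = 0; i < MAX_NEXUS; i++) {
        if (!nexus_table[i].used) {
            nexus_table[i].used = 1;
            nexus_table[i].tag = tag;
            strncpy(nexus_table[i].nexus, nexus,
                    sizeof(nexus_table[i].nexus) - 1);
            return 0;
        }
    }
    return -1; /* table full */
}

/* Called per SCSI command to resolve the tag back to its nexus. */
static const char *nexus_lookup(unsigned int tag)
{
    for (int i = 0; i < MAX_NEXUS; i++)
        if (nexus_table[i].used && nexus_table[i].tag == tag)
            return nexus_table[i].nexus;
    return NULL; /* command from an unknown/unregistered nexus */
}
```

The trade-off versus option 1 is per-command payload size against the need for a reliable one-time registration event and table eviction when a nexus goes away.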
On 06/16/2018 02:20 PM, Mike Christie wrote:
> Adding Bodo who is working on a alternative approach.
>
> On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
>> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>>> These commits and the following intend to implement Persistent
>>> Reservation operations for TCMU devices.
>>
>> Err, hell no.
>>
>> If you are that tightly integrated with the target code that you can
>> implement persistent reservation you need to use kernel code.
>> Everything else just creates a way too complex interface.
>
> Hey Christoph,
>
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Oh yeah, I meant to also ask: if I understood you correctly above, then did
you also just want us to add the target_core_rbd module that did
READ/WRITE/VAAI commands too?
On 2018/6/17 3:20, Mike Christie wrote:
> Adding Bodo who is working on a alternative approach.
>
> On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
>> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>>> These commits and the following intend to implement Persistent
>>> Reservation operations for TCMU devices.
>> Err, hell no.
>>
>> If you are that tightly integrated with the target code that you can
>> implement persistent reservation you need to use kernel code.
>> Everything else just creates a way too complex interface.
> Hey Christoph,
>
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Hello Mike and Christoph,

Thanks, Mike's comment inspired me. If I understand this correctly, it is
suggested to implement this whole solution in the kernel, avoiding
splitting PGR handling between kernel and userspace, to make it less
complex. I can try to implement this by sending OSD requests from the
kernel side; then we don't need tcmu-runner to support it, and there is no
need to use netlink to exchange information with userspace. Pure kernel
code seems much simpler. Do you think this may be better?

Thanks,
BR
Zhu Lingshan

> Also just to make it a little more fun :) There is another person
> working on a completely different design.
>
> Bodo's design is for tcmu only and allows userspace to just handle
> everything. PGR commands and related checks for conflicts are all
> handled in the userspace daemon that is receiving commands from the
> target_core_user kernel module.
>
> The one hiccup is this design requires a change where the I_T Nexus info
> is passed to userspace so it can handle certain PGR registration
> commands correction and also for each command check if a nexus has been
> registered and what reservation there is for it. We could:
>
> 1. Just pass that info to userspace in each tcmu request.
>
> 2. Do something like the old kernel tgt code did where when a I_T nexus
> was established in the kernel it sent an event to userspace. Each SCSI
> command passed to userspace had some tag that allowed userspace to match
> the tag with the I_T Nexus.
On Sat, Jun 16, 2018 at 02:20:12PM -0500, Mike Christie wrote:
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Yes.
On Sat, Jun 16, 2018 at 02:25:47PM -0500, Mike Christie wrote:
>> Just wanted to make sure I understood this comment. In Lingshan's
>> patches I think he was going to end up calling out to
>> userspace/tcmu-runner and there he was going to make ceph calls which
>> basically translate PGR operations to ceph requests. Are you saying we
>> should just add some kernel module that makes the ceph calls? This would
>> then avoid the mess of having the split PGR processing design in this
>> patchset?
>
> Oh yeah, I meant to also ask if I understood you correctly above, then
> did you also just want us to add the target_core_rbd module that did
> READ/WRITE/VAAI commands too?

In general I'd prefer to have generic APIs for it so that we could also
front an NVMe target with the same ceph backend. But we can at least get
that work started this way, and I'll look into helping with a proper API.
On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
> Hello Mike and Christoph,
> Thanks Mike's comment inspired me, if I understand this correctly, it is
> suggested to implement this whole solution in kernel, avoid
> splitting PRG handling in both kernel and userspace, make it not that
> complex. I can try to implement this by sending OSD requests
> from kernel side, then we don't need tcmu-runner supporting is, no need to
> use netlink to exchange information with userspace, pure kernel code, seems
> much more simple. Do you think this may be better?

Yes. And SuSE actually ships such a backend (originally written by Mike),
although it would still need some work to get into shape.
On 2018/6/18 19:12, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
>> Hello Mike and Christoph,
>> Thanks Mike's comment inspired me, if I understand this correctly, it is
>> suggested to implement this whole solution in kernel, avoid
>> splitting PRG handling in both kernel and userspace, make it not that
>> complex. I can try to implement this by sending OSD requests
>> from kernel side, then we don't need tcmu-runner supporting is, no need to
>> use netlink to exchange information with userspace, pure kernel code, seems
>> much more simple. Do you think this may be better?
> Yes. And SuSE actually ships such a backend (originally writen by
> Mike), although it would still need some work to get into shape.

Thanks, I will work on that.

Thanks,
BR
Zhu Lingshan
On Mon, 18 Jun 2018 04:12:17 -0700, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
>> Hello Mike and Christoph,
>> Thanks Mike's comment inspired me, if I understand this correctly, it is
>> suggested to implement this whole solution in kernel, avoid
>> splitting PRG handling in both kernel and userspace, make it not that
>> complex. I can try to implement this by sending OSD requests
>> from kernel side, then we don't need tcmu-runner supporting is, no need to
>> use netlink to exchange information with userspace, pure kernel code, seems
>> much more simple. Do you think this may be better?
>
> Yes. And SuSE actually ships such a backend (originally writen by
> Mike), although it would still need some work to get into shape.

FWIW, the kernel backend is also shipped by PetaSAN. I'd be happy to
restart efforts with Mike to get this into shape for mainline. The current
target_core_rbd PR implementation is suboptimal in that it duplicates
quite a bit of protocol logic from target_core_pr.

Cheers, David
On Mon, Jun 18, 2018 at 01:41:48PM +0200, David Disseldorp wrote:
> On Mon, 18 Jun 2018 04:12:17 -0700, Christoph Hellwig wrote:
>
>> Yes. And SuSE actually ships such a backend (originally writen by
>> Mike), although it would still need some work to get into shape.
>
> FWIW, the kernel backend is also shipped by petasan. I'd be happy to
> restart efforts with Mike to get this in shape for mainline.

That would be great!
diff --git a/include/uapi/linux/target_core_user.h b/include/uapi/linux/target_core_user.h
index 0be80f72646b..2d5c3e55d3f8 100644
--- a/include/uapi/linux/target_core_user.h
+++ b/include/uapi/linux/target_core_user.h
@@ -132,9 +132,13 @@ enum tcmu_genl_cmd {
 	TCMU_CMD_ADDED_DEVICE,
 	TCMU_CMD_REMOVED_DEVICE,
 	TCMU_CMD_RECONFIG_DEVICE,
+	TCMU_CMD_GET_PR_INFO,
+	TCMU_CMD_SET_PR_INFO,
 	TCMU_CMD_ADDED_DEVICE_DONE,
 	TCMU_CMD_REMOVED_DEVICE_DONE,
 	TCMU_CMD_RECONFIG_DEVICE_DONE,
+	TCMU_CMD_GET_PR_INFO_DONE,
+	TCMU_CMD_SET_PR_INFO_DONE,
 	TCMU_CMD_SET_FEATURES,
 	__TCMU_CMD_MAX,
 };
@@ -151,8 +155,23 @@ enum tcmu_genl_attr {
 	TCMU_ATTR_CMD_STATUS,
 	TCMU_ATTR_DEVICE_ID,
 	TCMU_ATTR_SUPP_KERN_CMD_REPLY,
+	TCMU_ATTR_PR_INFO,
 	__TCMU_ATTR_MAX,
 };
 
 #define TCMU_ATTR_MAX (__TCMU_ATTR_MAX - 1)
+/* This struct helps store the Persistent Reservation which we
+ * are handling; it is encoded from or decoded to the string buffer in
+ * "struct tcmu_dev_pr_info".
+ */
+struct tcmu_pr_info {
+	u32 vers;		/* on disk format version number */
+	u32 seq;		/* sequence number bumped every xattr write */
+	struct tcmu_scsi2_rsv *scsi2_rsv;	/* SCSI2 reservation if any */
+	u32 gen;		/* PR generation number */
+	struct tcmu_pr_rsv *rsv;	/* SCSI3 reservation if any */
+	u32 num_regs;		/* number of registrations */
+	struct list_head regs;	/* list of registrations */
+};
+
 #endif
These commits and the following intend to implement Persistent Reservation
operations for TCMU devices. This series of commits implements these PR
operations: PR_Out_Register, PR_Out_Reserve, PR_Out_Clear, PR_Out_Preempt,
PR_Out_Release and PR_In_ReadKeys. The next wave of patches will contain
the other PR operations.

This patch adds a struct tcmu_pr_info to store PR information for the
handling functions, and adds command codes and attrs for the netlink
interfaces.

Design note:
In order to get consistent Persistent Reservation results from multiple
targets hosting the same TCMU device (like Ceph RBD), this solution stores
a string on the device itself (like RBD metadata). Every time the kernel
receives a PR request against a TCMU device, it will query this string (a
netlink attr carried by a netlink cmd), then decide whether the PR request
should be performed; after processing, it will update this string.

For example: when receiving a PR Reserve request, the kernel will send a
netlink message to tcmu-runner to get the string, and tcmu-runner will
respond by sending the PR info string to the kernel. The kernel will then
decode the string, find information like the key and the reservation
holder, and process the request. After processing, it will update the
string and send the updated string to tcmu-runner, so that tcmu-runner can
write it back to the device (like RBD metadata). So we make the device
itself the "single" response point; with lock protection, we will get a
consistent result even when more than one initiator sends multiple PR
requests via multiple targets.

Signed-off-by: Zhu Lingshan <lszhu@suse.com>
---
 include/uapi/linux/target_core_user.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
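The round trip described in the design note (fetch the PR info string, decode it, update it, bump the sequence number, write it back) can be sketched for just the header fields of struct tcmu_pr_info. The "vers=..,seq=..,gen=.." text format below is purely illustrative; the real encoding lives in the rest of the patch set:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct pr_hdr {
    unsigned int vers; /* on-disk format version */
    unsigned int seq;  /* bumped on every write */
    unsigned int gen;  /* PR generation number */
};

/* Decode the flat string received over netlink into header fields. */
static int pr_hdr_decode(const char *buf, struct pr_hdr *h)
{
    if (sscanf(buf, "vers=%u,seq=%u,gen=%u",
               &h->vers, &h->seq, &h->gen) != 3)
        return -1;
    return 0;
}

/* Re-encode the header for the write-back to tcmu-runner. */
static int pr_hdr_encode(char *buf, size_t len, const struct pr_hdr *h)
{
    int n = snprintf(buf, len, "vers=%u,seq=%u,gen=%u",
                     h->vers, h->seq, h->gen);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A PR Out request would decode, apply the service action, increment `seq` (and `gen` where the PR generation changes), then encode and hand the string back for persistence on the device.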