Message ID | 20180615182342.6239-1-lszhu@suse.com (mailing list archive) |
---|---|
State | New, archived |
On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
> These commits and the following intend to implement Persistent
> Reservation operations for TCMU devices.

Err, hell no.

If you are that tightly integrated with the target code that you can
implement persistent reservation you need to use kernel code.
Everything else just creates a way too complex interface.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Christoph,

Thanks for your comment. Actually, we already have pure kernel code that can
handle PGR for a single target hosting a TCMU device; it is commit
4ec5bf0ea83930b96addf6b78225bf0355459d7f. But its commit message notes that
it does not handle multiple-target use cases. IMHO, users may set up
multiple target servers hosting the same TCMU device to avoid a single-point
performance bottleneck. For example: if they have two target servers (let's
call them Target A and Target B) hosting the same Ceph RBD device, all PR
requests against this RBD device must get consistent responses. If
Initiator A registered a key via Target A, Initiator B must be able to see
it via Target B. If Initiator A reserved the device via Target A, then when
Initiator B tries to reserve the same RBD device, it must get a
RESERVATION_CONFLICT.

        User A              User B
          |                   |
     Initiator A         Initiator B
          |                   |
       Target A            Target B
           \                 /
            \               /
       The same TCMU device as a LUN

I have tried pure kernel code before; it requires a communication mechanism
between the target servers' kernels. Being able to send messages is not
enough: they must be able to synchronize information automatically, because
when a PR request comes in we cannot query every target server and then
judge whose PR information is newer; there are more problems like network
delay, which make it even more puzzling.

Then a DLM solution came to my mind, and Bart also kindly offered his SCST
solution (thanks, Bart!). The reasons why I did not use DLM are:
(1) If we use DLM, we need corosync and pacemaker, a whole HA stack. That is
a little overkill; users may set up multiple targets just to avoid a
single-point performance bottleneck.
(2) Users may set up a target server on an OSD server. If we use DLM, this
means two clusters controlling the same nodes (Ceph itself is a cluster).
This may lead to conflicts; for example, our HA cluster may want to fence a
node that is actually working well for Ceph.
So this solution came to my mind: we use the TCMU device (like RBD) itself
as the mutual, single point that helps respond to PR requests. Yes, the code
is a bit complex, but the logic is easy: just exchange information with
tcmu-runner via netlink, then tcmu-runner handles reading/writing the
metadata.

Thanks a lot for your help!

Thanks,
BR
Zhu Lingshan

On 2018/6/16 13:22, Christoph Hellwig wrote:
> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>> These commits and the following intend to implement Persistent
>> Reservation operations for TCMU devices.
> Err, hell no.
>
> If you are that tightly integrated with the target code that you can
> implement persistent reservation you need to use kernel code.
> Everything else just creates a way too complex interface.
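The two-target scenario above boils down to every target consulting one shared reservation state before answering a RESERVE. A minimal sketch of that decision, with purely hypothetical names (none of these structs or functions are from the patch set):

```c
#include <assert.h>
#include <string.h>

struct pr_state {
    unsigned long long rsv_key;  /* reservation key; 0 = no reservation held */
    char holder[64];             /* I_T nexus that holds the reservation */
};

enum pr_status { PR_GOOD = 0, PR_RESERVATION_CONFLICT = 1 };

/* RESERVE succeeds if no one holds the reservation yet, or if the
 * requester is already the holder; any other initiator conflicts. */
static enum pr_status pr_try_reserve(struct pr_state *s,
                                     unsigned long long key,
                                     const char *nexus)
{
    if (s->rsv_key == 0) {
        /* no holder yet: take the reservation */
        s->rsv_key = key;
        strncpy(s->holder, nexus, sizeof(s->holder) - 1);
        s->holder[sizeof(s->holder) - 1] = '\0';
        return PR_GOOD;
    }
    /* a repeated RESERVE by the current holder is allowed */
    if (s->rsv_key == key && strcmp(s->holder, nexus) == 0)
        return PR_GOOD;
    return PR_RESERVATION_CONFLICT;
}
```

Because both targets would read and update the same `pr_state` (stored on the device itself), Initiator B's RESERVE via Target B sees Initiator A's reservation and conflicts, exactly as required.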
Hi Zhu,

These are my comments for the patch set:

1) patches 18-24, 28, 30, 32 (+ the yet to be implemented pr functions):
instead of bypassing target_core_pr.c and re-writing all the existing pr
logic again, can we use existing code but just have get/set functions to
read/write the pr data from/to the backend. Example of
target_scsi3_emulate_pr_out:

sense_reason_t target_scsi3_emulate_pr_out(struct se_cmd *cmd)
{
        get pr data callback /* new code */
        switch (sa) {
        case PRO_REGISTER:
                ret = core_scsi3_emulate_pro_register(..);
                break;
        case PRO_RESERVE:
                ret = core_scsi3_emulate_pro_reserve(..);
        ..........
        }
        set pr data callback /* new code */
}

This is over-simplified, as there could be extra locking, but that should be
relatively minor. target_core_pr.c can have a default implementation to
get/set from a static variable. Backends can override this to support
clustered distribution, but they would not need to re-implement pr logic.

2) patch 4: existing set pr function

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)

better to use char * than void *, since it is a null-terminated string
rather than a buffer with a length.

3) patch 4: existing set pr function

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)

better is

static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, char *val_old,
                                char *val_new)

so in addition to the new pr we also include the old pr. This will allow
backends that support atomic operations, like Ceph, to implement this via
an atomic compare-and-set. This is the method used in the existing SUSE
target_core_rbd and is more robust in a clustered environment.

4) patch 20, tcmu_execute_pr_register_existing: the update key should also
update any reservation by the old key (if any) in addition to the
registration. Point 1) would avoid this.

5) patch 24, tcmu_execute_pr_read_keys: the add_len should reflect the
total needed length rather than the truncated length. This was fixed
recently in target_core_pr.c by David. Again, point 1) would avoid this.
6) patch 15:

#define TCMU_PR_INFO_XATTR_MAX_SIZE 8192

        buf_remain = TCMU_PR_INFO_XATTR_ENCODED_MAXLEN(pr_info->num_regs);
        if (buf_remain > TCMU_PR_INFO_XATTR_MAX_SIZE) {
                pr_err("PR info too large for encoding: %zd\n", buf_remain);
                return -EINVAL;
        }

should probably increase TCMU_PR_INFO_XATTR_MAX_SIZE; 8k limits
registrations to 16 nexuses.

7) Minor observation: some patches could be combined; for example, patches
5-17, which deal with string decode/encode, could be made into fewer
patches.

Maged
Hello Maged,

Thanks for your comment and help, please give me some time to process this.

Thanks,
BR
Zhu Lingshan

On 2018/6/16 19:04, Maged Mokhtar wrote:
> Hi Zhu,
>
> These are my comments for the patch set:
>
> 1) patches 18-24, 28, 30,32 (+ the yet to be implemented pr functions) :
> instead of bypassing target_core_pr.c and re-writing all the existing pr
> logic again, can we use existing code but just have get/set functions to
> read/write the pr data from/to the backend. Example of
> target_scsi3_emulate_pr_out:
>
> sense_reason_t target_scsi3_emulate_pr_out(struct se_cmd *cmd) {
>         get pr data callback /* new code */
>         switch (sa) {
>         case PRO_REGISTER:
>                 ret = core_scsi3_emulate_pro_register(..)
>                 break;
>         case PRO_RESERVE:
>                 ret = core_scsi3_emulate_pro_reserve(..);
>         ..........
>         }
>         set pr data callback /* new code */
> }
>
> This is over-simplified as there could be extra locking but should be
> relatively minor. target_core_pr.c can have default implementation to
> get/set from static variable. Backends can over-ride this to support
> clustered distribution but they would not need to re-implement pr logic.
>
> 2) patch 4: existing set pr function
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)
> better use char* than void* since it is a null terminated string rather
> than a buffer with length
>
> 3) patch 4: existing set pr function
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, void *val)
> better is
> static int tcmu_set_dev_pr_info(struct tcmu_dev *udev, char *val_old, char
> *val_new)
> so in addition to the new pr we also include the old pr, this will allow
> backends that support atomic operations like ceph to implement this via an
> atomic compare and set. This is the method used in existing SUSE
> target_core_rbd and is more robust in a clustered environment.
>
> 4) patch 20 tcmu_execute_pr_register_existing: the update key should also
> update any reservation by the old key (if any) in addition to the
> registration. Point 1) would avoid this.
>
> 5) patch 24 tcmu_execute_pr_read_keys: the add_len should reflect the
> total needed length rather than the truncated length. This was fixed
> recently in target_core_pr.c by David. Again Point 1) would avoid this.
>
> 6) patch 15:
> #define TCMU_PR_INFO_XATTR_MAX_SIZE 8192
> buf_remain = TCMU_PR_INFO_XATTR_ENCODED_MAXLEN(pr_info->num_regs);
> if (buf_remain > TCMU_PR_INFO_XATTR_MAX_SIZE) {
> pr_err("PR info too large for encoding: %zd\n", buf_remain);
> return -EINVAL;
> }
> should probably increase TCMU_PR_INFO_XATTR_MAX_SIZE, 8k limits
> registrations to 16 nexus
>
> 7) Minor observation, some patches could be combined, for example patches
> 5-17 deal with string decode/encode could be made in fewer patches.
>
> Maged
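Point 1) of the quoted review (wrap the existing PR logic with backend get/set callbacks instead of re-implementing it) can be sketched as follows. All names here are illustrative stand-ins for the kernel code, and the default "backend" is just a static variable, as the review suggests:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct pr_backend {
    int (*get_pr)(char *buf, size_t len);  /* read PR data from backend */
    int (*set_pr)(const char *buf);        /* write PR data to backend */
};

static char fake_store[64] = "key=0";      /* default backing store */

static int default_get(char *buf, size_t len)
{
    snprintf(buf, len, "%s", fake_store);
    return 0;
}

static int default_set(const char *buf)
{
    snprintf(fake_store, sizeof(fake_store), "%s", buf);
    return 0;
}

enum pro_sa { PRO_REGISTER, PRO_RESERVE };

/* The emulate function stays generic: fetch state, run the existing
 * service-action logic, then persist state through the callback. */
static int emulate_pr_out(struct pr_backend *be, enum pro_sa sa)
{
    char state[64];

    if (be->get_pr(state, sizeof(state)))
        return -1;
    switch (sa) {
    case PRO_REGISTER:
        /* existing core_scsi3_emulate_pro_register() logic would run here;
         * the hard-coded key below is purely for illustration */
        snprintf(state, sizeof(state), "key=0xabc");
        break;
    case PRO_RESERVE:
        /* existing core_scsi3_emulate_pro_reserve() logic would run here */
        break;
    }
    return be->set_pr(state);
}
```

A clustered backend would only replace `default_get`/`default_set`; the service-action dispatch and PR semantics stay in one place.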
Adding Bodo, who is working on an alternative approach.

On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>> These commits and the following intend to implement Persistent
>> Reservation operations for TCMU devices.
>
> Err, hell no.
>
> If you are that tightly integrated with the target code that you can
> implement persistent reservation you need to use kernel code.
> Everything else just creates a way too complex interface.

Hey Christoph,

Just wanted to make sure I understood this comment. In Lingshan's patches I
think he was going to end up calling out to userspace/tcmu-runner, and
there he was going to make Ceph calls which basically translate PGR
operations to Ceph requests. Are you saying we should just add some kernel
module that makes the Ceph calls? This would then avoid the mess of having
the split PGR processing design in this patchset?

Also, just to make it a little more fun :) there is another person working
on a completely different design.

Bodo's design is for tcmu only and allows userspace to just handle
everything. PGR commands and related checks for conflicts are all handled
in the userspace daemon that is receiving commands from the
target_core_user kernel module.

The one hiccup is that this design requires a change where the I_T Nexus
info is passed to userspace, so it can handle certain PGR registration
commands correctly and also, for each command, check if a nexus has been
registered and what reservation there is for it. We could:

1. Just pass that info to userspace in each tcmu request.

2. Do something like the old kernel tgt code did, where when an I_T nexus
was established in the kernel it sent an event to userspace. Each SCSI
command passed to userspace had some tag that allowed userspace to match
the tag with the I_T Nexus.
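Option 2 above (announce each I_T nexus once and let later commands carry only a small tag) amounts to userspace keeping a tag-to-nexus table. A minimal sketch, with all names being hypothetical rather than from any of the designs discussed:

```c
#include <assert.h>
#include <string.h>

#define MAX_NEXUS 64

struct nexus_entry {
    int used;
    unsigned int tag;       /* tag the kernel puts in each tcmu request */
    char nexus[64];         /* I_T nexus string announced at login time */
};

static struct nexus_entry nexus_table[MAX_NEXUS];

/* Called when the kernel announces a newly established I_T nexus. */
static int nexus_register(unsigned int tag, const char *nexus)
{
    for (int i = 0; i < MAX_NEXUS; i++) {
        if (!nexus_table[i].used) {
            nexus_table[i].used = 1;
            nexus_table[i].tag = tag;
            strncpy(nexus_table[i].nexus, nexus,
                    sizeof(nexus_table[i].nexus) - 1);
            return 0;
        }
    }
    return -1; /* table full */
}

/* Called per SCSI command to resolve the tag back to its nexus. */
static const char *nexus_lookup(unsigned int tag)
{
    for (int i = 0; i < MAX_NEXUS; i++)
        if (nexus_table[i].used && nexus_table[i].tag == tag)
            return nexus_table[i].nexus;
    return NULL; /* command from an unknown/unregistered nexus */
}
```

The trade-off versus option 1 is per-command payload size against the need for a reliable one-time registration event and table eviction when a nexus goes away.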
On 06/16/2018 02:20 PM, Mike Christie wrote:
> Adding Bodo who is working on a alternative approach.
>
> On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
>> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>>> These commits and the following intend to implement Persistent
>>> Reservation operations for TCMU devices.
>>
>> Err, hell no.
>>
>> If you are that tightly integrated with the target code that you can
>> implement persistent reservation you need to use kernel code.
>> Everything else just creates a way too complex interface.
>
> Hey Christoph,
>
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Oh yeah, I meant to also ask: if I understood you correctly above, then did
you also just want us to add the target_core_rbd module that did
READ/WRITE/VAAI commands too?
On 2018/6/17 3:20, Mike Christie wrote:
> Adding Bodo who is working on a alternative approach.
>
> On 06/16/2018 12:22 AM, Christoph Hellwig wrote:
>> On Sat, Jun 16, 2018 at 02:23:10AM +0800, Zhu Lingshan wrote:
>>> These commits and the following intend to implement Persistent
>>> Reservation operations for TCMU devices.
>> Err, hell no.
>>
>> If you are that tightly integrated with the target code that you can
>> implement persistent reservation you need to use kernel code.
>> Everything else just creates a way too complex interface.
> Hey Christoph,
>
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Hello Mike and Christoph,

Thanks, Mike's comment inspired me. If I understand this correctly, it is
suggested to implement this whole solution in the kernel, avoiding
splitting PGR handling between kernel and userspace, to make it less
complex. I can try to implement this by sending OSD requests from the
kernel side; then we don't need tcmu-runner to support it, and there is no
need to use netlink to exchange information with userspace. Pure kernel
code seems much simpler. Do you think this may be better?

Thanks,
BR
Zhu Lingshan

> Also just to make it a little more fun :) There is another person
> working on a completely different design.
>
> Bodo's design is for tcmu only and allows userspace to just handle
> everything. PGR commands and related checks for conflicts are all
> handled in the userspace daemon that is receiving commands from the
> target_core_user kernel module.
>
> The one hiccup is this design requires a change where the I_T Nexus info
> is passed to userspace so it can handle certain PGR registration
> commands correction and also for each command check if a nexus has been
> registered and what reservation there is for it. We could:
>
> 1. Just pass that info to userspace in each tcmu request.
>
> 2. Do something like the old kernel tgt code did where when a I_T nexus
> was established in the kernel it sent an event to userspace. Each SCSI
> command passed to userspace had some tag that allowed userspace to match
> the tag with the I_T Nexus.
On Sat, Jun 16, 2018 at 02:20:12PM -0500, Mike Christie wrote:
> Just wanted to make sure I understood this comment. In Lingshan's
> patches I think he was going to end up calling out to
> userspace/tcmu-runner and there he was going to make ceph calls which
> basically translate PGR operations to ceph requests. Are you saying we
> should just add some kernel module that makes the ceph calls? This would
> then avoid the mess of having the split PGR processing design in this
> patchset?

Yes.
On Sat, Jun 16, 2018 at 02:25:47PM -0500, Mike Christie wrote:
>> Just wanted to make sure I understood this comment. In Lingshan's
>> patches I think he was going to end up calling out to
>> userspace/tcmu-runner and there he was going to make ceph calls which
>> basically translate PGR operations to ceph requests. Are you saying we
>> should just add some kernel module that makes the ceph calls? This would
>> then avoid the mess of having the split PGR processing design in this
>> patchset?
>
> Oh yeah, I meant to also ask if I understood you correctly above, then
> did you also just want us to add the target_core_rbd module that did
> READ/WRITE/VAAI commands too?

In general I'd prefer to have generic APIs for it so that we could also
front an NVMe target with the same ceph backend. But we can at least get
that work started this way, and I'll look into helping with a proper API.
On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
> Hello Mike and Christoph,
> Thanks Mike's comment inspired me, if I understand this correctly, it is
> suggested to implement this whole solution in kernel, avoid
> splitting PRG handling in both kernel and userspace, make it not that
> complex. I can try to implement this by sending OSD requests
> from kernel side, then we don't need tcmu-runner supporting is, no need to
> use netlink to exchange information with userspace, pure kernel code, seems
> much more simple. Do you think this may be better?

Yes. And SuSE actually ships such a backend (originally written by Mike),
although it would still need some work to get into shape.
On 2018/6/18 19:12, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
>> Hello Mike and Christoph,
>> Thanks Mike's comment inspired me, if I understand this correctly, it is
>> suggested to implement this whole solution in kernel, avoid
>> splitting PRG handling in both kernel and userspace, make it not that
>> complex. I can try to implement this by sending OSD requests
>> from kernel side, then we don't need tcmu-runner supporting is, no need to
>> use netlink to exchange information with userspace, pure kernel code, seems
>> much more simple. Do you think this may be better?
> Yes. And SuSE actually ships such a backend (originally writen by
> Mike), although it would still need some work to get into shape.

Thanks, I will work on that.

Thanks,
BR
Zhu Lingshan
On Mon, 18 Jun 2018 04:12:17 -0700, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 12:40:56PM +0800, Zhu Lingshan wrote:
>> Hello Mike and Christoph,
>> Thanks Mike's comment inspired me, if I understand this correctly, it is
>> suggested to implement this whole solution in kernel, avoid
>> splitting PRG handling in both kernel and userspace, make it not that
>> complex. I can try to implement this by sending OSD requests
>> from kernel side, then we don't need tcmu-runner supporting is, no need to
>> use netlink to exchange information with userspace, pure kernel code, seems
>> much more simple. Do you think this may be better?
>
> Yes. And SuSE actually ships such a backend (originally writen by
> Mike), although it would still need some work to get into shape.

FWIW, the kernel backend is also shipped by PetaSAN. I'd be happy to
restart efforts with Mike to get this into shape for mainline. The current
target_core_rbd PR implementation is suboptimal in that it duplicates
quite a bit of protocol logic from target_core_pr.

Cheers, David
On Mon, Jun 18, 2018 at 01:41:48PM +0200, David Disseldorp wrote:
> On Mon, 18 Jun 2018 04:12:17 -0700, Christoph Hellwig wrote:
>
>> Yes. And SuSE actually ships such a backend (originally writen by
>> Mike), although it would still need some work to get into shape.
>
> FWIW, the kernel backend is also shipped by petasan. I'd be happy to
> restart efforts with Mike to get this in shape for mainline.

That would be great!
diff --git a/include/uapi/linux/target_core_user.h b/include/uapi/linux/target_core_user.h
index 0be80f72646b..2d5c3e55d3f8 100644
--- a/include/uapi/linux/target_core_user.h
+++ b/include/uapi/linux/target_core_user.h
@@ -132,9 +132,13 @@ enum tcmu_genl_cmd {
 	TCMU_CMD_ADDED_DEVICE,
 	TCMU_CMD_REMOVED_DEVICE,
 	TCMU_CMD_RECONFIG_DEVICE,
+	TCMU_CMD_GET_PR_INFO,
+	TCMU_CMD_SET_PR_INFO,
 	TCMU_CMD_ADDED_DEVICE_DONE,
 	TCMU_CMD_REMOVED_DEVICE_DONE,
 	TCMU_CMD_RECONFIG_DEVICE_DONE,
+	TCMU_CMD_GET_PR_INFO_DONE,
+	TCMU_CMD_SET_PR_INFO_DONE,
 	TCMU_CMD_SET_FEATURES,
 	__TCMU_CMD_MAX,
 };
@@ -151,8 +155,23 @@ enum tcmu_genl_attr {
 	TCMU_ATTR_CMD_STATUS,
 	TCMU_ATTR_DEVICE_ID,
 	TCMU_ATTR_SUPP_KERN_CMD_REPLY,
+	TCMU_ATTR_PR_INFO,
 	__TCMU_ATTR_MAX,
 };
 
 #define TCMU_ATTR_MAX (__TCMU_ATTR_MAX - 1)
+/* This struct helps store the Persistent Reservation which we
+ * are handling; it is encoded from or decoded to the string buffer in
+ * "struct tcmu_dev_pr_info".
+ */
+struct tcmu_pr_info {
+	u32 vers;		/* on disk format version number */
+	u32 seq;		/* sequence number bumped every xattr write */
+	struct tcmu_scsi2_rsv *scsi2_rsv;	/* SCSI2 reservation if any */
+	u32 gen;		/* PR generation number */
+	struct tcmu_pr_rsv *rsv;	/* SCSI3 reservation if any */
+	u32 num_regs;		/* number of registrations */
+	struct list_head regs;	/* list of registrations */
+};
+
 #endif
These commits and the following intend to implement Persistent Reservation
operations for TCMU devices. This series of commits implements these PR
operations: PR_Out_Register, PR_Out_Reserve, PR_Out_Clear, PR_Out_Preempt,
PR_Out_Release and PR_In_ReadKeys. The next wave of patches will contain
the other PR operations.

This patch adds a struct tcmu_pr_info to store PR information for the
handling functions, and adds command codes and attrs for the netlink
interfaces.

Design note:
In order to get consistent Persistent Reservation results from multiple
targets hosting the same TCMU device (like Ceph RBD), this solution stores
a string on the device itself (like RBD metadata). Every time the kernel
receives a PR request against a TCMU device, it will query this string (a
netlink attr carried by a netlink cmd), then decide whether the PR request
should be performed; after processing, it will update this string.

For example: when receiving a PR Reserve request, the kernel will send a
netlink message to tcmu-runner to get the string, and tcmu-runner will
respond by sending the PR info string to the kernel. The kernel will then
decode the string, find information like the key and the reservation
holder, and process the request. After processing, it will update the
string and send the updated string to tcmu-runner, so that tcmu-runner can
write it back to the device (like RBD metadata). So we make the device
itself the "single" response point; with lock protection, we will get a
consistent result even when more than one initiator sends multiple PR
requests via multiple targets.

Signed-off-by: Zhu Lingshan <lszhu@suse.com>
---
 include/uapi/linux/target_core_user.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
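The round trip described in the design note (fetch the PR info string, decode it, update it, bump the sequence number, write it back) can be sketched for just the header fields of struct tcmu_pr_info. The "vers=..,seq=..,gen=.." text format below is purely illustrative; the real encoding lives in the rest of the patch set:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct pr_hdr {
    unsigned int vers; /* on-disk format version */
    unsigned int seq;  /* bumped on every write */
    unsigned int gen;  /* PR generation number */
};

/* Decode the flat string received over netlink into header fields. */
static int pr_hdr_decode(const char *buf, struct pr_hdr *h)
{
    if (sscanf(buf, "vers=%u,seq=%u,gen=%u",
               &h->vers, &h->seq, &h->gen) != 3)
        return -1;
    return 0;
}

/* Re-encode the header for the write-back to tcmu-runner. */
static int pr_hdr_encode(char *buf, size_t len, const struct pr_hdr *h)
{
    int n = snprintf(buf, len, "vers=%u,seq=%u,gen=%u",
                     h->vers, h->seq, h->gen);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A PR Out request would decode, apply the service action, increment `seq` (and `gen` where the PR generation changes), then encode and hand the string back for persistence on the device.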