Message ID | 20230907034423.3928010-1-yi.zhang@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [blktests] tests/nvme/031: fix connecting faiure | expand |
On Sep 07, 2023 / 11:44, Yi Zhang wrote: > allow_any_host was disabled during _create_nvmet_subsystem, call > _create_nvmet_host before connecting to allow the host to connect. > > [76096.420586] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 > [76096.440595] nvmet_tcp: enabling port 0 (127.0.0.1:4420) > [76096.491344] nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 for subsystem blktests-subsystem-0 not allowed > [76096.505049] nvme nvme2: Connect for subsystem blktests-subsystem-0 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 > [76096.519609] nvme nvme2: failed to connect queue: 0 ret=16772 > > Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Thanks for the catching this. I looked back the past changes and found that the commit c32b233b7dd6 ("nvme/rc: Add helper for adding/removing to allow list") triggered the connection failure. So, I think a Fixes tag with this commit is required (I can add when this patch is applied). Even after the commit, the test case still passes. That's why I did not notice the connection failure. I think _nvme_connect_subsys() should check exit status of "nvme connect" command and print an error message on failure. This will help to catch similar connection failures in future. As for this patch, I will wait some days for further comments. Also will do some confirmation test runs. > --- > tests/nvme/031 | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tests/nvme/031 b/tests/nvme/031 > index d5c2561..696db2d 100755 > --- a/tests/nvme/031 > +++ b/tests/nvme/031 > @@ -42,10 +42,12 @@ test() { > for ((i = 0; i < iterations; i++)); do > _create_nvmet_subsystem "${subsys}$i" "${loop_dev}" > _add_nvmet_subsys_to_port "${port}" "${subsys}$i" > + _create_nvmet_host "${subsys}$i" "${def_hostnqn}" > _nvme_connect_subsys "${nvme_trtype}" "${subsys}$i" > _nvme_disconnect_subsys "${subsys}$i" >> "${FULL}" 2>&1 > _remove_nvmet_subsystem_from_port "${port}" "${subsys}$i" > _remove_nvmet_subsystem "${subsys}$i" > + _remove_nvmet_host "${def_hostnqn}" > done > > _remove_nvmet_port "${port}" > -- > 2.34.3 >
On Fri, Sep 08, 2023 at 11:26:31AM +0000, Shinichiro Kawasaki wrote: > On Sep 07, 2023 / 11:44, Yi Zhang wrote: > > allow_any_host was disabled during _create_nvmet_subsystem, call > > _create_nvmet_host before connecting to allow the host to connect. > > > > [76096.420586] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 > > [76096.440595] nvmet_tcp: enabling port 0 (127.0.0.1:4420) > > [76096.491344] nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 for subsystem blktests-subsystem-0 not allowed > > [76096.505049] nvme nvme2: Connect for subsystem blktests-subsystem-0 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 > > [76096.519609] nvme nvme2: failed to connect queue: 0 ret=16772 > > > > Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> > Thanks for the catching this. I looked back the past changes and found that the > commit c32b233b7dd6 ("nvme/rc: Add helper for adding/removing to allow list") > triggered the connection failure. So, I think a Fixes tag with this commit is > required (I can add when this patch is applied). > > Even after the commit, the test case still passes. That's why I did not notice > the connection failure. I think _nvme_connect_subsys() should check exit status > of "nvme connect" command and print an error message on failure. This will help > to catch similar connection failures in future. I was running into a similiar problem for (not yet existing) nvme/050 test case [1]: nvmf_wait_for_state() { local def_state_timeout=5 local subsys_name="$1" local state="$2" local timeout="${3:-$def_state_timeout}" local nvmedev local state_file local start_time local end_time nvmedev=$(_find_nvme_dev "${subsys_name}") state_file="/sys/class/nvme-fabrics/ctl/${nvmedev}/state" start_time=$(date +%s) while ! grep -q "${state}" "${state_file}"; do sleep 1 end_time=$(date +%s) if (( end_time - start_time > timeout )); then echo "expected state \"${state}\" not " \ "reached within ${timeout} seconds" return 1 fi done return 0 } _nvme_connect_subsys "${nvme_trtype}" "${def_subsysnqn}" \ --hostnqn "${def_hostnqn}" \ --reconnect-delay 1 \ --ctrl-loss-tmo 10 nvmf_wait_for_state "${def_subsysnqn}" "live" nvmedev=$(_find_nvme_dev "${def_subsysnqn}") We could make this a bit more generic and move it into the connect helper. What do you think? [1] https://lore.kernel.org/linux-nvme/20230621155825.20146-2-dwagner@suse.de/
On Sep 08, 2023 / 15:09, Daniel Wagner wrote: > On Fri, Sep 08, 2023 at 11:26:31AM +0000, Shinichiro Kawasaki wrote: > > On Sep 07, 2023 / 11:44, Yi Zhang wrote: > > > allow_any_host was disabled during _create_nvmet_subsystem, call > > > _create_nvmet_host before connecting to allow the host to connect. > > > > > > [76096.420586] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 > > > [76096.440595] nvmet_tcp: enabling port 0 (127.0.0.1:4420) > > > [76096.491344] nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 for subsystem blktests-subsystem-0 not allowed > > > [76096.505049] nvme nvme2: Connect for subsystem blktests-subsystem-0 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 > > > [76096.519609] nvme nvme2: failed to connect queue: 0 ret=16772 > > > > > > Signed-off-by: Yi Zhang <yi.zhang@redhat.com> > > Reviewed-by: Daniel Wagner <dwagner@suse.de> > > > Thanks for the catching this. I looked back the past changes and found that the > > commit c32b233b7dd6 ("nvme/rc: Add helper for adding/removing to allow list") > > triggered the connection failure. So, I think a Fixes tag with this commit is > > required (I can add when this patch is applied). I've applied the fix. Thanks! > > > > Even after the commit, the test case still passes. That's why I did not notice > > the connection failure. I think _nvme_connect_subsys() should check exit status > > of "nvme connect" command and print an error message on failure. This will help > > to catch similar connection failures in future. > > I was running into a similiar problem for (not yet existing) nvme/050 > test case [1]: > > nvmf_wait_for_state() { > local def_state_timeout=5 > local subsys_name="$1" > local state="$2" > local timeout="${3:-$def_state_timeout}" > local nvmedev > local state_file > local start_time > local end_time > > nvmedev=$(_find_nvme_dev "${subsys_name}") > state_file="/sys/class/nvme-fabrics/ctl/${nvmedev}/state" > > start_time=$(date +%s) > while ! grep -q "${state}" "${state_file}"; do > sleep 1 > end_time=$(date +%s) > if (( end_time - start_time > timeout )); then > echo "expected state \"${state}\" not " \ > "reached within ${timeout} seconds" > return 1 > fi > done > > return 0 > } > > _nvme_connect_subsys "${nvme_trtype}" "${def_subsysnqn}" \ > --hostnqn "${def_hostnqn}" \ > --reconnect-delay 1 \ > --ctrl-loss-tmo 10 > > nvmf_wait_for_state "${def_subsysnqn}" "live" > nvmedev=$(_find_nvme_dev "${def_subsysnqn}") > > We could make this a bit more generic and move it into the connect > helper. What do you think? This nvme state file check looks a bit complicated, but looks more robust than "nvme connect" command exit status check. Do you think that "nvme connect" can fail even when "nvme connect" command returns good status? If so, your approach will be the way to choose. > > [1] https://lore.kernel.org/linux-nvme/20230621155825.20146-2-dwagner@suse.de/
On Mon, Sep 11, 2023 at 11:00:17AM +0000, Shinichiro Kawasaki wrote: > > nvmf_wait_for_state "${def_subsysnqn}" "live" > > nvmedev=$(_find_nvme_dev "${def_subsysnqn}") > > > > We could make this a bit more generic and move it into the connect > > helper. What do you think? > > This nvme state file check looks a bit complicated, but looks more robust than > "nvme connect" command exit status check. Do you think that "nvme connect" can > fail even when "nvme connect" command returns good status? If so, your approach > will be the way to choose. Currently, 'nvme connect' is a synchronous call for tcp/rdma but not for fc. 'nvme connect' for tcp/rdma will report an error if something is wrong but not for fc, because it always return successfully. The nvme/005 is exposing the behavior differences between the transports. My long time goal is to address and make all transport behave the same way (unification of the state machines). But as it currently stands fc would need someting like this to make sure we are not blindly reporting success just because the 'nvme connect' call is successful. This type of check would make the test suite more robust and better in detecting errors.
On Sep 11, 2023 / 14:16, Daniel Wagner wrote: > On Mon, Sep 11, 2023 at 11:00:17AM +0000, Shinichiro Kawasaki wrote: > > > nvmf_wait_for_state "${def_subsysnqn}" "live" > > > nvmedev=$(_find_nvme_dev "${def_subsysnqn}") > > > > > > We could make this a bit more generic and move it into the connect > > > helper. What do you think? > > > > This nvme state file check looks a bit complicated, but looks more robust than > > "nvme connect" command exit status check. Do you think that "nvme connect" can > > fail even when "nvme connect" command returns good status? If so, your approach > > will be the way to choose. > > Currently, 'nvme connect' is a synchronous call for tcp/rdma but not for > fc. 'nvme connect' for tcp/rdma will report an error if something is > wrong but not for fc, because it always return successfully. > > The nvme/005 is exposing the behavior differences between the > transports. My long time goal is to address and make all transport > behave the same way (unification of the state machines). But as it > currently stands fc would need someting like this to make sure we are > not blindly reporting success just because the 'nvme connect' call is > successful. > > This type of check would make the test suite more robust and better in > detecting errors. Thanks for the explanation. I agree that the nvme state file check in _nvme_connect_subsys() it the more robust and the better.
diff --git a/tests/nvme/031 b/tests/nvme/031 index d5c2561..696db2d 100755 --- a/tests/nvme/031 +++ b/tests/nvme/031 @@ -42,10 +42,12 @@ test() { for ((i = 0; i < iterations; i++)); do _create_nvmet_subsystem "${subsys}$i" "${loop_dev}" _add_nvmet_subsys_to_port "${port}" "${subsys}$i" + _create_nvmet_host "${subsys}$i" "${def_hostnqn}" _nvme_connect_subsys "${nvme_trtype}" "${subsys}$i" _nvme_disconnect_subsys "${subsys}$i" >> "${FULL}" 2>&1 _remove_nvmet_subsystem_from_port "${port}" "${subsys}$i" _remove_nvmet_subsystem "${subsys}$i" + _remove_nvmet_host "${def_hostnqn}" done _remove_nvmet_port "${port}"
allow_any_host was disabled during _create_nvmet_subsystem, call _create_nvmet_host before connecting to allow the host to connect. [76096.420586] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 [76096.440595] nvmet_tcp: enabling port 0 (127.0.0.1:4420) [76096.491344] nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 for subsystem blktests-subsystem-0 not allowed [76096.505049] nvme nvme2: Connect for subsystem blktests-subsystem-0 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 [76096.519609] nvme nvme2: failed to connect queue: 0 ret=16772 Signed-off-by: Yi Zhang <yi.zhang@redhat.com> --- tests/nvme/031 | 2 ++ 1 file changed, 2 insertions(+)