
[v4,00/11] Witness protocol support for transparent failover

Message ID 20201130180257.31787-1-scabrero@suse.de (mailing list archive)

Samuel Cabrero Nov. 30, 2020, 6:02 p.m. UTC
Changes from v3:
  * Add registration ID attribute to the unregister netlink message.
    The userspace daemon will include it in the log output as it is
    useful for debugging purposes.

Changes from v2:
  * Fix 'no previous prototype' kernel test robot warning in
    fs/cifs/netlink.c

Changes from v1:
  * Update SPDX header in user space API files to LGPL-2.1+ with
    Linux-syscall-note

[PATCH v4 01/11] cifs: Make extract_hostname function public
[PATCH v4 02/11] cifs: Make extract_sharename function public
[PATCH v4 03/11] cifs: Register generic netlink family
[PATCH v4 04/11] cifs: add witness mount option and data structs
[PATCH v4 05/11] cifs: Send witness register and unregister commands
[PATCH v4 06/11] cifs: Set witness notification handler for messages
[PATCH v4 07/11] cifs: Add witness information to debug data dump
[PATCH v4 08/11] cifs: Send witness register messages to userspace
[PATCH v4 09/11] cifs: Simplify reconnect code when dfs upcall is
[PATCH v4 10/11] cifs: Handle witness client move notification
[PATCH v4 11/11] cifs: Handle witness share moved notification

 fs/cifs/Kconfig                        |  11 +
 fs/cifs/Makefile                       |   2 +
 fs/cifs/cache.c                        |  24 -
 fs/cifs/cifs_debug.c                   |  13 +
 fs/cifs/cifs_swn.c                     | 727 +++++++++++++++++++++++++
 fs/cifs/cifs_swn.h                     |  25 +
 fs/cifs/cifsfs.c                       |  22 +-
 fs/cifs/cifsglob.h                     |   8 +
 fs/cifs/cifsproto.h                    |   2 +
 fs/cifs/connect.c                      | 141 +++--
 fs/cifs/fscache.c                      |   1 +
 fs/cifs/fscache.h                      |   1 -
 fs/cifs/misc.c                         |  56 ++
 fs/cifs/netlink.c                      |  89 +++
 fs/cifs/netlink.h                      |  16 +
 include/uapi/linux/cifs/cifs_netlink.h |  63 +++
 16 files changed, 1122 insertions(+), 79 deletions(-)

Comments

Aurélien Aptel Dec. 9, 2020, 12:36 p.m. UTC | #1
Hi,

I've added a "failover" test group to the buildbot that mounts a
"regular" (non-scaleout) cluster and switches the fileserver to another
cluster node live and it looks like it's working: you can keep on using
the mount.

In non-scale-out, the file server has its own virtual IP that both nodes
share. So when you "move" the fileserver to a different node, it doesn't
actually change IP. After doing that we realized that this already works
without -o witness, since the client is simply reconnecting to the same IP.

Now we need to add a scale-out cluster fileserver to the buildbot where,
IIUC (please correct me, Samuel), the fileserver uses the node IP instead
of the virtual IP shared by the nodes. Then, when we move the fileserver,
its IP address actually changes and we can test this properly.

As for the code, I'm not an expert on reconnection, but it looks OK for
merging, I think. It doesn't handle multichannel, but multichannel doesn't
handle reconnection well anyway. There is one issue, which pops up in
other parts of the code as well.

If you run a command too quickly after the transition, it will fail with
EIO, so the failover is not completely transparent. I think the same
issue can occur with DFS (Paulo, any ideas/comments?), which is why in
the DFS tests we run ls twice and ignore the result of the first.

The DFS test code:

    def io_reco_test(unc, opts, cwd, expected):
        try:
            lsdir = '.'
            cddir = os.path.join(ARGS.mnt, cwd)
            info(("TEST: mount {unc} , cd {cddir} , ls {lsdir}, expect:[{expect}]\n"+
                  "      disconnect {cddir} , ls#1 {lsdir} (fail here is ok),  ls#2 (fail here NOT ok)").format(
                      unc=unc, cddir=cddir, lsdir=lsdir, expect=" ".join(['"%s"'%x for x in expected])
                  ))

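For reference, the double-ls workaround described above could be sketched as a standalone helper like this (the helper name `ls_twice` and the use of `subprocess` are illustrative assumptions, not the actual buildbot code):

```python
import subprocess

def ls_twice(path):
    """Run ls twice after a failover.

    The first attempt may fail with EIO while the client is still
    reconnecting, so its result is deliberately ignored; only the
    second attempt's success and output are checked.
    """
    # First ls after the transition: may return EIO, outcome ignored.
    subprocess.run(["ls", path], capture_output=True)
    # Second ls: this one is the actual check.
    second = subprocess.run(["ls", path], capture_output=True)
    return second.returncode == 0, second.stdout.decode().split()
```

The point of the two calls is only to absorb the one transient EIO observed right after the fileserver move; a retry loop with a deadline would be the more robust variant.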

Cheers,
Paulo Alcantara Dec. 9, 2020, 3:06 p.m. UTC | #2
Aurélien Aptel <aaptel@suse.de> writes:

> I've added a "failover" test group to the buildbot that mounts a
> "regular" (non-scaleout) cluster and switches the fileserver to another
> cluster node live and it looks like it's working: you can keep on using
> the mount.
>
> In non-scale-out, the file server has its own virtual IP that both nodes
> share. So when you "move" the fileserver to a different node, it doesn't
> actually change IP. After doing that we realized that this already works
> without -o witness, since the client is simply reconnecting to the same IP.
>
> Now we need to add a scale-out cluster fileserver to the buildbot where,
> IIUC (please correct me, Samuel), the fileserver uses the node IP instead
> of the virtual IP shared by the nodes. Then, when we move the fileserver,
> its IP address actually changes and we can test this properly.
>
> As for the code, I'm not an expert on reconnection, but it looks OK for
> merging, I think. It doesn't handle multichannel, but multichannel doesn't
> handle reconnection well anyway. There is one issue, which pops up in
> other parts of the code as well.
>
> If you run a command too quickly after the transition, it will fail with
> EIO, so the failover is not completely transparent. I think the same
> issue can occur with DFS (Paulo, any ideas/comments?), which is why in
> the DFS tests we run ls twice and ignore the result of the first.
>
> The DFS test code:
>
>     def io_reco_test(unc, opts, cwd, expected):
>         try:
>             lsdir = '.'
>             cddir = os.path.join(ARGS.mnt, cwd)
>             info(("TEST: mount {unc} , cd {cddir} , ls {lsdir}, expect:[{expect}]\n"+
>                   "      disconnect {cddir} , ls#1 {lsdir} (fail here is ok),  ls#2 (fail here NOT ok)").format(
>                       unc=unc, cddir=cddir, lsdir=lsdir, expect=" ".join(['"%s"'%x for x in expected])
>                   ))

For soft mounts, it is OK to ignore the first ls.  But for hard mounts
we shouldn't ignore the first ls, as the operation must retry forever
until failover is done.