Message ID | 20210603225907.19981-1-olga.kornievskaia@gmail.com (mailing list archive) |
---|---|
Series | modify xprt state using sysfs |
On Fri, 04 Jun 2021, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <kolga@netapp.com>
>
> When a transport gets stuck, it is desired to be able to move the tasks
> that have been stuck/queued on that transport to another.

This is interesting.....
A long-standing problem with NFS is that it is tricky to reliably
unmount a filesystem if the network is not responding. It is possible,
but you need to identify all the processes blocked on the filesystem and
SIGKILL them.
My most recent exposure to this was when shutdown hung for someone
because NetworkManager shut down the wifi before NFS filesystems were
unmounted. This is arguably a config error, but the same problem could
happen with a power outage instead of NetworkManager breaking the wifi.

It would be nice to be able to forcibly unmount filesystems, e.g. mark
the transport as dead in such a way that all requests report EIO (or
similar).
This is obviously a big hammer, probably bigger than justified for use
with "umount -f", but sometimes it is a necessary hammer.

Could your work lead to being able to do this? Could I write a shutdown
script that runs when there is no more network and no expectation of any
network ever again, and which marks all transports as dead - and then
wakes up all pending rpc tasks?

Thanks,
NeilBrown

On Thu, Jun 3, 2021 at 7:57 PM NeilBrown <neilb@suse.de> wrote:
>
> On Fri, 04 Jun 2021, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <kolga@netapp.com>
> >
> > When a transport gets stuck, it is desired to be able to move the tasks
> > that have been stuck/queued on that transport to another.
>
> This is interesting.....
> A long-standing problem with NFS is that it is tricky to reliably
> unmount a filesystem if the network is not responding. It is possible,
> but you need to identify all the processes blocked on the filesystem and
> SIGKILL them.
> My most recent exposure to this was when shutdown hung for someone
> because NetworkManager shut down the wifi before NFS filesystems were
> unmounted. This is arguably a config error, but the same problem could
> happen with a power outage instead of NetworkManager breaking the wifi.
>
> It would be nice to be able to forcibly unmount filesystems, e.g. mark
> the transport as dead in such a way that all requests report EIO (or
> similar).
> This is obviously a big hammer, probably bigger than justified for use
> with "umount -f", but sometimes it is a necessary hammer.
>
> Could your work lead to being able to do this? Could I write a shutdown
> script that runs when there is no more network and no expectation of any
> network ever again, and which marks all transports as dead - and then
> wakes up all pending rpc tasks?

I thought that was something that Ben was looking into in parallel to
my efforts. In this patch series I'm only addressing the case where
some transport is unresponsive and it's not the "main" transport. I
don't allow the main transport to be put offline or removed. As you
said, in that case the tasks need to be errored out to the application.
But yes, I think in the next step we can allow the main transport to be
removed, erroring out the tasks and allowing an unmount when the server
isn't responding.

>
> Thanks,
> NeilBrown

From: Olga Kornievskaia <kolga@netapp.com>

When a transport gets stuck, it is desired to be able to move the tasks
that have been stuck/queued on that transport to another.

This patch series attempts to do so. The first patch takes a transport
and marks it offline so that no more tasks are queued on it. The second
identifies which tasks can be retried on a different transport
(NFSv4.1+ only). Lastly, once the transport is deemed bad and in need of
removal, it is marked to be removed. Any tasks that are stuck there will
now release the transport and try picking a different one. The transport
is then removed from the list of xprts.

The first transport with which the RPC client was created is considered
the main transport and can't be taken offline or removed.

Olga Kornievskaia (3):
  sunrpc: take a xprt offline using sysfs
  NFSv4.1 identify and mark RPC tasks that can move between transports
  sunrpc: remove an offlined xprt using sysfs

 fs/nfs/nfs4proc.c            | 38 ++++++++++++++++++++++----
 fs/nfs/pagelist.c            |  8 ++++--
 fs/nfs/write.c               |  6 ++++-
 include/linux/sunrpc/sched.h |  2 ++
 include/linux/sunrpc/xprt.h  |  3 +++
 net/sunrpc/clnt.c            | 21 +++++++++++++++
 net/sunrpc/sysfs.c           | 52 +++++++++++++++++++++++++++++++++---
 net/sunrpc/xprtmultipath.c   |  3 ++-
 8 files changed, 120 insertions(+), 13 deletions(-)
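
For readers who want to see the three steps from the cover letter side
by side, here is a minimal user-space sketch of the state transitions it
describes. It is not code from the series: every name in it
(model_switch, model_xprt_offline, model_task_requeue and so on) is
hypothetical, the round-robin pick is an assumption, and "remove
requires offline first" is only my reading of the third patch title.
The cover letter does not say what happens to tasks that cannot move,
so the model simply leaves them where they are.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_XPRTS 8

/* Hypothetical model types; none of these are kernel symbols. */
enum xprt_state { XPRT_ONLINE, XPRT_OFFLINE, XPRT_REMOVED };

struct model_xprt {
	enum xprt_state state;
	bool is_main;		/* first transport the client was created with */
};

struct model_switch {
	struct model_xprt xprts[MAX_XPRTS];
	size_t count;
	size_t next;		/* round-robin cursor (assumed policy) */
};

struct model_task {
	int xprt;		/* index of the transport in use, -1 if none */
	bool moveable;		/* step 2: only some (NFSv4.1+) tasks may move */
};

/* Add a transport; the first one added becomes the main transport. */
static int model_switch_add(struct model_switch *sw)
{
	assert(sw->count < MAX_XPRTS);
	sw->xprts[sw->count].state = XPRT_ONLINE;
	sw->xprts[sw->count].is_main = (sw->count == 0);
	return (int)sw->count++;
}

/* Step 1: mark a transport offline so no new tasks are queued on it. */
static int model_xprt_offline(struct model_switch *sw, int idx)
{
	if (idx < 0 || (size_t)idx >= sw->count)
		return -EINVAL;
	if (sw->xprts[idx].is_main || sw->xprts[idx].state == XPRT_REMOVED)
		return -EINVAL;	/* main transport can't be taken offline */
	sw->xprts[idx].state = XPRT_OFFLINE;
	return 0;
}

/* Step 3: remove a transport that has already been taken offline. */
static int model_xprt_remove(struct model_switch *sw, int idx)
{
	if (idx < 0 || (size_t)idx >= sw->count)
		return -EINVAL;
	if (sw->xprts[idx].state != XPRT_OFFLINE)
		return -EINVAL;	/* also covers the main transport */
	sw->xprts[idx].state = XPRT_REMOVED;
	return 0;
}

/* Pick the next online transport round-robin; -1 if there is none. */
static int model_pick_xprt(struct model_switch *sw)
{
	for (size_t i = 0; i < sw->count; i++) {
		size_t idx = (sw->next + i) % sw->count;

		if (sw->xprts[idx].state == XPRT_ONLINE) {
			sw->next = (idx + 1) % sw->count;
			return (int)idx;
		}
	}
	return -1;
}

/*
 * A moveable task whose transport was removed releases it and picks
 * another one. All other tasks are left untouched by this model.
 */
static int model_task_requeue(struct model_switch *sw, struct model_task *task)
{
	bool removed = task->xprt >= 0 &&
		       sw->xprts[task->xprt].state == XPRT_REMOVED;

	if (!removed || !task->moveable)
		return task->xprt;
	task->xprt = model_pick_xprt(sw);
	return task->xprt;
}
```

In this model a caller would add two transports, offline and then remove
the second, and call model_task_requeue() on a moveable task that was
queued on it; the task ends up on the main transport, which itself
rejects both the offline and the remove request.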