Message ID | 20190531122802.12814-2-zyan@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/3] libceph: add function that reset client's entity addr |
On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote: > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Hi Zheng, There should be an explanation in the commit message of what this is and why it is needed. I'm assuming the use case is recovering a blacklisted mount, but what is the intended semantics? What happens to in-flight OSD requests, MDS requests, open files, etc? These are things that should really be written down. Looking at the previous patch, it appears that in-flight OSD requests are simply retried, as they would be on a regular connection fault. Is that safe? Thanks, Ilya
On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote: > > > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > Hi Zheng, > > There should be an explanation in the commit message of what this is > and why it is needed. > > I'm assuming the use case is recovering a blacklisted mount, but what > is the intended semantics? What happens to in-flight OSD requests, > MDS requests, open files, etc? These are things that should really be > written down. > got it > Looking at the previous patch, it appears that in-flight OSD requests > are simply retried, as they would be on a regular connection fault. Is > that safe? > It's not safe. I'm still thinking about how to handle dirty data and in-flight OSD requests in this case. Regards Yan, Zheng > Thanks, > > Ilya
On Mon, Jun 3, 2019 at 6:51 AM Yan, Zheng <ukernel@gmail.com> wrote: > > On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote: > > > > > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > Hi Zheng, > > > > There should be an explanation in the commit message of what this is > > and why it is needed. > > > > I'm assuming the use case is recovering a blacklisted mount, but what > > is the intended semantics? What happens to in-flight OSD requests, > > MDS requests, open files, etc? These are things that should really be > > written down. > > > got it > > > Looking at the previous patch, it appears that in-flight OSD requests > > are simply retried, as they would be on a regular connection fault. Is > > that safe? > > > > It's not safe. I still thinking about how to handle dirty data and > in-flight osd requests in the this case. Can we figure out the consistency-handling story before we start adding interfaces for people to mis-use then please? It's not pleasant but if the client gets disconnected I'd assume we have to just return EIO or something on all outstanding writes and toss away our dirty data. There's not really another option that makes any sense, is there? -Greg > > Regards > Yan, Zheng > > > Thanks, > > > > Ilya
On Mon, Jun 3, 2019 at 7:54 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Mon, Jun 3, 2019 at 6:51 AM Yan, Zheng <ukernel@gmail.com> wrote: > > > > On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote: > > > > > > > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control > > > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > Hi Zheng, > > > > > > There should be an explanation in the commit message of what this is > > > and why it is needed. > > > > > > I'm assuming the use case is recovering a blacklisted mount, but what > > > is the intended semantics? What happens to in-flight OSD requests, > > > MDS requests, open files, etc? These are things that should really be > > > written down. > > > > > got it > > > > > Looking at the previous patch, it appears that in-flight OSD requests > > > are simply retried, as they would be on a regular connection fault. Is > > > that safe? > > > > > > > It's not safe. I still thinking about how to handle dirty data and > > in-flight osd requests in the this case. > > Can we figure out the consistency-handling story before we start > adding interfaces for people to mis-use then please? > > It's not pleasant but if the client gets disconnected I'd assume we > have to just return EIO or something on all outstanding writes and > toss away our dirty data. There's not really another option that makes > any sense, is there? Can we also discuss how useful is allowing to recover a mount after it has been blacklisted? After we fail everything with EIO and throw out all dirty state, how many applications would continue working without some kind of restart? And if you are restarting your application, why not get a new mount? IOW what is the use case for introducing a new debugfs knob that isn't that much different from umount+mount? Thanks, Ilya
On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > Can we also discuss how useful is allowing to recover a mount after it > has been blacklisted? After we fail everything with EIO and throw out > all dirty state, how many applications would continue working without > some kind of restart? And if you are restarting your application, why > not get a new mount? > > IOW what is the use case for introducing a new debugfs knob that isn't > that much different from umount+mount? People don't like it when their filesystem refuses to umount, which is what happens when the kernel client can't reconnect to the MDS right now. I'm not sure there's a practical way to deal with that besides some kind of computer admin intervention. (Even if you umount -l, that by design doesn't reply to syscalls and let the applications exit.) So it's not that we expect most applications to work, but we need to give them *something* that isn't a successful return, and we don't currently do that automatically on a disconnect. (And probably don't want to?)
On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > Can we also discuss how useful is allowing to recover a mount after it > > has been blacklisted? After we fail everything with EIO and throw out > > all dirty state, how many applications would continue working without > > some kind of restart? And if you are restarting your application, why > > not get a new mount? > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > that much different from umount+mount? > > People don't like it when their filesystem refuses to umount, which is > what happens when the kernel client can't reconnect to the MDS right > now. I'm not sure there's a practical way to deal with that besides > some kind of computer admin intervention. Furthermore, there are often many applications using the mount (even with containers) and it's not a sustainable position that any network/client/cephfs hiccup requires a remount. Also, an application that fails because of EIO is easy to deal with a layer above but a remount usually requires grump admin intervention.
On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > Can we also discuss how useful is allowing to recover a mount after it > > has been blacklisted? After we fail everything with EIO and throw out > > all dirty state, how many applications would continue working without > > some kind of restart? And if you are restarting your application, why > > not get a new mount? > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > that much different from umount+mount? > > People don't like it when their filesystem refuses to umount, which is > what happens when the kernel client can't reconnect to the MDS right > now. I'm not sure there's a practical way to deal with that besides > some kind of computer admin intervention. (Even if you umount -l, that > by design doesn't reply to syscalls and let the applications exit.) Well, that is what I'm saying: if an admin intervention is required anyway, then why not make it be umount+mount? That is certainly more intuitive than an obscure write-only file in debugfs... We have umount -f, which is there for tearing down a mount that is unresponsive. It should be able to deal with a blacklisted mount, if it can't it's probably a bug. Thanks, Ilya
On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > Can we also discuss how useful is allowing to recover a mount after it > > > has been blacklisted? After we fail everything with EIO and throw out > > > all dirty state, how many applications would continue working without > > > some kind of restart? And if you are restarting your application, why > > > not get a new mount? > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > that much different from umount+mount? > > > > People don't like it when their filesystem refuses to umount, which is > > what happens when the kernel client can't reconnect to the MDS right > > now. I'm not sure there's a practical way to deal with that besides > > some kind of computer admin intervention. (Even if you umount -l, that > > by design doesn't reply to syscalls and let the applications exit.) > > Well, that is what I'm saying: if an admin intervention is required > anyway, then why not make it be umount+mount? That is certainly more > intuitive than an obscure write-only file in debugfs... > I think 'umount -f' + 'mount -o remount' is better than the debugfs file > We have umount -f, which is there for tearing down a mount that is > unresponsive. It should be able to deal with a blacklisted mount, if > it can't it's probably a bug. > > Thanks, > > Ilya
On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote: > > On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > all dirty state, how many applications would continue working without > > > > some kind of restart? And if you are restarting your application, why > > > > not get a new mount? > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > that much different from umount+mount? > > > > > > People don't like it when their filesystem refuses to umount, which is > > > what happens when the kernel client can't reconnect to the MDS right > > > now. I'm not sure there's a practical way to deal with that besides > > > some kind of computer admin intervention. (Even if you umount -l, that > > > by design doesn't reply to syscalls and let the applications exit.) > > > > Well, that is what I'm saying: if an admin intervention is required > > anyway, then why not make it be umount+mount? That is certainly more > > intuitive than an obscure write-only file in debugfs... > > > > I think 'umount -f' + 'mount -o remount' is better than the debugfs file Why '-o remount'? I wouldn't expect 'umount -f' to leave behind any actionable state, it should tear down all data structures, mount point, etc. What would '-o remount' act on? Thanks, Ilya
On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > Can we also discuss how useful is allowing to recover a mount after it > > > has been blacklisted? After we fail everything with EIO and throw out > > > all dirty state, how many applications would continue working without > > > some kind of restart? And if you are restarting your application, why > > > not get a new mount? > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > that much different from umount+mount? > > > > People don't like it when their filesystem refuses to umount, which is > > what happens when the kernel client can't reconnect to the MDS right > > now. I'm not sure there's a practical way to deal with that besides > > some kind of computer admin intervention. > > Furthermore, there are often many applications using the mount (even > with containers) and it's not a sustainable position that any > network/client/cephfs hiccup requires a remount. Also, an application Well, it's not just any hiccup. It's one that lead to blacklisting... > that fails because of EIO is easy to deal with a layer above but a > remount usually requires grump admin intervention. I feel like I'm missing something here. Would figuring out $ID, obtaining root and echoing to /sys/kernel/debug/$ID/control make the admin less grumpy, especially when containers are involved? Doing the force_reconnect thing would retain the mount point, but how much use would it be? Would using existing (i.e. pre-blacklist) file descriptors be allowed? I assumed it wouldn't be (permanent EIO or something of that sort), so maybe that is the piece I'm missing... Thanks, Ilya
On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote: > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > all dirty state, how many applications would continue working without > > > > some kind of restart? And if you are restarting your application, why > > > > not get a new mount? > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > that much different from umount+mount? > > > > > > People don't like it when their filesystem refuses to umount, which is > > > what happens when the kernel client can't reconnect to the MDS right > > > now. I'm not sure there's a practical way to deal with that besides > > > some kind of computer admin intervention. > > > > Furthermore, there are often many applications using the mount (even > > with containers) and it's not a sustainable position that any > > network/client/cephfs hiccup requires a remount. Also, an application > > Well, it's not just any hiccup. It's one that lead to blacklisting... > > > that fails because of EIO is easy to deal with a layer above but a > > remount usually requires grump admin intervention. > > I feel like I'm missing something here. Would figuring out $ID, > obtaining root and echoing to /sys/kernel/debug/$ID/control make the > admin less grumpy, especially when containers are involved? > > Doing the force_reconnect thing would retain the mount point, but how > much use would it be? Would using existing (i.e. pre-blacklist) file > descriptors be allowed? I assumed it wouldn't be (permanent EIO or > something of that sort), so maybe that is the piece I'm missing... > I agree with Ilya here. I don't see how applications can just pick up where they left off after being blacklisted. Remounting in some fashion is really the only recourse here. To be clear, what happens to stateful objects (open files, byte-range locks, etc.) in this scenario? Were you planning to just re-open files and re-request locks that you held before being blacklisted? If so, that sounds like a great way to cause some silent data corruption...
On Tue, Jun 4, 2019 at 5:25 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote: > > > > On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > > all dirty state, how many applications would continue working without > > > > > some kind of restart? And if you are restarting your application, why > > > > > not get a new mount? > > > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > > that much different from umount+mount? > > > > > > > > People don't like it when their filesystem refuses to umount, which is > > > > what happens when the kernel client can't reconnect to the MDS right > > > > now. I'm not sure there's a practical way to deal with that besides > > > > some kind of computer admin intervention. (Even if you umount -l, that > > > > by design doesn't reply to syscalls and let the applications exit.) > > > > > > Well, that is what I'm saying: if an admin intervention is required > > > anyway, then why not make it be umount+mount? That is certainly more > > > intuitive than an obscure write-only file in debugfs... > > > > > > > I think 'umount -f' + 'mount -o remount' is better than the debugfs file > > Why '-o remount'? I wouldn't expect 'umount -f' to leave behind any > actionable state, it should tear down all data structures, mount point, > etc. What would '-o remount' act on? > If mount point is in use, 'umount -f ' only closes mds sessions and aborts osd requests. Mount point is still there, any operation on it will return -EIO. The remount change the mount point back to normal state. > Thanks, > > Ilya
On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote: > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote: > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > > all dirty state, how many applications would continue working without > > > > > some kind of restart? And if you are restarting your application, why > > > > > not get a new mount? > > > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > > that much different from umount+mount? > > > > > > > > People don't like it when their filesystem refuses to umount, which is > > > > what happens when the kernel client can't reconnect to the MDS right > > > > now. I'm not sure there's a practical way to deal with that besides > > > > some kind of computer admin intervention. > > > > > > Furthermore, there are often many applications using the mount (even > > > with containers) and it's not a sustainable position that any > > > network/client/cephfs hiccup requires a remount. Also, an application > > > > Well, it's not just any hiccup. It's one that lead to blacklisting... > > > > > that fails because of EIO is easy to deal with a layer above but a > > > remount usually requires grump admin intervention. > > > > I feel like I'm missing something here. Would figuring out $ID, > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the > > admin less grumpy, especially when containers are involved? > > > > Doing the force_reconnect thing would retain the mount point, but how > > much use would it be? Would using existing (i.e. pre-blacklist) file > > descriptors be allowed? I assumed it wouldn't be (permanent EIO or > > something of that sort), so maybe that is the piece I'm missing... > > > > I agree with Ilya here. I don't see how applications can just pick up > where they left off after being blacklisted. Remounting in some fashion > is really the only recourse here. > > To be clear, what happens to stateful objects (open files, byte-range > locks, etc.) in this scenario? Were you planning to just re-open files > and re-request locks that you held before being blacklisted? If so, that > sounds like a great way to cause some silent data corruption... The plan is: - files open for reading re-obtain caps and may continue to be used - files open for writing discard all dirty file blocks and return -EIO on further use (this could be configurable via a mount_option like with the ceph-fuse client) Not sure how best to handle locks and I'm open to suggestions. We could raise SIGLOST on those processes?
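To make the "handled a layer above" idea concrete, below is a minimal userspace sketch of an application reacting to the semantics proposed here, where a descriptor that was open for writing before the blacklisting starts failing with EIO after the reconnect. This is purely illustrative and not part of the patch: the path, the record format and the retry policy are invented for the example.

```c
/*
 * Illustrative only: how an application "a layer above" could handle the
 * proposed EIO-on-write behaviour after a forced reconnect.  The path and
 * record are made up; the assumed semantics are those discussed in this
 * thread, not something the kernel client implements today.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0)
			return -errno;
		buf += n;
		len -= n;
	}
	return 0;
}

int main(void)
{
	const char *path = "/mnt/cephfs/output.log";	/* illustrative path */
	const char *record = "work item 42 done\n";
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0)
		return 1;

	if (write_all(fd, record, strlen(record)) == -EIO) {
		/*
		 * The client was blacklisted and reconnected: dirty data on
		 * this descriptor was dropped and it now returns EIO.
		 * Recover by reopening and redoing the whole record from
		 * application state rather than trusting anything already
		 * "written".
		 */
		close(fd);
		fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
		if (fd < 0 || write_all(fd, record, strlen(record)) != 0) {
			fprintf(stderr, "giving up: %s\n", strerror(errno));
			return 1;
		}
	}
	close(fd);
	return 0;
}
```

The recovery decision (redo the record, resubmit the job, or simply exit and let a scheduler retry) stays with the application, which is the kind of handling described elsewhere in this thread.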
On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote: > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote: > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote: > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > > > all dirty state, how many applications would continue working without > > > > > > some kind of restart? And if you are restarting your application, why > > > > > > not get a new mount? > > > > > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > > > that much different from umount+mount? > > > > > > > > > > People don't like it when their filesystem refuses to umount, which is > > > > > what happens when the kernel client can't reconnect to the MDS right > > > > > now. I'm not sure there's a practical way to deal with that besides > > > > > some kind of computer admin intervention. > > > > > > > > Furthermore, there are often many applications using the mount (even > > > > with containers) and it's not a sustainable position that any > > > > network/client/cephfs hiccup requires a remount. Also, an application > > > > > > Well, it's not just any hiccup. It's one that lead to blacklisting... > > > > > > > that fails because of EIO is easy to deal with a layer above but a > > > > remount usually requires grump admin intervention. > > > > > > I feel like I'm missing something here. Would figuring out $ID, > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the > > > admin less grumpy, especially when containers are involved? > > > > > > Doing the force_reconnect thing would retain the mount point, but how > > > much use would it be? Would using existing (i.e. pre-blacklist) file > > > descriptors be allowed? I assumed it wouldn't be (permanent EIO or > > > something of that sort), so maybe that is the piece I'm missing... > > > > > > > I agree with Ilya here. I don't see how applications can just pick up > > where they left off after being blacklisted. Remounting in some fashion > > is really the only recourse here. > > > > To be clear, what happens to stateful objects (open files, byte-range > > locks, etc.) in this scenario? Were you planning to just re-open files > > and re-request locks that you held before being blacklisted? If so, that > > sounds like a great way to cause some silent data corruption... > > The plan is: > > - files open for reading re-obtain caps and may continue to be used > - files open for writing discard all dirty file blocks and return -EIO > on further use (this could be configurable via a mount_option like > with the ceph-fuse client) > That sounds fairly reasonable. > Not sure how best to handle locks and I'm open to suggestions. We > could raise SIGLOST on those processes? > Unfortunately, SIGLOST has never really been a thing on Linux. There was an attempt by Anna Schumaker a few years ago to implement it for use with NFS, but it never went in. We ended up with this patch, IIRC: https://patchwork.kernel.org/patch/10108419/ "The current practice is to set NFS_LOCK_LOST so that read/write returns EIO when a lock is lost. So, change these comments to code when sets NFS_LOCK_LOST." 
Maybe we should aim for similar behavior in this situation. It's a little trickier here since we don't really have an analogue to a lock stateid in ceph, so we'd need to implement this in some other way.
Apologies for having this discussion in two threads... On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@redhat.com> wrote: > > On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote: > > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote: > > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote: > > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > > > > all dirty state, how many applications would continue working without > > > > > > > some kind of restart? And if you are restarting your application, why > > > > > > > not get a new mount? > > > > > > > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > > > > that much different from umount+mount? > > > > > > > > > > > > People don't like it when their filesystem refuses to umount, which is > > > > > > what happens when the kernel client can't reconnect to the MDS right > > > > > > now. I'm not sure there's a practical way to deal with that besides > > > > > > some kind of computer admin intervention. > > > > > > > > > > Furthermore, there are often many applications using the mount (even > > > > > with containers) and it's not a sustainable position that any > > > > > network/client/cephfs hiccup requires a remount. Also, an application > > > > > > > > Well, it's not just any hiccup. It's one that lead to blacklisting... > > > > > > > > > that fails because of EIO is easy to deal with a layer above but a > > > > > remount usually requires grump admin intervention. > > > > > > > > I feel like I'm missing something here. Would figuring out $ID, > > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the > > > > admin less grumpy, especially when containers are involved? > > > > > > > > Doing the force_reconnect thing would retain the mount point, but how > > > > much use would it be? Would using existing (i.e. pre-blacklist) file > > > > descriptors be allowed? I assumed it wouldn't be (permanent EIO or > > > > something of that sort), so maybe that is the piece I'm missing... > > > > > > > > > > I agree with Ilya here. I don't see how applications can just pick up > > > where they left off after being blacklisted. Remounting in some fashion > > > is really the only recourse here. > > > > > > To be clear, what happens to stateful objects (open files, byte-range > > > locks, etc.) in this scenario? Were you planning to just re-open files > > > and re-request locks that you held before being blacklisted? If so, that > > > sounds like a great way to cause some silent data corruption... > > > > The plan is: > > > > - files open for reading re-obtain caps and may continue to be used > > - files open for writing discard all dirty file blocks and return -EIO > > on further use (this could be configurable via a mount_option like > > with the ceph-fuse client) > > > > That sounds fairly reasonable. > > > Not sure how best to handle locks and I'm open to suggestions. We > > could raise SIGLOST on those processes? > > > > Unfortunately, SIGLOST has never really been a thing on Linux. There was > an attempt by Anna Schumaker a few years ago to implement it for use > with NFS, but it never went in. 
Is there another signal we could reasonably use? > We ended up with this patch, IIRC: > > https://patchwork.kernel.org/patch/10108419/ > > "The current practice is to set NFS_LOCK_LOST so that read/write returns > EIO when a lock is lost. So, change these comments to code when sets > NFS_LOCK_LOST." > > Maybe we should aim for similar behavior in this situation. It's a > little tricker here since we don't really have an analogue to a lock > stateid in ceph, so we'd need to implement this in some other way. So effectively blacklist the process so all I/O is blocked on the mount? Do I understand correctly?
On Wed, 2019-06-05 at 16:18 -0700, Patrick Donnelly wrote: > Apologies for having this discussion in two threads... > > On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@redhat.com> wrote: > > On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote: > > > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote: > > > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote: > > > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote: > > > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > > > > > all dirty state, how many applications would continue working without > > > > > > > > some kind of restart? And if you are restarting your application, why > > > > > > > > not get a new mount? > > > > > > > > > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > > > > > that much different from umount+mount? > > > > > > > > > > > > > > People don't like it when their filesystem refuses to umount, which is > > > > > > > what happens when the kernel client can't reconnect to the MDS right > > > > > > > now. I'm not sure there's a practical way to deal with that besides > > > > > > > some kind of computer admin intervention. > > > > > > > > > > > > Furthermore, there are often many applications using the mount (even > > > > > > with containers) and it's not a sustainable position that any > > > > > > network/client/cephfs hiccup requires a remount. Also, an application > > > > > > > > > > Well, it's not just any hiccup. It's one that lead to blacklisting... > > > > > > > > > > > that fails because of EIO is easy to deal with a layer above but a > > > > > > remount usually requires grump admin intervention. > > > > > > > > > > I feel like I'm missing something here. Would figuring out $ID, > > > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the > > > > > admin less grumpy, especially when containers are involved? > > > > > > > > > > Doing the force_reconnect thing would retain the mount point, but how > > > > > much use would it be? Would using existing (i.e. pre-blacklist) file > > > > > descriptors be allowed? I assumed it wouldn't be (permanent EIO or > > > > > something of that sort), so maybe that is the piece I'm missing... > > > > > > > > > > > > > I agree with Ilya here. I don't see how applications can just pick up > > > > where they left off after being blacklisted. Remounting in some fashion > > > > is really the only recourse here. > > > > > > > > To be clear, what happens to stateful objects (open files, byte-range > > > > locks, etc.) in this scenario? Were you planning to just re-open files > > > > and re-request locks that you held before being blacklisted? If so, that > > > > sounds like a great way to cause some silent data corruption... > > > > > > The plan is: > > > > > > - files open for reading re-obtain caps and may continue to be used > > > - files open for writing discard all dirty file blocks and return -EIO > > > on further use (this could be configurable via a mount_option like > > > with the ceph-fuse client) > > > > > > > That sounds fairly reasonable. > > > > > Not sure how best to handle locks and I'm open to suggestions. We > > > could raise SIGLOST on those processes? 
> > > > > > > Unfortunately, SIGLOST has never really been a thing on Linux. There was > > an attempt by Anna Schumaker a few years ago to implement it for use > > with NFS, but it never went in. > > Is there another signal we could reasonably use? > Not really. The problem is really that SIGLOST is not even defined. In fact, if you look at the asm-generic/signal.h header: #define SIGIO 29 #define SIGPOLL SIGIO /* #define SIGLOST 29 */ So, there it is, commented out, and it shares a value with SIGIO. We could pick another value for it, of course, but then you'd have to get it into userland headers too. All of that sounds like a giant PITA. > > We ended up with this patch, IIRC: > > > > https://patchwork.kernel.org/patch/10108419/ > > > > "The current practice is to set NFS_LOCK_LOST so that read/write returns > > EIO when a lock is lost. So, change these comments to code when sets > > NFS_LOCK_LOST." > > > > Maybe we should aim for similar behavior in this situation. It's a > > little tricker here since we don't really have an analogue to a lock > > stateid in ceph, so we'd need to implement this in some other way. > > So effectively blacklist the process so all I/O is blocked on the > mount? Do I understand correctly? > No. I think in practice what we'd want to do is "invalidate" any file descriptions that were open before the blacklisting where locks were lost. Attempts to do reads or writes against those fd's would get back an error (EIO, most likely). File descriptions that didn't have any lost locks could carry on working as normal after reacquiring caps. We could also consider a module parameter or something to allow reclaim of lost locks too (in violation of continuity rules), like the recover_lost_locks parameter in nfs.ko.
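For comparison, here is a small compilable model of the NFS_LOCK_LOST-style behaviour described above: when the client reconnects, mark only the file descriptions that held locks, and fail further reads and writes on those with EIO while untouched descriptions keep working. Every name in it (file_desc, had_locks, lock_lost and the two helpers) is invented for this sketch and does not correspond to anything in fs/ceph or fs/nfs.

```c
#include <errno.h>
#include <stdbool.h>

/* One open file description; names are hypothetical, not ceph's. */
struct file_desc {
	bool had_locks;	/* held POSIX/flock locks before the blacklisting */
	bool lock_lost;	/* set while sessions are rebuilt after reconnect */
};

/* Called for every open description when the client reconnects. */
static void mark_after_reconnect(struct file_desc *fd)
{
	if (fd->had_locks)
		fd->lock_lost = true;	/* locks are not silently reclaimed */
}

/* Gate read/write: a description that lost its locks fails with EIO. */
static int check_io_allowed(const struct file_desc *fd)
{
	return fd->lock_lost ? -EIO : 0;
}

int main(void)
{
	struct file_desc locked = { .had_locks = true };
	struct file_desc plain = { 0 };

	mark_after_reconnect(&locked);
	mark_after_reconnect(&plain);

	/* the locked description now fails; the plain one keeps working */
	return (check_io_allowed(&locked) == -EIO &&
		check_io_allowed(&plain) == 0) ? 0 : 1;
}
```

A real implementation would have to set such a flag under the appropriate locking while sessions are re-established, and decide whether an opt-in similar to nfs.ko's recover_lost_locks should be allowed to bypass it.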
On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote: > > On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > Can we also discuss how useful is allowing to recover a mount after it > > > > has been blacklisted? After we fail everything with EIO and throw out > > > > all dirty state, how many applications would continue working without > > > > some kind of restart? And if you are restarting your application, why > > > > not get a new mount? > > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't > > > > that much different from umount+mount? > > > > > > People don't like it when their filesystem refuses to umount, which is > > > what happens when the kernel client can't reconnect to the MDS right > > > now. I'm not sure there's a practical way to deal with that besides > > > some kind of computer admin intervention. (Even if you umount -l, that > > > by design doesn't reply to syscalls and let the applications exit.) > > > > Well, that is what I'm saying: if an admin intervention is required > > anyway, then why not make it be umount+mount? That is certainly more > > intuitive than an obscure write-only file in debugfs... > > > > I think 'umount -f' + 'mount -o remount' is better than the debugfs file A small bit of user input: for some of the places we'd like to use CephFS we value availability over consistency. For example, in a large batch processing farm, it is really inconvenient (and expensive in lost CPU-hours) if an operator needs to repair thousands of mounts when cephfs breaks (e.g. an mds crash or whatever). It is preferential to let the apps crash, drop caches, fh's, whatever else is necessary, and create a new session to the cluster with the same mount. In this use-case, it doesn't matter if the files were inconsistent, because a higher-level job scheduler will retry the job from scratch somewhere else with new output files. It would be nice if there was a mount option to allow users to choose this mode (-o soft, for example). Without a mount option, we're forced to run ugly cron jobs which look for hung mounts and do the necessary. My 2c, dan > > > > We have umount -f, which is there for tearing down a mount that is > > unresponsive. It should be able to deal with a blacklisted mount, if > > it can't it's probably a bug. > > > > Thanks, > > > > Ilya
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index a14d64664878..d65da57406bd 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -210,6 +210,31 @@ CEPH_DEFINE_SHOW_FUNC(mdsc_show)
 CEPH_DEFINE_SHOW_FUNC(caps_show)
 CEPH_DEFINE_SHOW_FUNC(mds_sessions_show)
 
+static ssize_t control_file_write(struct file *file,
+				  const char __user *ubuf,
+				  size_t count, loff_t *ppos)
+{
+	struct ceph_fs_client *fsc = file_inode(file)->i_private;
+	char buf[16];
+	ssize_t len;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, ubuf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (!strcmp(buf, "force_reconnect")) {
+		ceph_mdsc_force_reconnect(fsc->mdsc);
+	} else {
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static const struct file_operations control_file_fops = {
+	.write = control_file_write,
+};
 
 /*
  * debugfs
@@ -233,7 +258,6 @@ static int congestion_kb_get(void *data, u64 *val)
 DEFINE_SIMPLE_ATTRIBUTE(congestion_kb_fops, congestion_kb_get,
 			congestion_kb_set, "%llu\n");
 
-
 void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 {
 	dout("ceph_fs_debugfs_cleanup\n");
@@ -243,6 +267,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_mds_sessions);
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove(fsc->debugfs_control);
 }
 
 int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
@@ -302,6 +327,13 @@ int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 	if (!fsc->debugfs_caps)
 		goto out;
+	fsc->debugfs_control = debugfs_create_file("control",
+						   0200,
+						   fsc->client->debugfs_dir,
+						   fsc,
+						   &control_file_fops);
+	if (!fsc->debugfs_control)
+		goto out;
 
 	return 0;
 
 out:
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f5c3499fdec6..95ee893205c5 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2631,7 +2631,7 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
 		if (req->r_attempts > 0)
 			continue; /* only new requests */
 		if (req->r_session &&
-		    req->r_session->s_mds == mds) {
+		    (mds == -1 || req->r_session->s_mds == mds)) {
 			dout(" kicking tid %llu\n", req->r_tid);
 			list_del_init(&req->r_wait);
 			__do_request(mdsc, req);
@@ -4371,6 +4371,45 @@ void ceph_mdsc_force_umount(struct ceph_mds_client *mdsc)
 	mutex_unlock(&mdsc->mutex);
 }
 
+void ceph_mdsc_force_reconnect(struct ceph_mds_client *mdsc)
+{
+	struct ceph_mds_session *session;
+	int mds;
+	LIST_HEAD(to_wake);
+
+	pr_info("force reconnect\n");
+
+	/* this also reset add mon/osd conntions */
+	ceph_reset_client_addr(mdsc->fsc->client);
+
+	mutex_lock(&mdsc->mutex);
+
+	/* reset mds connections */
+	for (mds = 0; mds < mdsc->max_sessions; mds++) {
+		session = __ceph_lookup_mds_session(mdsc, mds);
+		if (!session)
+			continue;
+
+		__unregister_session(mdsc, session);
+		list_splice_init(&session->s_waiting, &to_wake);
+		mutex_unlock(&mdsc->mutex);
+
+		mutex_lock(&session->s_mutex);
+		cleanup_session_requests(mdsc, session);
+		remove_session_caps(session);
+		mutex_unlock(&session->s_mutex);
+
+		ceph_put_mds_session(session);
+		mutex_lock(&mdsc->mutex);
+	}
+
+	list_splice_init(&mdsc->waiting_for_map, &to_wake);
+	__wake_requests(mdsc, &to_wake);
+	kick_requests(mdsc, -1);
+
+	mutex_unlock(&mdsc->mutex);
+}
+
 static void ceph_mdsc_stop(struct ceph_mds_client *mdsc)
 {
 	dout("stop\n");
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 330769ecb601..125e26895f14 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -457,6 +457,7 @@ extern int ceph_send_msg_mds(struct ceph_mds_client *mdsc,
 extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
 extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
 extern void ceph_mdsc_force_umount(struct ceph_mds_client *mdsc);
+extern void ceph_mdsc_force_reconnect(struct ceph_mds_client *mdsc);
 extern void ceph_mdsc_destroy(struct ceph_fs_client *fsc);
 
 extern void ceph_mdsc_sync(struct ceph_mds_client *mdsc);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9c82d213a5ab..9ccb6e031988 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -118,6 +118,7 @@ struct ceph_fs_client {
 	struct dentry *debugfs_bdi;
 	struct dentry *debugfs_mdsc, *debugfs_mdsmap;
 	struct dentry *debugfs_mds_sessions;
+	struct dentry *debugfs_control;
 #endif
 
 #ifdef CONFIG_CEPH_FSCACHE
echo force_reconnect > /sys/kernel/debug/ceph/xxx/control

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/debugfs.c    | 34 +++++++++++++++++++++++++++++++++-
 fs/ceph/mds_client.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 fs/ceph/mds_client.h |  1 +
 fs/ceph/super.h      |  1 +
 4 files changed, 75 insertions(+), 2 deletions(-)