Message ID | 1427908760-7083-1-git-send-email-idryomov@gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
> cephfs. However it turned out to not play nice with loopback scenario,
> leading to lockups with a full socket send-q and empty recv-q.
>
> While we always advised against colocating kernel client and ceph
> servers on the same box, a few people are doing it and it's also useful
> for light development testing, so rather than reverting make sure to
> not set those flags in the loopback case.
>

This does not clarify why the non-loopback case needs access to pfmemalloc
reserves. Granted, I've spent zero time on this but it's really unclear
what problem was originally tried to be solved and why dirty page limiting
was insufficient. Swap over NFS was always a very special case minimally
because it's immune to dirty page throttling.
On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
>> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
>> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
>> cephfs. However it turned out to not play nice with loopback scenario,
>> leading to lockups with a full socket send-q and empty recv-q.
>>
>> While we always advised against colocating kernel client and ceph
>> servers on the same box, a few people are doing it and it's also useful
>> for light development testing, so rather than reverting make sure to
>> not set those flags in the loopback case.
>>
>
> This does not clarify why the non-loopback case needs access to pfmemalloc
> reserves. Granted, I've spent zero time on this but it's really unclear
> what problem was originally tried to be solved and why dirty page limiting
> was insufficient. Swap over NFS was always a very special case minimally
> because it's immune to dirty page throttling.

I don't think there was any particular problem tried to be solved,
certainly not one we hit and fixed with 89baaa570ab0. Mike is out this
week, but I'm pretty sure he said he copied this for iscsi from nbd
because you nudged him to (and you yourself did this for nbd as part of
swap-over-NFS series). And then one day when I tracked down a lockup
caused by the fact that ceph workqueues didn't have a WQ_MEM_RECLAIM tag
he remembered his SOCK_MEMALLOC/PF_MEMALLOC iscsi patch and copied it
for rbd/cephfs. As I mentioned in the previous thread [1], because rbd
is very similar to nbd, it seemed like a step in the right direction...

We didn't get a clear answer from you in [1]. If this is the wrong thing
to do for network block devices then we should yank it universally (nbd,
iscsi, libceph). If not, this patch simply tries to keep ceph loopback
scenario alive, for toy setups and development testing, if nothing else.

[1] http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708

Thanks,

                Ilya
On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> >> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
> >> cephfs. However it turned out to not play nice with loopback scenario,
> >> leading to lockups with a full socket send-q and empty recv-q.
> >>
> >> While we always advised against colocating kernel client and ceph
> >> servers on the same box, a few people are doing it and it's also useful
> >> for light development testing, so rather than reverting make sure to
> >> not set those flags in the loopback case.
> >>
> >
> > This does not clarify why the non-loopback case needs access to pfmemalloc
> > reserves. Granted, I've spent zero time on this but it's really unclear
> > what problem was originally tried to be solved and why dirty page limiting
> > was insufficient. Swap over NFS was always a very special case minimally
> > because it's immune to dirty page throttling.
>
> I don't think there was any particular problem tried to be solved,

Then please go back and look at why dirty page limiting is insufficient
for ceph.

> certainly not one we hit and fixed with 89baaa570ab0. Mike is out this
> week, but I'm pretty sure he said he copied this for iscsi from nbd
> because you nudged him to (and you yourself did this for nbd as part of
> swap-over-NFS series).

In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
stated that if ceph insisted on using nbd as justification for ceph
using __GFP_MEMALLOC that it was preferred that nbd be broken instead. In
commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind was
the swap-over-nbd case and I regret I didn't have userspace explicitly
tell the kernel that NBD was being used as a swap device.
On Thu, Apr 2, 2015 at 8:41 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
>> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
>> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
>> >> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
>> >> cephfs. However it turned out to not play nice with loopback scenario,
>> >> leading to lockups with a full socket send-q and empty recv-q.
>> >>
>> >> While we always advised against colocating kernel client and ceph
>> >> servers on the same box, a few people are doing it and it's also useful
>> >> for light development testing, so rather than reverting make sure to
>> >> not set those flags in the loopback case.
>> >>
>> >
>> > This does not clarify why the non-loopback case needs access to pfmemalloc
>> > reserves. Granted, I've spent zero time on this but it's really unclear
>> > what problem was originally tried to be solved and why dirty page limiting
>> > was insufficient. Swap over NFS was always a very special case minimally
>> > because it's immune to dirty page throttling.
>>
>> I don't think there was any particular problem tried to be solved,
>
> Then please go back and look at why dirty page limiting is insufficient
> for ceph.
>
>> certainly not one we hit and fixed with 89baaa570ab0. Mike is out this
>> week, but I'm pretty sure he said he copied this for iscsi from nbd
>> because you nudged him to (and you yourself did this for nbd as part of
>> swap-over-NFS series).
>
> In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
> stated that if ceph insisted on using nbd as justification for ceph
> using __GFP_MEMALLOC that it was preferred that nbd be broken instead. In
> commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind was
> the swap-over-nbd case and I regret I didn't have userspace explicitly
> tell the kernel that NBD was being used as a swap device.

OK, it all starts to make sense now. So ideally nbd would only use
__GFP_MEMALLOC if nbd-client was invoked with -swap - you just didn't
implement that. I guess I should have gone deeper into the history of
your nbd patch when Mike cited it as a reason he did this for ceph.

I think ceph is fine with dirty page limiting in general, so it's only
if we wanted to support swap-over-rbd (cephfs is a bit of a weak link
currently, so I'm not going there) would we need to enable
SOCK_MEMALLOC/PF_MEMALLOC and only for that ceph_client instance.
Sounds like that will require a "swap" libceph option, which will also
implicitly enable "noshare" to make sure __GFP_MEMALLOC ceph_client is
not shared with anything else - luckily we don't have a userspace
process a la nbd-client we need to worry about.

Thanks,

                Ilya
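A minimal sketch of the idea being floated here, purely for illustration:
nothing like this exists in libceph at this point, and the msgr->swap flag
and helper name below are made up. The point is only that sk_set_memalloc()
would move behind a per-client swap opt-in (plus the loopback check this
patch introduces) instead of being called unconditionally for every
connection.

/*
 * Hypothetical sketch only -- con->msgr->swap does not exist in libceph;
 * it stands in for a per-client "swap" option parsed at map/mount time.
 * The idea: tag the socket with SOCK_MEMALLOC only when the ceph_client
 * backs a swap device (mirroring nbd-client -swap and swap-over-NFS),
 * and never when the peer is reached over loopback.
 */
static void ceph_tcp_maybe_set_memalloc(struct ceph_connection *con,
					struct socket *sock)
{
	/* assumed per-messenger flag, set from a "swap" libceph option */
	if (!con->msgr->swap)
		return;

	/* see CON_FLAG_LOCAL in the patch below */
	if (con_flag_test(con, CON_FLAG_LOCAL))
		return;

	/* sk_set_memalloc() also ORs __GFP_MEMALLOC into sk->sk_allocation */
	sk_set_memalloc(sock->sk);
}

Coupling such an option with "noshare", as suggested above, would keep the
__GFP_MEMALLOC-tagged ceph_client from being reused by ordinary mounts.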
On Thu, Apr 02, 2015 at 11:35:35AM +0300, Ilya Dryomov wrote:
> On Thu, Apr 2, 2015 at 8:41 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
> >> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> >> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> >> >> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
> >> >> cephfs. However it turned out to not play nice with loopback scenario,
> >> >> leading to lockups with a full socket send-q and empty recv-q.
> >> >>
> >> >> While we always advised against colocating kernel client and ceph
> >> >> servers on the same box, a few people are doing it and it's also useful
> >> >> for light development testing, so rather than reverting make sure to
> >> >> not set those flags in the loopback case.
> >> >>
> >> >
> >> > This does not clarify why the non-loopback case needs access to pfmemalloc
> >> > reserves. Granted, I've spent zero time on this but it's really unclear
> >> > what problem was originally tried to be solved and why dirty page limiting
> >> > was insufficient. Swap over NFS was always a very special case minimally
> >> > because it's immune to dirty page throttling.
> >>
> >> I don't think there was any particular problem tried to be solved,
> >
> > Then please go back and look at why dirty page limiting is insufficient
> > for ceph.
> >
> >> certainly not one we hit and fixed with 89baaa570ab0. Mike is out this
> >> week, but I'm pretty sure he said he copied this for iscsi from nbd
> >> because you nudged him to (and you yourself did this for nbd as part of
> >> swap-over-NFS series).
> >
> > In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
> > stated that if ceph insisted on using nbd as justification for ceph
> > using __GFP_MEMALLOC that it was preferred that nbd be broken instead. In
> > commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind was
> > the swap-over-nbd case and I regret I didn't have userspace explicitly
> > tell the kernel that NBD was being used as a swap device.
>
> OK, it all starts to make sense now. So ideally nbd would only use
> __GFP_MEMALLOC if nbd-client was invoked with -swap - you just didn't
> implement that.

Yes.

> I think ceph is fine with dirty page limiting in general,

Then I suggest removing ceph's usage of __GFP_MEMALLOC until there is a
genuine problem that dirty page limiting is unable to handle. Dirty page
limiting might stall in some cases but worst case for __GFP_MEMALLOC
abuse is a livelocked machine.

> so it's only
> if we wanted to support swap-over-rbd (cephfs is a bit of a weak link
> currently, so I'm not going there) would we need to enable
> SOCK_MEMALLOC/PF_MEMALLOC and only for that ceph_client instance.

Yes.

> Sounds like that will require a "swap" libceph option, which will also
> implicitly enable "noshare" to make sure __GFP_MEMALLOC ceph_client is
> not shared with anything else - luckily we don't have a userspace
> process a la nbd-client we need to worry about.
>

I'm not familiar enough with the ins and outs of rbd to know what sort
of implementation hazards might be encountered.
On 04/02/2015 12:41 AM, Mel Gorman wrote:
> On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
>> > On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
>>> > > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
>>>> > >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
>>>> > >> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
>>>> > >> cephfs. However it turned out to not play nice with loopback scenario,
>>>> > >> leading to lockups with a full socket send-q and empty recv-q.
>>>> > >>
>>>> > >> While we always advised against colocating kernel client and ceph
>>>> > >> servers on the same box, a few people are doing it and it's also useful
>>>> > >> for light development testing, so rather than reverting make sure to
>>>> > >> not set those flags in the loopback case.
>>>> > >>
>>> > >
>>> > > This does not clarify why the non-loopback case needs access to pfmemalloc
>>> > > reserves. Granted, I've spent zero time on this but it's really unclear
>>> > > what problem was originally tried to be solved and why dirty page limiting
>>> > > was insufficient. Swap over NFS was always a very special case minimally
>>> > > because it's immune to dirty page throttling.
>> >
>> > I don't think there was any particular problem tried to be solved,
> Then please go back and look at why dirty page limiting is insufficient
> for ceph.
>

The problem I was trying to solve is just the basic one where block
drivers have in the past been required to be able to make forward
progress on a write. With iscsi under heavy IO and memory use loads, we
will see memory allocation failures from the network layer followed by
hard system lock ups. The block layer and its drivers like scsi does not
make any distinction between swap and non swap disks to handle this
problem. It will always just work when the network is not involved. I
thought we did not special case swap, because there were cases where
there may not be swappable pages, and the mm layer then needs to write
out pages to other non-swap disks to be able to free up memory.

In the block layer and scsi drivers like qla2xxx forward progress is
easier to handle. They just use bio, request, scsi_cmnd, scatterlist,
etc mempools and internally preallocate some resources they need. For
iscsi and other block drivers that use the network, it is more difficult
as you of course know, and when I did the iscsi and rbd/ceph patches I
had thought we were supposed to be using the memalloc related flags to
handle this problem for both swap and non swap cases. I might have
misunderstood you way back when I did those patches originally.

For dirty page limiting, I thought the problem is that it is difficult
to get right and at the same time not affect performance for some
workloads. For non-net block drivers, we do not have to configure it
just to handle this problem. It just works, and so I thought we have
been trying to solve this problem in a similar way as the rest of the
block layer by having some memory reserves.

Also on a related note, I thought I heard at LSF that that forward
progress requirement for non swap writes was going away. Is that true
and is it something that is going to happen in the near future or was it
more of a wish list item.
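The "internally preallocate some resources" pattern referred to here is
the generic mempool idiom, sketched below with made-up names (my_cmd,
MY_CMD_RESERVE) rather than code from any existing driver: a fixed reserve
of objects is created at init time, and allocations on the write-out path
fall back to that reserve instead of failing under memory pressure.

/*
 * Generic illustration of the mempool-based forward-progress guarantee.
 * All names are invented for the example.
 */
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

#define MY_CMD_RESERVE	16	/* minimum number of commands always available */

struct my_cmd {
	struct list_head list;
	void *payload;
};

static struct kmem_cache *my_cmd_cache;
static mempool_t *my_cmd_pool;

static int __init my_driver_init(void)
{
	my_cmd_cache = KMEM_CACHE(my_cmd, 0);
	if (!my_cmd_cache)
		return -ENOMEM;

	/*
	 * The pool preallocates MY_CMD_RESERVE objects; mempool_alloc() can
	 * dip into them when the slab allocator fails under memory pressure.
	 */
	my_cmd_pool = mempool_create_slab_pool(MY_CMD_RESERVE, my_cmd_cache);
	if (!my_cmd_pool) {
		kmem_cache_destroy(my_cmd_cache);
		return -ENOMEM;
	}
	return 0;
}

/* Write-out path: may sleep, but will eventually get an object back. */
static struct my_cmd *my_cmd_get(void)
{
	return mempool_alloc(my_cmd_pool, GFP_NOIO);
}

static void my_cmd_put(struct my_cmd *cmd)
{
	mempool_free(cmd, my_cmd_pool);
}

The catch, and the reason this thread exists, is that a mempool only covers
objects the driver itself allocates; the skbs and pages the network layer
allocates underneath iscsi/libceph come from the general allocator, which
is where SOCK_MEMALLOC/__GFP_MEMALLOC entered the picture.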
On Fri, Apr 03, 2015 at 03:03:53PM -0500, Mike Christie wrote:
> On 04/02/2015 12:41 AM, Mel Gorman wrote:
> > On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
> >> > On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> >>> > > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> >>>> > >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> >>>> > >> flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
> >>>> > >> cephfs. However it turned out to not play nice with loopback scenario,
> >>>> > >> leading to lockups with a full socket send-q and empty recv-q.
> >>>> > >>
> >>>> > >> While we always advised against colocating kernel client and ceph
> >>>> > >> servers on the same box, a few people are doing it and it's also useful
> >>>> > >> for light development testing, so rather than reverting make sure to
> >>>> > >> not set those flags in the loopback case.
> >>>> > >>
> >>> > >
> >>> > > This does not clarify why the non-loopback case needs access to pfmemalloc
> >>> > > reserves. Granted, I've spent zero time on this but it's really unclear
> >>> > > what problem was originally tried to be solved and why dirty page limiting
> >>> > > was insufficient. Swap over NFS was always a very special case minimally
> >>> > > because it's immune to dirty page throttling.
> >> >
> >> > I don't think there was any particular problem tried to be solved,
> > Then please go back and look at why dirty page limiting is insufficient
> > for ceph.
> >
>
> The problem I was trying to solve is just the basic one where block
> drivers have in the past been required to be able to make forward
> progress on a write. With iscsi under heavy IO and memory use loads, we
> will see memory allocation failures from the network layer followed by
> hard system lock ups.

Why was it unable to discard clean file pages or swap anonymous pages to
local disk to ensure forward progress? Are you swapping over ISCSI that
requires network transmits? If so, then you may need to do something
similar to swap-over-NFS when the network is involved. Enabling
pfmemalloc reserves for all communications is not the answer as it's
trading one set of problems for another -- specifically emergency
reserves will be used in cases where emergency reserves are not required
with the risk of the machine locking up.

> The block layer and its drivers like scsi does not
> make any distinction between swap and non swap disks to handle this
> problem.

Can they be identified like swap-over-NFS is? If not, why not?

> It will always just work when the network is not involved.

Your other option is to fail to add swap if writing to it requires
network buffers. The configuration is a hand-grenade and better to mark
it as unsupported until such time as it is properly handled.

> I
> thought we did not special case swap, because there were cases where
> there may not be swappable pages, and the mm layer then needs to write
> out pages to other non-swap disks to be able to free up memory.
>
> In the block layer and scsi drivers like qla2xxx forward progress is
> easier to handle. They just use bio, request, scsi_cmnd, scatterlist,
> etc mempools and internally preallocate some resources they need. For
> iscsi and other block drivers that use the network, it is more difficult
> as you of course know, and when I did the iscsi and rbd/ceph patches I
> had thought we were supposed to be using the memalloc related flags to
> handle this problem for both swap and non swap cases. I might have
> misunderstood you way back when I did those patches originally.
>

Only the swap case should use emergency reserves like this. File-backed
cases should discard clean file pages and depend on the dirty page
limiting to ensure enough clean pages are free. There is a corner case
where anonymous memory uses 100-dirty_ratio of memory and all other
memory is dirty so it is still necessary to have a local swap device to
avoid this specific case.

> For dirty page limiting, I thought the problem is that it is difficult
> to get right and at the same time not affect performance for some
> workloads.

There can be stalls as a result of this. Digging into the reserves until
the machine locks up does not avoid them. At best, it simply moves when
the stall occurs because the allocation is fine but other users must
wait on kswapd to make progress or direct reclaim to restore the
emergency reserves.

> For non-net block drivers, we do not have to configure it
> just to handle this problem. It just works, and so I thought we have
> been trying to solve this problem in a similar way as the rest of the
> block layer by having some memory reserves.
>

That reserve is not the allocator emergency reserves.

> Also on a related note, I thought I heard at LSF that that forward
> progress requirement for non swap writes was going away. Is that true
> and is it something that is going to happen in the near future or was it
> more of a wish list item.

At best, that was a wish-list item. There was some discussion stating it
would be nice if reserves would be guaranteed to exist on a per-subsystem
basis to guarantee forward progress but there is no commitment to
actually implement it. Even if it did exist, then it would not be
consumed via __GFP_MEMALLOC.
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 6b3f54ed65ba..9fa2cce71164 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -101,6 +101,7 @@
 #define CON_FLAG_WRITE_PENDING	   2  /* we have data ready to send */
 #define CON_FLAG_SOCK_CLOSED	   3  /* socket state changed to closed */
 #define CON_FLAG_BACKOFF           4  /* need to retry queuing delayed work */
+#define CON_FLAG_LOCAL             5  /* using loopback interface */
 
 static bool con_flag_valid(unsigned long con_flag)
 {
@@ -110,6 +111,7 @@ static bool con_flag_valid(unsigned long con_flag)
 	case CON_FLAG_WRITE_PENDING:
 	case CON_FLAG_SOCK_CLOSED:
 	case CON_FLAG_BACKOFF:
+	case CON_FLAG_LOCAL:
 		return true;
 	default:
 		return false;
@@ -470,6 +472,18 @@ static void set_sock_callbacks(struct socket *sock,
  * socket helpers
  */
 
+static bool sk_is_loopback(struct sock *sk)
+{
+	struct dst_entry *dst = sk_dst_get(sk);
+	bool ret = false;
+
+	if (dst) {
+		ret = dst->dev && (dst->dev->flags & IFF_LOOPBACK);
+		dst_release(dst);
+	}
+	return ret;
+}
+
 /*
  * initiate connection to a remote socket.
  */
@@ -484,7 +498,7 @@ static int ceph_tcp_connect(struct ceph_connection *con)
 			       IPPROTO_TCP, &sock);
 	if (ret)
 		return ret;
-	sock->sk->sk_allocation = GFP_NOFS | __GFP_MEMALLOC;
+	sock->sk->sk_allocation = GFP_NOFS;
 
 #ifdef CONFIG_LOCKDEP
 	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
@@ -510,6 +524,11 @@ static int ceph_tcp_connect(struct ceph_connection *con)
 		return ret;
 	}
 
+	if (sk_is_loopback(sock->sk))
+		con_flag_set(con, CON_FLAG_LOCAL);
+	else
+		con_flag_clear(con, CON_FLAG_LOCAL);
+
 	if (con->msgr->tcp_nodelay) {
 		int optval = 1;
 
@@ -520,7 +539,18 @@ static int ceph_tcp_connect(struct ceph_connection *con)
 			       ret);
 	}
 
-	sk_set_memalloc(sock->sk);
+	/*
+	 * Tagging with SOCK_MEMALLOC / setting PF_MEMALLOC may lead to
+	 * lockups if our peer is on the same host (communicating via
+	 * loopback) due to sk_filter() mercilessly dropping pfmemalloc
+	 * skbs on the receiving side - receiving loopback socket is
+	 * not going to be tagged with SOCK_MEMALLOC.  See:
+	 *
+	 * - http://article.gmane.org/gmane.linux.kernel/1418791
+	 * - http://article.gmane.org/gmane.linux.kernel.stable/46128
+	 */
+	if (!con_flag_test(con, CON_FLAG_LOCAL))
+		sk_set_memalloc(sock->sk);
 
 	con->sock = sock;
 	return 0;
@@ -2811,7 +2841,11 @@ static void con_work(struct work_struct *work)
 	unsigned long pflags = current->flags;
 	bool fault;
 
-	current->flags |= PF_MEMALLOC;
+	/*
+	 * See SOCK_MEMALLOC comment in ceph_tcp_connect().
+	 */
+	if (!con_flag_test(con, CON_FLAG_LOCAL))
+		current->flags |= PF_MEMALLOC;
 
 	mutex_lock(&con->mutex);
 	while (true) {
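For context on the sk_filter() comment in the hunk above, the core
networking helpers involved look roughly like this at the time (abridged
from net/core/sock.c and net/core/filter.c; consult the tree for the exact
code): a SOCK_MEMALLOC sender may allocate skbs from the pfmemalloc
reserves, but an untagged receiver -- which is what the loopback peer's
socket is -- drops those skbs in sk_filter(), which is how the send-q
fills up while the recv-q stays empty.

/* Abridged for illustration -- see net/core/sock.c and net/core/filter.c. */

void sk_set_memalloc(struct sock *sk)
{
	sock_set_flag(sk, SOCK_MEMALLOC);	/* receiver may accept pfmemalloc skbs */
	sk->sk_allocation |= __GFP_MEMALLOC;	/* sender may allocate from reserves */
	static_key_slow_inc(&memalloc_socks);
}

int sk_filter(struct sock *sk, struct sk_buff *skb)
{
	/*
	 * If the skb was allocated from pfmemalloc reserves, only
	 * allow SOCK_MEMALLOC sockets to use it as this socket is
	 * helping free memory.
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;

	/* ... normal socket filter processing follows ... */
	return 0;
}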
Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
flags for net IO") set SOCK_MEMALLOC and PF_MEMALLOC flags for rbd and
cephfs. However it turned out to not play nice with loopback scenario,
leading to lockups with a full socket send-q and empty recv-q.

While we always advised against colocating kernel client and ceph
servers on the same box, a few people are doing it and it's also useful
for light development testing, so rather than reverting make sure to
not set those flags in the loopback case.

Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sage Weil <sage@redhat.com>
Cc: stable@vger.kernel.org # 3.18+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 net/ceph/messenger.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)