Message ID | ceaf7a4b035e78cdbdde4e9a4ab71ba61a5e5457.1671194454.git.bcodding@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 98123866fcf3fe95a0c1b198ef122dfdbd351916 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Stop corrupting socket's task_frag | expand |
On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote: > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the > GFP_NOIO flag on sk_allocation which the networking system uses to decide > when it is safe to use current->task_frag. The results of this are > unexpected corruption in task_frag when SUNRPC is involved in memory > reclaim. > > The corruption can be seen in crashes, but the root cause is often > difficult to ascertain as a crashing machine's stack trace will have no > evidence of being near NFS or SUNRPC code. I believe this problem to > be much more pervasive than reports to the community may indicate. > > Fix this by having kernel users of sockets that may corrupt task_frag due > to reclaim set sk_use_task_frag = false. Preemptively correcting this > situation for users that still set sk_allocation allows them to convert to > memalloc_nofs_save/restore without the same unexpected corruptions that are > sure to follow, unlikely to show up in testing, and difficult to bisect. > I am back from PTO. It seems inet_ctl_sock_create() has been forgotten. Without following fix, ICMP messages sent from softirq would corrupt innocent thread task_frag. (I will submit this patch formally a bit later today) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, if (rc == 0) { *sk = sock->sk; (*sk)->sk_allocation = GFP_ATOMIC; + (*sk)->sk_use_task_frag = false; /* * Unhash it so that IP input processing does not even see it, * we do not wish this socket to see incoming packets.
On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote: > On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote: > > > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the > > GFP_NOIO flag on sk_allocation which the networking system uses to decide > > when it is safe to use current->task_frag. The results of this are > > unexpected corruption in task_frag when SUNRPC is involved in memory > > reclaim. > > > > The corruption can be seen in crashes, but the root cause is often > > difficult to ascertain as a crashing machine's stack trace will have no > > evidence of being near NFS or SUNRPC code. I believe this problem to > > be much more pervasive than reports to the community may indicate. > > > > Fix this by having kernel users of sockets that may corrupt task_frag due > > to reclaim set sk_use_task_frag = false. Preemptively correcting this > > situation for users that still set sk_allocation allows them to convert to > > memalloc_nofs_save/restore without the same unexpected corruptions that are > > sure to follow, unlikely to show up in testing, and difficult to bisect. > > > > I am back from PTO. > > It seems inet_ctl_sock_create() has been forgotten. > > Without following fix, ICMP messages sent from softirq would corrupt > innocent thread task_frag. I didn't consider setting ->sk_use_task_frag on ICMP sockets as my understanding was that only TCP and ip_append_data() could eventually call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be affected. Did I miss something? > (I will submit this patch formally a bit later today) > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca > 100644 > --- a/net/ipv4/af_inet.c > +++ b/net/ipv4/af_inet.c > @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk, > unsigned short family, > if (rc == 0) { > *sk = sock->sk; > (*sk)->sk_allocation = GFP_ATOMIC; > + (*sk)->sk_use_task_frag = false; > /* > * Unhash it so that IP input processing does not even see it, > * we do not wish this socket to see incoming packets. >
On Tue, Jan 3, 2023 at 4:14 PM Guillaume Nault <gnault@redhat.com> wrote: > > On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote: > > On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote: > > > > > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the > > > GFP_NOIO flag on sk_allocation which the networking system uses to decide > > > when it is safe to use current->task_frag. The results of this are > > > unexpected corruption in task_frag when SUNRPC is involved in memory > > > reclaim. > > > > > > The corruption can be seen in crashes, but the root cause is often > > > difficult to ascertain as a crashing machine's stack trace will have no > > > evidence of being near NFS or SUNRPC code. I believe this problem to > > > be much more pervasive than reports to the community may indicate. > > > > > > Fix this by having kernel users of sockets that may corrupt task_frag due > > > to reclaim set sk_use_task_frag = false. Preemptively correcting this > > > situation for users that still set sk_allocation allows them to convert to > > > memalloc_nofs_save/restore without the same unexpected corruptions that are > > > sure to follow, unlikely to show up in testing, and difficult to bisect. > > > > > > > I am back from PTO. > > > > It seems inet_ctl_sock_create() has been forgotten. > > > > Without following fix, ICMP messages sent from softirq would corrupt > > innocent thread task_frag. > > I didn't consider setting ->sk_use_task_frag on ICMP sockets as my > understanding was that only TCP and ip_append_data() could eventually > call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be > affected. Did I miss something? net/ipv4/ping.c ICMP uses per-cpu sockets (look in icmp_init(), icmp_xmit_lock()...) icmp_rcv() -> icmp_echo() -> icmp_reply() -> icmp_push_reply() -> ip_append_data() -> sk_page_frag_refill() > > > (I will submit this patch formally a bit later today) > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > > index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca > > 100644 > > --- a/net/ipv4/af_inet.c > > +++ b/net/ipv4/af_inet.c > > @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk, > > unsigned short family, > > if (rc == 0) { > > *sk = sock->sk; > > (*sk)->sk_allocation = GFP_ATOMIC; > > + (*sk)->sk_use_task_frag = false; > > /* > > * Unhash it so that IP input processing does not even see it, > > * we do not wish this socket to see incoming packets. > > >
On Tue, Jan 03, 2023 at 05:10:24PM +0100, Eric Dumazet wrote: > On Tue, Jan 3, 2023 at 4:14 PM Guillaume Nault <gnault@redhat.com> wrote: > > > > On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote: > > > On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote: > > > > > > > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the > > > > GFP_NOIO flag on sk_allocation which the networking system uses to decide > > > > when it is safe to use current->task_frag. The results of this are > > > > unexpected corruption in task_frag when SUNRPC is involved in memory > > > > reclaim. > > > > > > > > The corruption can be seen in crashes, but the root cause is often > > > > difficult to ascertain as a crashing machine's stack trace will have no > > > > evidence of being near NFS or SUNRPC code. I believe this problem to > > > > be much more pervasive than reports to the community may indicate. > > > > > > > > Fix this by having kernel users of sockets that may corrupt task_frag due > > > > to reclaim set sk_use_task_frag = false. Preemptively correcting this > > > > situation for users that still set sk_allocation allows them to convert to > > > > memalloc_nofs_save/restore without the same unexpected corruptions that are > > > > sure to follow, unlikely to show up in testing, and difficult to bisect. > > > > > > > > > > I am back from PTO. > > > > > > It seems inet_ctl_sock_create() has been forgotten. > > > > > > Without following fix, ICMP messages sent from softirq would corrupt > > > innocent thread task_frag. > > > > I didn't consider setting ->sk_use_task_frag on ICMP sockets as my > > understanding was that only TCP and ip_append_data() could eventually > > call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be > > affected. Did I miss something? > > net/ipv4/ping.c > > ICMP uses per-cpu sockets (look in icmp_init(), icmp_xmit_lock()...) > > icmp_rcv() > -> icmp_echo() > -> icmp_reply() > -> icmp_push_reply() > -> ip_append_data() > -> sk_page_frag_refill() Thanks, that looks so obvious now :/ Sorry for missing that case. > > > > > (I will submit this patch formally a bit later today) > > > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > > > index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca > > > 100644 > > > --- a/net/ipv4/af_inet.c > > > +++ b/net/ipv4/af_inet.c > > > @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk, > > > unsigned short family, > > > if (rc == 0) { > > > *sk = sock->sk; > > > (*sk)->sk_allocation = GFP_ATOMIC; > > > + (*sk)->sk_use_task_frag = false; > > > /* > > > * Unhash it so that IP input processing does not even see it, > > > * we do not wish this socket to see incoming packets. > > > > > >
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index 0e58a3187345..757f4692b5bd 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -1030,6 +1030,9 @@ static int conn_connect(struct drbd_connection *connection) sock.socket->sk->sk_allocation = GFP_NOIO; msock.socket->sk->sk_allocation = GFP_NOIO; + sock.socket->sk->sk_use_task_frag = false; + msock.socket->sk->sk_use_task_frag = false; + sock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK; msock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE; diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index e379ccc63c52..592cfa8b765a 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -512,6 +512,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, noreclaim_flag = memalloc_noreclaim_save(); do { sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC; + sock->sk->sk_use_task_frag = false; msg.msg_name = NULL; msg.msg_namelen = 0; msg.msg_control = NULL; diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index b69b89166b6b..8cedc1ef496c 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1537,6 +1537,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid) queue->sock->sk->sk_rcvtimeo = 10 * HZ; queue->sock->sk->sk_allocation = GFP_ATOMIC; + queue->sock->sk->sk_use_task_frag = false; nvme_tcp_set_queue_io_cpu(queue); queue->request = NULL; queue->data_remaining = 0; diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 5fb1f364e815..1d1cf641937c 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -738,6 +738,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session, sk->sk_reuse = SK_CAN_REUSE; sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */ sk->sk_allocation = GFP_ATOMIC; + sk->sk_use_task_frag = false; sk_set_memalloc(sk); sock_no_linger(sk); diff --git a/drivers/usb/usbip/usbip_common.c b/drivers/usb/usbip/usbip_common.c index f8b326eed54d..a2b2da1255dd 100644 --- a/drivers/usb/usbip/usbip_common.c +++ b/drivers/usb/usbip/usbip_common.c @@ -315,6 +315,7 @@ int usbip_recv(struct socket *sock, void *buf, int size) do { sock->sk->sk_allocation = GFP_NOIO; + sock->sk->sk_use_task_frag = false; result = sock_recvmsg(sock, &msg, MSG_WAITALL); if (result <= 0) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index e80252a83225..7bc7b5e03c51 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -2944,6 +2944,7 @@ generic_ip_connect(struct TCP_Server_Info *server) cifs_dbg(FYI, "Socket created\n"); server->ssocket = socket; socket->sk->sk_allocation = GFP_NOFS; + socket->sk->sk_use_task_frag = false; if (sfamily == AF_INET6) cifs_reclassify_socket6(socket); else diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 8b80ca0cd65f..4450721ec83c 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -645,6 +645,7 @@ static void add_sock(struct socket *sock, struct connection *con) if (dlm_config.ci_protocol == DLM_PROTO_SCTP) sk->sk_state_change = lowcomms_state_change; sk->sk_allocation = GFP_NOFS; + sk->sk_use_task_frag = false; sk->sk_error_report = lowcomms_error_report; release_sock(sk); } @@ -1769,6 +1770,7 @@ static int dlm_listen_for_all(void) listen_con.sock = sock; sock->sk->sk_allocation = GFP_NOFS; + sock->sk->sk_use_task_frag = false; sock->sk->sk_data_ready = lowcomms_listen_data_ready; release_sock(sock->sk); diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 37d222bdfc8c..a07b24d170f2 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1602,6 +1602,7 @@ static void o2net_start_connect(struct work_struct *work) sc->sc_sock = sock; /* freed by sc_kref_release */ sock->sk->sk_allocation = GFP_ATOMIC; + sock->sk->sk_use_task_frag = false; myaddr.sin_family = AF_INET; myaddr.sin_addr.s_addr = mynode->nd_ipv4_address; diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index 07db2f436d44..d9120f14684b 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -868,6 +868,7 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket) } csocket->sk->sk_allocation = GFP_NOIO; + csocket->sk->sk_use_task_frag = false; file = sock_alloc_file(csocket, 0, NULL); if (IS_ERR(file)) { pr_err("%s (%d): failed to map fd\n", diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index dfa237fbd5a3..1d06e114ba3f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -446,6 +446,7 @@ int ceph_tcp_connect(struct ceph_connection *con) if (ret) return ret; sock->sk->sk_allocation = GFP_NOFS; + sock->sk->sk_use_task_frag = false; #ifdef CONFIG_LOCKDEP lockdep_set_class(&sock->sk->sk_lock, &socket_class); diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index c0506d0d7478..aaa5b2741b79 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -1882,6 +1882,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt, sk->sk_write_space = xs_udp_write_space; sk->sk_state_change = xs_local_state_change; sk->sk_error_report = xs_error_report; + sk->sk_use_task_frag = false; xprt_clear_connected(xprt); @@ -2082,6 +2083,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) sk->sk_user_data = xprt; sk->sk_data_ready = xs_data_ready; sk->sk_write_space = xs_udp_write_space; + sk->sk_use_task_frag = false; xprt_set_connected(xprt); @@ -2249,6 +2251,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) sk->sk_state_change = xs_tcp_state_change; sk->sk_write_space = xs_tcp_write_space; sk->sk_error_report = xs_error_report; + sk->sk_use_task_frag = false; /* socket options */ sock_reset_flag(sk, SOCK_LINGER); diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c index d6fece1ed982..74a54295c164 100644 --- a/net/xfrm/espintcp.c +++ b/net/xfrm/espintcp.c @@ -489,6 +489,7 @@ static int espintcp_init_sk(struct sock *sk) /* avoid using task_frag */ sk->sk_allocation = GFP_ATOMIC; + sk->sk_use_task_frag = false; return 0;