Message ID | 20220130190259.94593-1-tonylu@linux.alibaba.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node |
On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> Currently, pages are allocated in the process context, for its NUMA node
> isn't equal to ibdev's, which is not the best policy for performance.
>
> Applications will generally perform best when the processes are
> accessing memory on the same NUMA node. When numa_balancing enabled
> (which is enabled by most of OS distributions), it moves tasks closer to
> the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> to the same node usually. This reduces the latency when accessing remote
> memory.

It is very subjective per-specific test. I would expect that
application will control NUMA memory policies (set_mempolicy(), ...)
by itself without kernel setting NUMA node.

Various *_alloc_node() APIs are applicable for in-kernel allocations
where user can't control memory policy.

I don't know SMC-R enough, but if I judge from your description, this
allocation is controlled by the application.

Thanks

>
> According to our tests in different scenarios, there has up to 15.30%
> performance drop (Redis benchmark) when accessing remote memory.
>
> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
> ---
>  net/smc/smc_core.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)

On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> It is very subjective per-specific test. I would expect that
> application will control NUMA memory policies (set_mempolicy(), ...)
> by itself without kernel setting NUMA node.
>
> Various *_alloc_node() APIs are applicable for in-kernel allocations
> where user can't control memory policy.
>
> I don't know SMC-R enough, but if I judge from your description, this
> allocation is controlled by the application.

The original design of SMC doesn't handle the memory allocation of
different NUMA node, and the application can't control the NUMA policy
in SMC.

It allocates memory according to the NUMA node based on the process
context, which is determined by the scheduler. If application process
runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
on the scheduler. If RDMA device is attached to node 1, the process runs
on node 0, it allocates memory on node 0.

This patch tries to allocate memory on the same NUMA node of RDMA
device. Applications can't know the current node of RDMA device. The
scheduler knows the node of memory, and can let applications run on the
same node of memory and RDMA device.

Thanks,
Tony Lu

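The two placement policies described in the message above can be modelled in a few lines of userspace C. This is only a sketch, not kernel code: `ex_preferred_node`, `enum ex_alloc_policy` and `EX_NO_NODE` are hypothetical names, and the model ignores any task memory policy. It also assumes that a device without a known node falls back to the local node, which is my understanding of how `alloc_pages_node()` treats `NUMA_NO_NODE`.

```c
#include <assert.h>

/* Assumption: same value as the kernel's NUMA_NO_NODE. */
#define EX_NO_NODE (-1)

enum ex_alloc_policy {
	EX_ALLOC_LOCAL,   /* before the patch: node of the CPU running the task */
	EX_ALLOC_DEVICE,  /* with the patch: node the RDMA device is attached to */
};

/* Hypothetical model of which node is preferred for a new buffer. */
static int ex_preferred_node(enum ex_alloc_policy policy,
			     int cpu_node, int dev_node)
{
	assert(cpu_node >= 0);

	/* Follow the device only when its node is actually known. */
	if (policy == EX_ALLOC_DEVICE && dev_node != EX_NO_NODE)
		return dev_node;

	return cpu_node;
}
```

With the task on node 0 and the device on node 1, `ex_preferred_node(EX_ALLOC_LOCAL, 0, 1)` yields 0 and `ex_preferred_node(EX_ALLOC_DEVICE, 0, 1)` yields 1, which is the case the message describes. Note that the result is a preference: without `__GFP_THISNODE` the page allocator may still fall back to another node.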
On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> The original design of SMC doesn't handle the memory allocation of
> different NUMA node, and the application can't control the NUMA policy
> in SMC.
>
> It allocates memory according to the NUMA node based on the process
> context, which is determined by the scheduler. If application process
> runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
> on the scheduler. If RDMA device is attached to node 1, the process runs
> on node 0, it allocates memory on node 0.
>
> This patch tries to allocate memory on the same NUMA node of RDMA
> device. Applications can't know the current node of RDMA device. The
> scheduler knows the node of memory, and can let applications run on the
> same node of memory and RDMA device.

I don't know, everything explained above is controlled through memory
policy, where application needs to run on same node as ibdev.

Thanks

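For readers unfamiliar with the memory policy interface referred to here, the following is a minimal userspace sketch of an application restricting its own allocations to one NUMA node with set_mempolicy(2). The helper names (`ex_node_to_mask`, `ex_bind_memory_to_node`) and the local `EX_MPOL_BIND` constant are made up for illustration, and the raw `syscall()` is used only so the sketch does not need libnuma. Whether such a policy reaches the SMC buffers depends on the allocation running in the task's own context, which is exactly what the thread is arguing about.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mirrors MPOL_BIND from <linux/mempolicy.h>; defined locally
 * so the sketch does not depend on libnuma's <numaif.h>. */
#define EX_MPOL_BIND 2

#define EX_MASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: one-word node mask with only `node` set. */
static unsigned long ex_node_to_mask(int node)
{
	/* Stay below the last bit to keep clear of maxnode edge cases. */
	assert(node >= 0 && node < EX_MASK_BITS - 1);
	return 1UL << node;
}

/* Hypothetical helper: ask the kernel to satisfy the calling thread's
 * future allocations from `node` only. Returns 0 on success, or -1 with
 * errno set (for example EPERM in a restricted container, or EINVAL if
 * the node does not exist). */
static int ex_bind_memory_to_node(int node)
{
	unsigned long mask = ex_node_to_mask(node);

	return (int)syscall(SYS_set_mempolicy, EX_MPOL_BIND, &mask,
			    (unsigned long)EX_MASK_BITS);
}
```

An application that knows its RDMA device sits on node 1 would call `ex_bind_memory_to_node(1)` before `connect()` and check the return value. An unmodified TCP application would never make such a call.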
On 2/7/22 14:49, Leon Romanovsky wrote:
> I don't know, everything explained above is controlled through memory
> policy, where application needs to run on same node as ibdev.

The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
applications. The idea is to avoid almost any modification to the
application, just switch the address family. So while what you say makes a
lot of sense for applications that intend to use RDMA, in the case of SMC-R
we can safely assume that most if not all applications running it assume
they get connectivity through a non-RDMA NIC. Hence we cannot expect the
applications to think about aspects such as NUMA, and we should do the right
thing within SMC-R.

Ciao,
Stefan

On Tue, Feb 08, 2022 at 10:10:55AM +0100, Stefan Raspl wrote:
> The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
> applications. The idea is to avoid almost any modification to the
> application, just switch the address family. So while what you say makes a
> lot of sense for applications that intend to use RDMA, in the case of SMC-R
> we can safely assume that most if not all applications running it assume
> they get connectivity through a non-RDMA NIC. Hence we cannot expect the
> applications to think about aspects such as NUMA, and we should do the right
> thing within SMC-R.

And here comes the problem, you are doing the right thing for very
specific and narrow use case, where application and ibdev run on
same node. It is not true for multi-core systems as application will
be scheduled on less load node (in very simplistic form).

In general case, the application will get CPU and memory based on scheduler
heuristic as you don't use memory policy to restrict it. The assumption
that allocations need to be close to ibdev and not to applications can
lead to worse performance.

Thanks

On Tue, Feb 08, 2022 at 11:32:23AM +0200, Leon Romanovsky wrote:
> And here comes the problem, you are doing the right thing for very
> specific and narrow use case, where application and ibdev run on
> same node. It is not true for multi-core systems as application will
> be scheduled on less load node (in very simplistic form).
>
> In general case, the application will get CPU and memory based on scheduler
> heuristic as you don't use memory policy to restrict it. The assumption
> that allocations need to be close to ibdev and not to applications can
> lead to worse performance.

Yes, the applications cannot run faster if they always access remote
memory. There is something complex in SMC, so we choose to bind to the
RDMA device.

As Stefan mentioned, SMC is to provide a drop-in replacement for TCP.

SMC doesn't allocate memory for a new connection most of the time; it has
a link-group-level buffer reuse pool. Memory is only allocated while
connecting, in process context or in a workqueue (non-blocking), if there
is no free buffer at that point. Later connections reuse the buffers in
the link group. The data operations (send/recv) occur afterwards and are
woken up by the scheduler (epoll). Also, local IRQ binding can help the
process run on the node of the RDMA device.

NUMA 0                     | NUMA 1
// Application A           |
connect()                  |
smc_connect_rdma()         |
smc_conn_create()          |
// create buffer           |
smc_buf_create()           |
...                        |
                           |
close()                    |
...                        |
// recycle buffer          |
smc_buf_unuse()            |
                           | // Application B
                           | connect()
                           | smc_connect_rdma()
                           | smc_conn_create()
                           | // reuse buffer in NUMA 0
                           | smc_buf_create()

Thanks,
Tony Lu

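The reuse scenario in the diagram above can be reproduced with a toy pool in userspace C. This is a sketch under simplifying assumptions, not the SMC implementation: all names (`ex_pool`, `ex_buf_get`, `ex_buf_put`) are hypothetical, the pool has a fixed arbitrary capacity, and buffer sizes are ignored even though the real pool, as far as I understand it, also matches on size.

```c
#include <assert.h>
#include <stddef.h>

#define EX_POOL_MAX 8  /* arbitrary capacity for the toy pool */

struct ex_buf {
	int node;  /* NUMA node the pages were allocated on */
	int used;  /* 1 while a connection owns the buffer */
};

struct ex_pool {
	struct ex_buf bufs[EX_POOL_MAX];
	size_t count;  /* number of buffers allocated so far */
};

/* Hand out a free buffer if one exists; allocate on alloc_node otherwise.
 * A reused buffer keeps the node it was first allocated on. */
static struct ex_buf *ex_buf_get(struct ex_pool *pool, int alloc_node)
{
	assert(pool != NULL);

	for (size_t i = 0; i < pool->count; i++) {
		if (!pool->bufs[i].used) {
			pool->bufs[i].used = 1;
			return &pool->bufs[i];
		}
	}

	if (pool->count == EX_POOL_MAX)
		return NULL;  /* pool exhausted */

	struct ex_buf *buf = &pool->bufs[pool->count++];
	buf->node = alloc_node;
	buf->used = 1;
	return buf;
}

/* Return a buffer to the pool; its pages stay where they are. */
static void ex_buf_put(struct ex_buf *buf)
{
	assert(buf != NULL);
	buf->used = 0;
}
```

Starting from an empty pool, a get on behalf of a task on node 0, a put, and then a get on behalf of a task on node 1 returns the same buffer with `node` still 0. The second task ends up with remote memory no matter where it runs, which is the motivation given for tying the allocation to the device node instead.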
On Wed, Feb 09, 2022 at 04:00:34PM +0800, Tony Lu wrote:
> Yes, the applications cannot run faster if they always access remote
> memory. There is something complex in SMC, so we choose to bind to the
> RDMA device.
>
> As Stefan mentioned, SMC is to provide a drop-in replacement for TCP.

If I'm looking at the right piece of code (net/core/skbuff.c:build_skb),
even SKB is not allocated close to the Ethernet device.

I'm not convinced that SMC should be different here.

Thanks

Currently, pages are allocated in the process context, for its NUMA node
isn't equal to ibdev's, which is not the best policy for performance.

Applications will generally perform best when the processes are
accessing memory on the same NUMA node. When numa_balancing enabled
(which is enabled by most of OS distributions), it moves tasks closer to
the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
to the same node usually. This reduces the latency when accessing remote
memory.

According to our tests in different scenarios, there has up to 15.30%
performance drop (Redis benchmark) when accessing remote memory.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 8935ef4811b0..2a28b045edfa 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2065,9 +2065,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
 	return rc;
 }
 
-static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
+static struct smc_buf_desc *smcr_new_buf_create(struct smc_connection *conn,
 						bool is_rmb, int bufsize)
 {
+	int node = ibdev_to_node(conn->lnk->smcibdev->ibdev);
 	struct smc_buf_desc *buf_desc;
 
 	/* try to alloc a new buffer */
@@ -2076,10 +2077,10 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 		return ERR_PTR(-ENOMEM);
 
 	buf_desc->order = get_order(bufsize);
-	buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-				      __GFP_NOMEMALLOC | __GFP_COMP |
-				      __GFP_NORETRY | __GFP_ZERO,
-				      buf_desc->order);
+	buf_desc->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_NOWARN |
+					   __GFP_NOMEMALLOC | __GFP_COMP |
+					   __GFP_NORETRY | __GFP_ZERO,
+					   buf_desc->order);
 	if (!buf_desc->pages) {
 		kfree(buf_desc);
 		return ERR_PTR(-EAGAIN);
@@ -2190,7 +2191,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 		if (is_smcd)
 			buf_desc = smcd_new_buf_create(lgr, is_rmb, bufsize);
 		else
-			buf_desc = smcr_new_buf_create(lgr, is_rmb, bufsize);
+			buf_desc = smcr_new_buf_create(conn, is_rmb, bufsize);
 
 		if (PTR_ERR(buf_desc) == -ENOMEM)
 			break;