| Message ID | 20230711202047.3818697-10-willy@infradead.org (mailing list archive) |
|---|---|
| State | New |
| Series | Avoid the mmap lock for fault-around |
On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
>
> From: Arjun Roy <arjunroy@google.com>
>
> Per-VMA locking allows us to lock a struct vm_area_struct without
> taking the process-wide mmap lock in read mode.
>
> Consider a process workload where the mmap lock is taken constantly in
> write mode. In this scenario, all zerocopy receives are periodically
> blocked during that period of time - though in principle, the memory
> ranges being used by TCP are not touched by the operations that need
> the mmap write lock. This results in performance degradation.
>
> Now consider another workload where the mmap lock is never taken in
> write mode, but there are many TCP connections using receive zerocopy
> that are concurrently receiving. These connections all take the mmap
> lock in read mode, but this does induce a lot of contention and atomic
> ops for this process-wide lock. This results in additional CPU
> overhead caused by contending on the cache line for this lock.
>
> However, with per-vma locking, both of these problems can be avoided.
>
> As a test, I ran an RPC-style request/response workload with 4KB
> payloads and receive zerocopy enabled, with 100 simultaneous TCP
> connections. I measured perf cycles within the
> find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
> without per-vma locking enabled.
>
> When using process-wide mmap semaphore read locking, about 1% of
> measured perf cycles were within this path. With per-VMA locking, this
> value dropped to about 0.45%.
>
> Signed-off-by: Arjun Roy <arjunroy@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Seems to match the original version with less churn.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  net/ipv4/tcp.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1542de3f66f7..7118ec6cf886 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2038,6 +2038,30 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
>  	}
>  }
>

nit: Maybe add a comment that mmap_locked value is undefined/meaningless if the function returns NULL?
> +static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> +		unsigned long address, bool *mmap_locked)
> +{
> +	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> +
> +	if (vma) {
> +		if (vma->vm_ops != &tcp_vm_ops) {
> +			vma_end_read(vma);
> +			return NULL;
> +		}
> +		*mmap_locked = false;
> +		return vma;
> +	}
> +
> +	mmap_read_lock(mm);
> +	vma = vma_lookup(mm, address);
> +	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> +		mmap_read_unlock(mm);
> +		return NULL;
> +	}
> +	*mmap_locked = true;
> +	return vma;
> +}
> +
>  #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
>  static int tcp_zerocopy_receive(struct sock *sk,
>  				struct tcp_zerocopy_receive *zc,
> @@ -2055,6 +2079,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  	u32 seq = tp->copied_seq;
>  	u32 total_bytes_to_map;
>  	int inq = tcp_inq(sk);
> +	bool mmap_locked;
>  	int ret;
>
>  	zc->copybuf_len = 0;
> @@ -2079,13 +2104,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  		return 0;
>  	}
>
> -	mmap_read_lock(current->mm);
> -
> -	vma = vma_lookup(current->mm, address);
> -	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> -		mmap_read_unlock(current->mm);
> +	vma = find_tcp_vma(current->mm, address, &mmap_locked);
> +	if (!vma)
>  		return -EINVAL;
> -	}
> +
>  	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
>  	avail_len = min_t(u32, vma_len, inq);
>  	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> @@ -2159,7 +2181,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  						zc, total_bytes_to_map);
>  	}
>  out:
> -	mmap_read_unlock(current->mm);
> +	if (mmap_locked)
> +		mmap_read_unlock(current->mm);
> +	else
> +		vma_end_read(vma);
>  	/* Try to copy straggler data. */
>  	if (!ret)
>  		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> --
> 2.39.2
>
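For illustration, the comment Suren asks for might read something like the sketch below. The wording is hypothetical, but the contract it documents — no lock held and *mmap_locked left unwritten when NULL is returned — follows directly from find_tcp_vma() as posted.

```c
/*
 * Hypothetical comment along the lines of Suren's nit -- not part of the
 * posted patch.
 *
 * find_tcp_vma() - look up and lock the TCP VMA covering @address.
 *
 * On success the returned VMA is locked: per-VMA read-locked if
 * *@mmap_locked is set to false (release with vma_end_read()), or held
 * under the mmap read lock if *@mmap_locked is set to true (release with
 * mmap_read_unlock()).
 *
 * On failure (NULL return) no lock is held and *@mmap_locked is never
 * written, so its value is undefined and must not be consulted.
 */
static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
					   unsigned long address,
					   bool *mmap_locked);
```

The caller in tcp_zerocopy_receive() already respects that contract: it only reads mmap_locked at the out: label, which is reached after the -EINVAL early return for a NULL VMA.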
Eric? Arjun? Any comments?

On Tue, Jul 11, 2023 at 09:20:47PM +0100, Matthew Wilcox (Oracle) wrote:
> From: Arjun Roy <arjunroy@google.com>
>
> Per-VMA locking allows us to lock a struct vm_area_struct without
> taking the process-wide mmap lock in read mode.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1542de3f66f7..7118ec6cf886 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2038,6 +2038,30 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
 	}
 }
 
+static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
+		unsigned long address, bool *mmap_locked)
+{
+	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
+
+	if (vma) {
+		if (vma->vm_ops != &tcp_vm_ops) {
+			vma_end_read(vma);
+			return NULL;
+		}
+		*mmap_locked = false;
+		return vma;
+	}
+
+	mmap_read_lock(mm);
+	vma = vma_lookup(mm, address);
+	if (!vma || vma->vm_ops != &tcp_vm_ops) {
+		mmap_read_unlock(mm);
+		return NULL;
+	}
+	*mmap_locked = true;
+	return vma;
+}
+
 #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc,
@@ -2055,6 +2079,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	u32 seq = tp->copied_seq;
 	u32 total_bytes_to_map;
 	int inq = tcp_inq(sk);
+	bool mmap_locked;
 	int ret;
 
 	zc->copybuf_len = 0;
@@ -2079,13 +2104,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		return 0;
 	}
 
-	mmap_read_lock(current->mm);
-
-	vma = vma_lookup(current->mm, address);
-	if (!vma || vma->vm_ops != &tcp_vm_ops) {
-		mmap_read_unlock(current->mm);
+	vma = find_tcp_vma(current->mm, address, &mmap_locked);
+	if (!vma)
 		return -EINVAL;
-	}
+
 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
 	avail_len = min_t(u32, vma_len, inq);
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
@@ -2159,7 +2181,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
 					      zc, total_bytes_to_map);
 	}
 out:
-	mmap_read_unlock(current->mm);
+	if (mmap_locked)
+		mmap_read_unlock(current->mm);
+	else
+		vma_end_read(vma);
 	/* Try to copy straggler data. */
 	if (!ret)
 		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
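For context on what find_tcp_vma() ends up locking: the address in zc->address refers to a region the application created by mmap()ing the TCP socket itself, which is why the kernel can type-check the VMA against tcp_vm_ops. A rough userspace sketch of that flow follows; the helper name and parameters are made up for illustration, it assumes a connected socket and the TCP_ZEROCOPY_RECEIVE definitions in <linux/tcp.h>, and it omits error handling and the post-call bookkeeping (recv_skip_hint, leftover copy).

```c
/* Rough sketch of the userspace side of TCP receive zerocopy -- not from the
 * patch. Assumes a connected TCP socket fd and a kernel that provides
 * TCP_ZEROCOPY_RECEIVE and struct tcp_zerocopy_receive in <linux/tcp.h>. */
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

static void zerocopy_receive_once(int fd, size_t map_len)
{
	/* Map the socket itself; the resulting VMA is the one find_tcp_vma()
	 * looks up and checks via vm_ops in the kernel. */
	void *addr = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	if (addr == MAP_FAILED)
		return;

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)addr; /* address passed to find_tcp_vma() */
	zc.length  = (__u32)map_len;             /* a multiple of the page size */

	/* Ask the kernel to map received pages into [addr, addr + length);
	 * on return, zc.length reports how many bytes were actually mapped. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) == 0) {
		/* consume zc.length bytes of payload directly from addr here */
	}

	munmap(addr, map_len);
}
```

This is also where the page-alignment in the patch comes from: the kernel only maps whole pages (total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1)), and any shorter remainder is handled by the straggler-copy path at the end of tcp_zerocopy_receive().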