Message ID: 1624591136-6647-3-git-send-email-linyunsheng@huawei.com
State: New
Series: add benchmark selftest and optimization for ptr_ring
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.

Well the documentation for __ptr_ring_empty clearly states it
is not guaranteed to be reliable.

 *
 * NB: This is only safe to call if ring is never resized.
 *
 * However, if some other CPU consumes ring entries at the same time, the value
 * returned is not guaranteed to be correct.
 *
 * In this case - to avoid incorrectly detecting the ring
 * as empty - the CPU consuming the ring entries is responsible
 * for either consuming all ring entries until the ring is empty,
 * or synchronizing with some other CPU and causing it to
 * re-test __ptr_ring_empty and/or consume the ring entries
 * after the synchronization point.
 *

Is it then the case that page_pool_refill_alloc_cache violates
this requirement? How?

It looks like you are trying to make the guarantee stronger and ensure
no false positives.

If yes please document this as such, update the comment so all
code can be evaluated with the eye towards whether the new stronger
guarantee is maintained. In particular I think I see at least one
issue with this immediately.

> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
>
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
>
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
>
>  arch     unpatched    patched    delta
> arm64      4648 ms     4464 ms    +3.9%
>  X86       2562 ms     2401 ms    +6.2%
>
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
>
>  arch        unpatched             patched          delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> 	 * to work correctly.
> 	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>
> 	/* Once we have processed enough entries invalidate them in
> 	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> 	 */
> 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> 		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
> 		/* Zero out entries in the reverse order: this way we touch the
> 		 * cache line that producer might currently be reading the last;
> 		 * producer won't make progress and touch other cache lines
> 		 * besides the first one until we write out all entries.
> 		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;

So if now we need this to be reliable then
we also need smp_wmb before writing r->queue[consumer_head],
there could be other gotchas.

> 	}
> +
> 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> 	WRITE_ONCE(r->consumer_head, consumer_head);
> }
> --
> 2.7.4
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.
>
> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
>
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
>
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
>
>  arch     unpatched    patched    delta
> arm64      4648 ms     4464 ms    +3.9%
>  X86       2562 ms     2401 ms    +6.2%
>
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
>
>  arch        unpatched             patched          delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%

Nice but it's small - could be a fluke.
How many tests did you run? What is the variance?
Did you try pinning to different CPUs to observe numa effects?
Please use perf or some other modern tool for this kind
of benchmark. Thanks!

>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> 	 * to work correctly.
> 	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>
> 	/* Once we have processed enough entries invalidate them in
> 	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> 	 */
> 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> 		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
> 		/* Zero out entries in the reverse order: this way we touch the
> 		 * cache line that producer might currently be reading the last;
> 		 * producer won't make progress and touch other cache lines
> 		 * besides the first one until we write out all entries.
> 		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;
> 	}
> +
> 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> 	WRITE_ONCE(r->consumer_head, consumer_head);
> }
> --
> 2.7.4
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
>
>
> Well the documentation for __ptr_ring_empty clearly states it
> is not guaranteed to be reliable.

Yes, this patch does not make __ptr_ring_empty() strictly reliable
without taking the r->consumer_lock, as the discussion in [1].

1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011

>
>  *
>  * NB: This is only safe to call if ring is never resized.
>  *
>  * However, if some other CPU consumes ring entries at the same time, the value
>  * returned is not guaranteed to be correct.
>  *
>  * In this case - to avoid incorrectly detecting the ring
>  * as empty - the CPU consuming the ring entries is responsible
>  * for either consuming all ring entries until the ring is empty,
>  * or synchronizing with some other CPU and causing it to
>  * re-test __ptr_ring_empty and/or consume the ring entries
>  * after the synchronization point.
>  *
>
> Is it then the case that page_pool_refill_alloc_cache violates
> this requirement? How?

As my understanding:
page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
taking r->consumer_lock, when the above data race happens, it will
exit out and allocate page from the page allocator instead of reusing
the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
is more reliable.

>
> It looks like you are trying to make the guarantee stronger and ensure
> no false positives.
>
> If yes please document this as such, update the comment so all
> code can be evaluated with the eye towards whether the new stronger
> guarantee is maintained. In particular I think I see at least one
> issue with this immediately.
>
>
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched    patched    delta
>> arm64      4648 ms     4464 ms    +3.9%
>>  X86       2562 ms     2401 ms    +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch        unpatched             patched          delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> V2: Add performance data.
>> ---
>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index 808f9d3..db9c282 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>> 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>> 	 * to work correctly.
>> 	 */
>> -	int consumer_head = r->consumer_head;
>> -	int head = consumer_head++;
>> +	int consumer_head = r->consumer_head + 1;
>>
>> 	/* Once we have processed enough entries invalidate them in
>> 	 * the ring all at once so producer can reuse their space in the ring.
>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>> 	 */
>> 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>> 		     consumer_head >= r->size)) {
>> +		int tail = r->consumer_tail;
>> +
>> +		if (unlikely(consumer_head >= r->size)) {
>> +			r->consumer_tail = 0;
>> +			WRITE_ONCE(r->consumer_head, 0);
>> +		} else {
>> +			r->consumer_tail = consumer_head;
>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>> +		}
>> +
>> 		/* Zero out entries in the reverse order: this way we touch the
>> 		 * cache line that producer might currently be reading the last;
>> 		 * producer won't make progress and touch other cache lines
>> 		 * besides the first one until we write out all entries.
>> 		 */
>> -		while (likely(head >= r->consumer_tail))
>> -			r->queue[head--] = NULL;
>> -		r->consumer_tail = consumer_head;
>> -	}
>> -	if (unlikely(consumer_head >= r->size)) {
>> -		consumer_head = 0;
>> -		r->consumer_tail = 0;
>> +		while (likely(--consumer_head >= tail))
>> +			r->queue[consumer_head] = NULL;
>> +
>> +		return;
>
>
> So if now we need this to be reliable then
> we also need smp_wmb before writing r->queue[consumer_head],
> there could be other gotchas.

Yes, this patch does not make it strictly reliable.
I think I could mention that in the commit log?

>
>> 	}
>> +
>> 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>> 	WRITE_ONCE(r->consumer_head, consumer_head);
>> }
>> --
>> 2.7.4
>
>
> .
>
On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> >
> >
> > Well the documentation for __ptr_ring_empty clearly states it
> > is not guaranteed to be reliable.
>
> Yes, this patch does not make __ptr_ring_empty() strictly reliable
> without taking the r->consumer_lock, as the discussion in [1].
>
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
>
> >
> >  *
> >  * NB: This is only safe to call if ring is never resized.
> >  *
> >  * However, if some other CPU consumes ring entries at the same time, the value
> >  * returned is not guaranteed to be correct.
> >  *
> >  * In this case - to avoid incorrectly detecting the ring
> >  * as empty - the CPU consuming the ring entries is responsible
> >  * for either consuming all ring entries until the ring is empty,
> >  * or synchronizing with some other CPU and causing it to
> >  * re-test __ptr_ring_empty and/or consume the ring entries
> >  * after the synchronization point.
> >  *
> >
> > Is it then the case that page_pool_refill_alloc_cache violates
> > this requirement? How?
>
> As my understanding:
> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> taking r->consumer_lock, when the above data race happens, it will
> exit out and allocate page from the page allocator instead of reusing
> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> is more reliable.

Question is how do we know it's more reliable?
It would be nice if we actually made it more reliable;
as it is we are just shifting races around.

> >
> > It looks like you are trying to make the guarantee stronger and ensure
> > no false positives.
> >
> > If yes please document this as such, update the comment so all
> > code can be evaluated with the eye towards whether the new stronger
> > guarantee is maintained. In particular I think I see at least one
> > issue with this immediately.
> >
> >
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >>  arch     unpatched    patched    delta
> >> arm64      4648 ms     4464 ms    +3.9%
> >>  X86       2562 ms     2401 ms    +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >>  arch        unpatched             patched          delta
> >> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
> >>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >> V2: Add performance data.
> >> ---
> >>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index 808f9d3..db9c282 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >> 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >> 	 * to work correctly.
> >> 	 */
> >> -	int consumer_head = r->consumer_head;
> >> -	int head = consumer_head++;
> >> +	int consumer_head = r->consumer_head + 1;
> >>
> >> 	/* Once we have processed enough entries invalidate them in
> >> 	 * the ring all at once so producer can reuse their space in the ring.
> >> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >> 	 */
> >> 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >> 		     consumer_head >= r->size)) {
> >> +		int tail = r->consumer_tail;
> >> +
> >> +		if (unlikely(consumer_head >= r->size)) {
> >> +			r->consumer_tail = 0;
> >> +			WRITE_ONCE(r->consumer_head, 0);
> >> +		} else {
> >> +			r->consumer_tail = consumer_head;
> >> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >> +		}
> >> +
> >> 		/* Zero out entries in the reverse order: this way we touch the
> >> 		 * cache line that producer might currently be reading the last;
> >> 		 * producer won't make progress and touch other cache lines
> >> 		 * besides the first one until we write out all entries.
> >> 		 */
> >> -		while (likely(head >= r->consumer_tail))
> >> -			r->queue[head--] = NULL;
> >> -		r->consumer_tail = consumer_head;
> >> -	}
> >> -	if (unlikely(consumer_head >= r->size)) {
> >> -		consumer_head = 0;
> >> -		r->consumer_tail = 0;
> >> +		while (likely(--consumer_head >= tail))
> >> +			r->queue[consumer_head] = NULL;
> >> +
> >> +		return;
> >
> >
> > So if now we need this to be reliable then
> > we also need smp_wmb before writing r->queue[consumer_head],
> > there could be other gotchas.
>
> Yes, this patch does not make it strictly reliable.
> I think I could mention that in the commit log?

OK so it's not that it makes it more reliable - this patch simply makes
a possible false positive less likely while making a false negative
more likely. Our assumption is that a false negative is cheaper, then?

How do we know that it is?

And even if we prove the ptr_ring itself is faster now,
how do we know what affects callers in a better way, a
false positive or a false negative?

I would rather we worked on actually making it reliable,
e.g. if we can guarantee no false positives; that would be
a net win.

>
> >
> >> 	}
> >> +
> >> 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >> 	WRITE_ONCE(r->consumer_head, consumer_head);
> >> }
> >> --
> >> 2.7.4
> >
> >
> > .
> >
On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>
>>>
>>> Well the documentation for __ptr_ring_empty clearly states it
>>> is not guaranteed to be reliable.
>>
>> Yes, this patch does not make __ptr_ring_empty() strictly reliable
>> without taking the r->consumer_lock, as the discussion in [1].
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
>>
>>>
>>>  *
>>>  * NB: This is only safe to call if ring is never resized.
>>>  *
>>>  * However, if some other CPU consumes ring entries at the same time, the value
>>>  * returned is not guaranteed to be correct.
>>>  *
>>>  * In this case - to avoid incorrectly detecting the ring
>>>  * as empty - the CPU consuming the ring entries is responsible
>>>  * for either consuming all ring entries until the ring is empty,
>>>  * or synchronizing with some other CPU and causing it to
>>>  * re-test __ptr_ring_empty and/or consume the ring entries
>>>  * after the synchronization point.
>>>  *
>>>
>>> Is it then the case that page_pool_refill_alloc_cache violates
>>> this requirement? How?
>>
>> As my understanding:
>> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
>> taking r->consumer_lock, when the above data race happens, it will
>> exit out and allocate page from the page allocator instead of reusing
>> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
>> is more reliable.
>
> Question is how do we know it's more reliable?
> It would be nice if we actually made it more reliable;
> as it is we are just shifting races around.
>
>
>>>
>>> It looks like you are trying to make the guarantee stronger and ensure
>>> no false positives.
>>>
>>> If yes please document this as such, update the comment so all
>>> code can be evaluated with the eye towards whether the new stronger
>>> guarantee is maintained. In particular I think I see at least one
>>> issue with this immediately.
>>>
>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>>  arch     unpatched    patched    delta
>>>> arm64      4648 ms     4464 ms    +3.9%
>>>>  X86       2562 ms     2401 ms    +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>>  arch        unpatched             patched          delta
>>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
>>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>> V2: Add performance data.
>>>> ---
>>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>> index 808f9d3..db9c282 100644
>>>> --- a/include/linux/ptr_ring.h
>>>> +++ b/include/linux/ptr_ring.h
>>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>> 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>>>> 	 * to work correctly.
>>>> 	 */
>>>> -	int consumer_head = r->consumer_head;
>>>> -	int head = consumer_head++;
>>>> +	int consumer_head = r->consumer_head + 1;
>>>>
>>>> 	/* Once we have processed enough entries invalidate them in
>>>> 	 * the ring all at once so producer can reuse their space in the ring.
>>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>> 	 */
>>>> 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>>>> 		     consumer_head >= r->size)) {
>>>> +		int tail = r->consumer_tail;
>>>> +
>>>> +		if (unlikely(consumer_head >= r->size)) {
>>>> +			r->consumer_tail = 0;
>>>> +			WRITE_ONCE(r->consumer_head, 0);
>>>> +		} else {
>>>> +			r->consumer_tail = consumer_head;
>>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>>>> +		}
>>>> +
>>>> 		/* Zero out entries in the reverse order: this way we touch the
>>>> 		 * cache line that producer might currently be reading the last;
>>>> 		 * producer won't make progress and touch other cache lines
>>>> 		 * besides the first one until we write out all entries.
>>>> 		 */
>>>> -		while (likely(head >= r->consumer_tail))
>>>> -			r->queue[head--] = NULL;
>>>> -		r->consumer_tail = consumer_head;
>>>> -	}
>>>> -	if (unlikely(consumer_head >= r->size)) {
>>>> -		consumer_head = 0;
>>>> -		r->consumer_tail = 0;
>>>> +		while (likely(--consumer_head >= tail))
>>>> +			r->queue[consumer_head] = NULL;
>>>> +
>>>> +		return;
>>>
>>>
>>> So if now we need this to be reliable then
>>> we also need smp_wmb before writing r->queue[consumer_head],
>>> there could be other gotchas.
>>
>> Yes, this patch does not make it strictly reliable.
>> I think I could mention that in the commit log?
>
> OK so it's not that it makes it more reliable - this patch simply makes
> a possible false positive less likely while making a false negative
> more likely. Our assumption is that a false negative is cheaper, then?
>
> How do we know that it is?
>
> And even if we prove the ptr_ring itself is faster now,
> how do we know what affects callers in a better way, a
> false positive or a false negative?
>
> I would rather we worked on actually making it reliable,
> e.g. if we can guarantee no false positives; that would be
> a net win.

I thought deeper about the case you mentioned above. It seems for the
above to happen, the consumer_head needs to be rolled back to zero and
incremented to the point when the caller of __ptr_ring_empty() is still
*not* able to see the r->queue[] which has been set to NULL in
__ptr_ring_discard_one().

It seems smp_wmb() only needs to be done once, when consumer_head is
rolled back to zero, and maybe that is enough to make sure the case
you mentioned is fixed too?

And the smp_wmb() is only done once in a round of producing/
consuming, so the performance impact should be minimized? (Of
course we need to test it too.)

>
>>
>>>
>>>> 	}
>>>> +
>>>> 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>>>> 	WRITE_ONCE(r->consumer_head, consumer_head);
>>>> }
>>>> --
>>>> 2.7.4
>>>
>>>
>>> .
>>>
>
>
> .
>
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
>>
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched    patched    delta
>> arm64      4648 ms     4464 ms    +3.9%
>>  X86       2562 ms     2401 ms    +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch        unpatched             patched          delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms      +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms      +7.6%
>
> Nice but it's small - could be a fluke.
> How many tests did you run? What is the variance?
> Did you try pinning to different CPUs to observe numa effects?
> Please use perf or some other modern tool for this kind
> of benchmark. Thanks!
The result is quite stable, and retest using perf stat:

---------------unpatched ptr_ring.c begin----------------------------------

perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

          2385.49 msec task-clock         #    1.000 CPUs utilized
               26      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.024 K/sec
       6202023521      cycles             #    2.600 GHz
      17424187640      instructions       #    2.81  insn per cycle
  <not supported>      branches
          6506477      branch-misses

      2.385785170 seconds time elapsed
      2.384014000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

          2383.67 msec task-clock         #    1.000 CPUs utilized
               26      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.024 K/sec
       6197278066      cycles             #    2.600 GHz
      17424207772      instructions       #    2.81  insn per cycle
  <not supported>      branches
          6495766      branch-misses

      2.383941170 seconds time elapsed
      2.382215000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

          2391.16 msec task-clock         #    1.000 CPUs utilized
               25      context-switches   #    0.010 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.024 K/sec
       6216704120      cycles             #    2.600 GHz
      17424243041      instructions       #    2.80  insn per cycle
  <not supported>      branches
          6483886      branch-misses

      2.391420440 seconds time elapsed
      2.389647000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

          2390.10 msec task-clock         #    1.000 CPUs utilized
               26      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               58      page-faults        #    0.024 K/sec
       6213995715      cycles             #    2.600 GHz
      17424227499      instructions       #    2.80  insn per cycle
  <not supported>      branches
          6474069      branch-misses

      2.390367070 seconds time elapsed
      2.388644000 seconds user
      0.000000000 seconds sys

---------------unpatched ptr_ring.c end----------------------------------

---------------patched ptr_ring.c begin----------------------------------

root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

          2199.18 msec task-clock         #    1.000 CPUs utilized
               23      context-switches   #    0.010 K/sec
                0      cpu-migrations     #    0.000 K/sec
               56      page-faults        #    0.025 K/sec
       5717671859      cycles             #    2.600 GHz
      16124164124      instructions       #    2.82  insn per cycle
  <not supported>      branches
          6564829      branch-misses

      2.199445990 seconds time elapsed
      2.197859000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

          2222.63 msec task-clock         #    1.000 CPUs utilized
               23      context-switches   #    0.010 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.026 K/sec
       5778632853      cycles             #    2.600 GHz
      16124210769      instructions       #    2.79  insn per cycle
  <not supported>      branches
          6603904      branch-misses

      2.222901020 seconds time elapsed
      2.221312000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

          2252.28 msec task-clock         #    1.000 CPUs utilized
               25      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.025 K/sec
       5855668335      cycles             #    2.600 GHz
      16124310588      instructions       #    2.75  insn per cycle
  <not supported>      branches
          6777279      branch-misses

      2.252543340 seconds time elapsed
      2.250897000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

          2209.70 msec task-clock         #    1.000 CPUs utilized
               24      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               58      page-faults        #    0.026 K/sec
       5745003772      cycles             #    2.600 GHz
      16124198886      instructions       #    2.81  insn per cycle
  <not supported>      branches
          6508414      branch-misses

      2.209973960 seconds time elapsed
      2.208354000 seconds user
      0.000000000 seconds sys

root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

          2211.70 msec task-clock         #    1.000 CPUs utilized
               24      context-switches   #    0.011 K/sec
                0      cpu-migrations     #    0.000 K/sec
               57      page-faults        #    0.026 K/sec
       5750136694      cycles             #    2.600 GHz
      16124176577      instructions       #    2.80  insn per cycle
  <not supported>      branches
          6553023      branch-misses

      2.211968470 seconds time elapsed
      2.210303000 seconds user
      0.000000000 seconds sys

---------------patched ptr_ring.c end----------------------------------
On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> >> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >>>> Currently r->queue[] is cleared after r->consumer_head is moved
> >>>> forward, which makes the __ptr_ring_empty() checking called in
> >>>> page_pool_refill_alloc_cache() unreliable if the checking is done
> >>>> after the r->queue clearing and before the consumer_head moving
> >>>> forward.
> >>>
> >>>
> >>> Well the documentation for __ptr_ring_empty clearly states it
> >>> is not guaranteed to be reliable.
> >>
> >> Yes, this patch does not make __ptr_ring_empty() strictly reliable
> >> without taking the r->consumer_lock, as the discussion in [1].
> >>
> >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
> >>
> >>>
> >>>  *
> >>>  * NB: This is only safe to call if ring is never resized.
> >>>  *
> >>>  * However, if some other CPU consumes ring entries at the same time, the value
> >>>  * returned is not guaranteed to be correct.
> >>>  *
> >>>  * In this case - to avoid incorrectly detecting the ring
> >>>  * as empty - the CPU consuming the ring entries is responsible
> >>>  * for either consuming all ring entries until the ring is empty,
> >>>  * or synchronizing with some other CPU and causing it to
> >>>  * re-test __ptr_ring_empty and/or consume the ring entries
> >>>  * after the synchronization point.
> >>>  *
> >>>
> >>> Is it then the case that page_pool_refill_alloc_cache violates
> >>> this requirement? How?
> >>
> >> As my understanding:
> >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> >> taking r->consumer_lock, when the above data race happens, it will
> >> exit out and allocate page from the page allocator instead of reusing
> >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> >> is more reliable.
> >
> > Question is how do we know it's more reliable?
> > It would be nice if we did actually make it more reliable,
> > as it is we are just shifting races around.
> >
> >
> >>>
> >>> It looks like you are trying to make the guarantee stronger and ensure
> >>> no false positives.
> >>>
> >>> If yes please document this as such, update the comment so all
> >>> code can be evaluated with the eye towards whether the new stronger
> >>> guarantee is maintained. In particular I think I see at least one
> >>> issue with this immediately.
> >>>
> >>>
> >>>> Move the r->queue[] clearing after consumer_head moving forward
> >>>> to make __ptr_ring_empty() checking more reliable.
> >>>>
> >>>> As a side effect of above change, a consumer_head checking is
> >>>> avoided for the likely case, and it has noticeable performance
> >>>> improvement when it is tested using the ptr_ring_test selftest
> >>>> added in the previous patch.
> >>>>
> >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >>>> to test the case of single thread doing both the enqueuing and
> >>>> dequeuing:
> >>>>
> >>>> arch     unpatched    patched    delta
> >>>> arm64    4648 ms      4464 ms    +3.9%
> >>>> X86      2562 ms      2401 ms    +6.2%
> >>>>
> >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >>>> to test the case of one thread doing enqueuing and another thread
> >>>> doing dequeuing concurrently, also known as single-producer/single-
> >>>> consumer:
> >>>>
> >>>> arch     unpatched            patched              delta
> >>>> arm64    3624 ms + 3624 ms    3462 ms + 3462 ms    +4.4%
> >>>> x86      2758 ms + 2758 ms    2547 ms + 2547 ms    +7.6%
> >>>>
> >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >>>> ---
> >>>> V2: Add performance data.
> >>>> ---
> >>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>>>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >>>> index 808f9d3..db9c282 100644
> >>>> --- a/include/linux/ptr_ring.h
> >>>> +++ b/include/linux/ptr_ring.h
> >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >>>>  	 * to work correctly.
> >>>>  	 */
> >>>> -	int consumer_head = r->consumer_head;
> >>>> -	int head = consumer_head++;
> >>>> +	int consumer_head = r->consumer_head + 1;
> >>>>
> >>>>  	/* Once we have processed enough entries invalidate them in
> >>>>  	 * the ring all at once so producer can reuse their space in the ring.
> >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	 */
> >>>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >>>>  		     consumer_head >= r->size)) {
> >>>> +		int tail = r->consumer_tail;
> >>>> +
> >>>> +		if (unlikely(consumer_head >= r->size)) {
> >>>> +			r->consumer_tail = 0;
> >>>> +			WRITE_ONCE(r->consumer_head, 0);
> >>>> +		} else {
> >>>> +			r->consumer_tail = consumer_head;
> >>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >>>> +		}
> >>>> +
> >>>>  		/* Zero out entries in the reverse order: this way we touch the
> >>>>  		 * cache line that producer might currently be reading the last;
> >>>>  		 * producer won't make progress and touch other cache lines
> >>>>  		 * besides the first one until we write out all entries.
> >>>>  		 */
> >>>> -		while (likely(head >= r->consumer_tail))
> >>>> -			r->queue[head--] = NULL;
> >>>> -		r->consumer_tail = consumer_head;
> >>>> -	}
> >>>> -	if (unlikely(consumer_head >= r->size)) {
> >>>> -		consumer_head = 0;
> >>>> -		r->consumer_tail = 0;
> >>>> +		while (likely(--consumer_head >= tail))
> >>>> +			r->queue[consumer_head] = NULL;
> >>>> +
> >>>> +		return;
> >>>
> >>>
> >>> So if now we need this to be reliable then
> >>> we also need smp_wmb before writing r->queue[consumer_head],
> >>> there could be other gotchas.
> >>
> >> Yes, this patch does not make it strictly reliable.
> >> I think I could mention that in the commit log?
> >
> > OK so it's not that it makes it more reliable - this patch simply makes
> > a possible false positive less likely while making a false negative
> > more likely. Our assumption is that a false negative is cheaper then?
> >
> > How do we know that it is?
> >
> > And even if we prove the ptr_ring itself is faster now,
> > how do we know what affects callers in a better way, a
> > false positive or a false negative?
> >
> > I would rather we worked on actually making it reliable,
> > e.g. if we can guarantee no false positives, that would be
> > a net win.
> I thought deeper about the case you mentioned above, it
> seems for the above to happen, the consumer_head needs to
> be rolled back to zero and incremented to the point when
> the caller of __ptr_ring_empty() is still *not* able to see the
> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
>
> It seems smp_wmb() only needs to be done once when consumer_head
> is rolled back to zero, and maybe that is enough to make sure the
> case you mentioned is fixed too?
>
> And the smp_wmb() is only done once in a round of producing/
> consuming, so the performance impact should be minimized? (of
> course we need to test it too).

Sorry I don't really understand the question here.
I think I agree it's enough to do one smp_wmb between
the write of r->queue and the write of consumer_head
to help guarantee no false positives.
What other code changes are necessary I can't yet say
without a deeper code review.

>
> >
> >>
> >>>
> >>>>  	}
> >>>> +
> >>>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >>>>  	WRITE_ONCE(r->consumer_head, consumer_head);
> >>>>  }
> >>>> --
> >>>> 2.7.4
> >>>
> >>>
> >>> .
> >>>
> >
> >
> > .
> >
On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> >>
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >> arch     unpatched    patched    delta
> >> arm64    4648 ms      4464 ms    +3.9%
> >> X86      2562 ms      2401 ms    +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >> arch     unpatched            patched              delta
> >> arm64    3624 ms + 3624 ms    3462 ms + 3462 ms    +4.4%
> >> x86      2758 ms + 2758 ms    2547 ms + 2547 ms    +7.6%
> >
> > Nice but it's small - could be a fluke.
> > How many tests did you run? What is the variance?
> > Did you try pinning to different CPUs to observe numa effects?
> > Please use perf or some other modern tool for this kind
> > of benchmark. Thanks!
>
> The result is quite stable, and retest using perf stat:

How stable exactly? Try with -r so we can find out.
> ---------------unpatched ptr_ring.c begin----------------------------------
>
> perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us
>
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
>
>         2385.49 msec task-clock        # 1.000 CPUs utilized
>              26      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.024 K/sec
>      6202023521      cycles            # 2.600 GHz
>     17424187640      instructions      # 2.81 insn per cycle
> <not supported>      branches
>         6506477      branch-misses
>
>     2.385785170 seconds time elapsed
>
>     2.384014000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us
>
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
>
>         2383.67 msec task-clock        # 1.000 CPUs utilized
>              26      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.024 K/sec
>      6197278066      cycles            # 2.600 GHz
>     17424207772      instructions      # 2.81 insn per cycle
> <not supported>      branches
>         6495766      branch-misses
>
>     2.383941170 seconds time elapsed
>
>     2.382215000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us
>
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
>
>         2391.16 msec task-clock        # 1.000 CPUs utilized
>              25      context-switches  # 0.010 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.024 K/sec
>      6216704120      cycles            # 2.600 GHz
>     17424243041      instructions      # 2.80 insn per cycle
> <not supported>      branches
>         6483886      branch-misses
>
>     2.391420440 seconds time elapsed
>
>     2.389647000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us
>
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
>
>         2390.10 msec task-clock        # 1.000 CPUs utilized
>              26      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              58      page-faults       # 0.024 K/sec
>      6213995715      cycles            # 2.600 GHz
>     17424227499      instructions      # 2.80 insn per cycle
> <not supported>      branches
>         6474069      branch-misses
>
>     2.390367070 seconds time elapsed
>
>     2.388644000 seconds user
>     0.000000000 seconds sys
>
> ---------------unpatched ptr_ring.c end----------------------------------
>
>
> ---------------patched ptr_ring.c begin----------------------------------
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us
>
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
>
>         2199.18 msec task-clock        # 1.000 CPUs utilized
>              23      context-switches  # 0.010 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              56      page-faults       # 0.025 K/sec
>      5717671859      cycles            # 2.600 GHz
>     16124164124      instructions      # 2.82 insn per cycle
> <not supported>      branches
>         6564829      branch-misses
>
>     2.199445990 seconds time elapsed
>
>     2.197859000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us
>
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
>
>         2222.63 msec task-clock        # 1.000 CPUs utilized
>              23      context-switches  # 0.010 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.026 K/sec
>      5778632853      cycles            # 2.600 GHz
>     16124210769      instructions      # 2.79 insn per cycle
> <not supported>      branches
>         6603904      branch-misses
>
>     2.222901020 seconds time elapsed
>
>     2.221312000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us
>
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
>
>         2252.28 msec task-clock        # 1.000 CPUs utilized
>              25      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.025 K/sec
>      5855668335      cycles            # 2.600 GHz
>     16124310588      instructions      # 2.75 insn per cycle
> <not supported>      branches
>         6777279      branch-misses
>
>     2.252543340 seconds time elapsed
>
>     2.250897000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~#
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us
>
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
>
>         2209.70 msec task-clock        # 1.000 CPUs utilized
>              24      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              58      page-faults       # 0.026 K/sec
>      5745003772      cycles            # 2.600 GHz
>     16124198886      instructions      # 2.81 insn per cycle
> <not supported>      branches
>         6508414      branch-misses
>
>     2.209973960 seconds time elapsed
>
>     2.208354000 seconds user
>     0.000000000 seconds sys
>
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us
>
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
>
>         2211.70 msec task-clock        # 1.000 CPUs utilized
>              24      context-switches  # 0.011 K/sec
>               0      cpu-migrations    # 0.000 K/sec
>              57      page-faults       # 0.026 K/sec
>      5750136694      cycles            # 2.600 GHz
>     16124176577      instructions      # 2.80 insn per cycle
> <not supported>      branches
>         6553023      branch-misses
>
>     2.211968470 seconds time elapsed
>
>     2.210303000 seconds user
>     0.000000000 seconds sys
>
> ---------------patched ptr_ring.c end----------------------------------
On 2021/6/27 14:07, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>> arch     unpatched    patched    delta
>>>> arm64    4648 ms      4464 ms    +3.9%
>>>> X86      2562 ms      2401 ms    +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>> arch     unpatched            patched              delta
>>>> arm64    3624 ms + 3624 ms    3462 ms + 3462 ms    +4.4%
>>>> x86      2758 ms + 2758 ms    2547 ms + 2547 ms    +7.6%
>>>
>>> Nice but it's small - could be a fluke.
>>> How many tests did you run? What is the variance?
>>> Did you try pinning to different CPUs to observe numa effects?
>>> Please use perf or some other modern tool for this kind
>>> of benchmark. Thanks!
>>
>> The result is quite stable, and retest using perf stat:
>
> How stable exactly? Try with -r so we can find out.
Retest with "perf stat -r":

For unpatched one:

 Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs):

        6780.97 msec task-clock        # 2.000 CPUs utilized   ( +-  5.36% )
             73      context-switches  # 0.011 K/sec           ( +-  5.07% )
              0      cpu-migrations    # 0.000 K/sec           ( +-100.00% )
             81      page-faults       # 0.012 K/sec           ( +-  0.76% )
    17629544748      cycles            # 2.600 GHz             ( +-  5.36% )
    25496488950      instructions      # 1.45 insn per cycle   ( +-  0.26% )
<not supported>      branches
       11489031      branch-misses                             ( +-  1.69% )

          3.391 +- 0.182 seconds time elapsed  ( +- 5.35% )

For patched one:

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs):

        6567.83 msec task-clock        # 2.000 CPUs utilized   ( +-  5.53% )
             71      context-switches  # 0.011 K/sec           ( +-  5.26% )
              0      cpu-migrations    # 0.000 K/sec
             82      page-faults       # 0.012 K/sec           ( +-  0.85% )
    17075489298      cycles            # 2.600 GHz             ( +-  5.53% )
    23861051578      instructions      # 1.40 insn per cycle   ( +-  0.07% )
<not supported>      branches
       10473776      branch-misses                             ( +-  0.60% )

          3.284 +- 0.182 seconds time elapsed  ( +- 5.53% )

The result is more stable when using taskset to limit the running cpu,
but I suppose the above data is stable enough to justify the
performance improvement?
On 2021/6/27 14:03, Michael S. Tsirkin wrote:
>>>>>
>>>>>
>>>>> So if now we need this to be reliable then
>>>>> we also need smp_wmb before writing r->queue[consumer_head],
>>>>> there could be other gotchas.
>>>>
>>>> Yes, this patch does not make it strictly reliable.
>>>> I think I could mention that in the commit log?
>>>
>>> OK so it's not that it makes it more reliable - this patch simply makes
>>> a possible false positive less likely while making a false negative
>>> more likely. Our assumption is that a false negative is cheaper then?
>>>
>>> How do we know that it is?
>>>
>>> And even if we prove the ptr_ring itself is faster now,
>>> how do we know what affects callers in a better way, a
>>> false positive or a false negative?
>>>
>>> I would rather we worked on actually making it reliable,
>>> e.g. if we can guarantee no false positives, that would be
>>> a net win.
>> I thought deeper about the case you mentioned above, it
>> seems for the above to happen, the consumer_head needs to
>> be rolled back to zero and incremented to the point when
>> the caller of __ptr_ring_empty() is still *not* able to see the
>> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
>>
>> It seems smp_wmb() only needs to be done once when consumer_head
>> is rolled back to zero, and maybe that is enough to make sure the
>> case you mentioned is fixed too?
>>
>> And the smp_wmb() is only done once in a round of producing/
>> consuming, so the performance impact should be minimized? (of
>> course we need to test it too).
>
>
> Sorry I don't really understand the question here.
> I think I agree it's enough to do one smp_wmb between
> the write of r->queue and the write of consumer_head
> to help guarantee no false positives.
> What other code changes are necessary I can't yet say
> without a deeper code review.
>

Ok, thanks for the review. Will add handling of the case you mentioned
above in V3 if there is no noticeable performance impact for handling
the above case.
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 808f9d3..db9c282 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
 	 * to work correctly.
 	 */
-	int consumer_head = r->consumer_head;
-	int head = consumer_head++;
+	int consumer_head = r->consumer_head + 1;
 
 	/* Once we have processed enough entries invalidate them in
 	 * the ring all at once so producer can reuse their space in the ring.
@@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	 */
 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
 		     consumer_head >= r->size)) {
+		int tail = r->consumer_tail;
+
+		if (unlikely(consumer_head >= r->size)) {
+			r->consumer_tail = 0;
+			WRITE_ONCE(r->consumer_head, 0);
+		} else {
+			r->consumer_tail = consumer_head;
+			WRITE_ONCE(r->consumer_head, consumer_head);
+		}
+
 		/* Zero out entries in the reverse order: this way we touch the
 		 * cache line that producer might currently be reading the last;
 		 * producer won't make progress and touch other cache lines
 		 * besides the first one until we write out all entries.
 		 */
-		while (likely(head >= r->consumer_tail))
-			r->queue[head--] = NULL;
-		r->consumer_tail = consumer_head;
-	}
-	if (unlikely(consumer_head >= r->size)) {
-		consumer_head = 0;
-		r->consumer_tail = 0;
+		while (likely(--consumer_head >= tail))
+			r->queue[consumer_head] = NULL;
+
+		return;
 	}
+
 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
 	WRITE_ONCE(r->consumer_head, consumer_head);
 }
Currently r->queue[] is cleared after r->consumer_head is moved
forward, which makes the __ptr_ring_empty() checking called in
page_pool_refill_alloc_cache() unreliable if the checking is done
after the r->queue clearing and before the consumer_head moving
forward.

Move the r->queue[] clearing after consumer_head moving forward
to make __ptr_ring_empty() checking more reliable.

As a side effect of above change, a consumer_head checking is
avoided for the likely case, and it has noticeable performance
improvement when it is tested using the ptr_ring_test selftest
added in the previous patch.

Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
to test the case of single thread doing both the enqueuing and
dequeuing:

 arch     unpatched    patched    delta
 arm64    4648 ms      4464 ms    +3.9%
 X86      2562 ms      2401 ms    +6.2%

Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
to test the case of one thread doing enqueuing and another thread
doing dequeuing concurrently, also known as single-producer/single-
consumer:

 arch     unpatched            patched              delta
 arm64    3624 ms + 3624 ms    3462 ms + 3462 ms    +4.4%
 x86      2758 ms + 2758 ms    2547 ms + 2547 ms    +7.6%

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V2: Add performance data.
---
 include/linux/ptr_ring.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)