Message ID | 20240604031124.2261-1-yang.yang@vivo.com (mailing list archive)
---|---
State | New, archived
Series | [v2] sbitmap: fix io hung due to race on sbitmap_word::cleared
On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: > > Configuration for sbq: > depth=64, wake_batch=6, shift=6, map_nr=1 > > 1. There are 64 requests in progress: > map->word = 0xFFFFFFFFFFFFFFFF > 2. After all the 64 requests complete, and no more requests come: > map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF > 3. Now two tasks try to allocate requests: > T1: T2: > __blk_mq_get_tag . > __sbitmap_queue_get . > sbitmap_get . > sbitmap_find_bit . > sbitmap_find_bit_in_word . > __sbitmap_get_word -> nr=-1 __blk_mq_get_tag > sbitmap_deferred_clear __sbitmap_queue_get > /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit > if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word > return false; __sbitmap_get_word -> nr=-1 > mask = xchg(&map->cleared, 0) sbitmap_deferred_clear > atomic_long_andnot() /* map->cleared=0 */ > if (!(map->cleared)) > return false; > /* > * map->cleared is cleared by T1 > * T2 fail to acquire the tag > */ > > 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken > up due to the wake_batch being set at 6. If no more requests come, T1 > will wait here indefinitely. > > To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: > remove swap_lock"), which causes this issue. I'd suggest to add the following words in commit log: Check on ->cleared and update on both ->cleared and ->word need to be done atomically, and using spinlock could be the simplest solution. Otherwise, the patch looks fine for me. Thanks,
Hi, 在 2024/06/04 11:25, Ming Lei 写道: > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >> >> Configuration for sbq: >> depth=64, wake_batch=6, shift=6, map_nr=1 >> >> 1. There are 64 requests in progress: >> map->word = 0xFFFFFFFFFFFFFFFF >> 2. After all the 64 requests complete, and no more requests come: >> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >> 3. Now two tasks try to allocate requests: >> T1: T2: >> __blk_mq_get_tag . >> __sbitmap_queue_get . >> sbitmap_get . >> sbitmap_find_bit . >> sbitmap_find_bit_in_word . >> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >> sbitmap_deferred_clear __sbitmap_queue_get >> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >> return false; __sbitmap_get_word -> nr=-1 >> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >> atomic_long_andnot() /* map->cleared=0 */ >> if (!(map->cleared)) >> return false; >> /* >> * map->cleared is cleared by T1 >> * T2 fail to acquire the tag >> */ >> >> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >> up due to the wake_batch being set at 6. If no more requests come, T1 >> will wait here indefinitely. >> >> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >> remove swap_lock"), which causes this issue. > > I'd suggest to add the following words in commit log: > > Check on ->cleared and update on both ->cleared and ->word need to be > done atomically, and using spinlock could be the simplest solution. > > Otherwise, the patch looks fine for me. Maybe I'm noob, but I'm confused how can this fix the problem, looks like the race condition doesn't change. 
In sbitmap_find_bit_in_word: 1) __sbitmap_get_word reads word; 2) sbitmap_deferred_clear clears cleared; 3) sbitmap_deferred_clear updates word; 2) and 3) are done atomically, while 1) can still run concurrently with 3): t1: sbitmap_find_bit_in_word __sbitmap_get_word -> read old word, return -1 t2: sbitmap_find_bit_in_word __sbitmap_get_word -> read old word, return -1 sbitmap_deferred_clear -> clear cleared and update word sbitmap_deferred_clear -> cleared is cleared, fail BTW, I still think it's fine to fix this problem by retrying __sbitmap_get_word() at least one more time after it fails. Thanks, Kuai
On 2024/6/4 11:25, Ming Lei wrote: > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >> >> Configuration for sbq: >> depth=64, wake_batch=6, shift=6, map_nr=1 >> >> 1. There are 64 requests in progress: >> map->word = 0xFFFFFFFFFFFFFFFF >> 2. After all the 64 requests complete, and no more requests come: >> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >> 3. Now two tasks try to allocate requests: >> T1: T2: >> __blk_mq_get_tag . >> __sbitmap_queue_get . >> sbitmap_get . >> sbitmap_find_bit . >> sbitmap_find_bit_in_word . >> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >> sbitmap_deferred_clear __sbitmap_queue_get >> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >> return false; __sbitmap_get_word -> nr=-1 >> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >> atomic_long_andnot() /* map->cleared=0 */ >> if (!(map->cleared)) >> return false; >> /* >> * map->cleared is cleared by T1 >> * T2 fail to acquire the tag >> */ >> >> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >> up due to the wake_batch being set at 6. If no more requests come, T1 >> will wait here indefinitely. >> >> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >> remove swap_lock"), which causes this issue. > > I'd suggest to add the following words in commit log: > > Check on ->cleared and update on both ->cleared and ->word need to be > done atomically, and using spinlock could be the simplest solution. Thank you. I will handle this in V3. > > Otherwise, the patch looks fine for me. > > Thanks, >
On 2024/6/4 14:12, Yu Kuai wrote: > Hi, > > 在 2024/06/04 11:25, Ming Lei 写道: >> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>> >>> Configuration for sbq: >>> depth=64, wake_batch=6, shift=6, map_nr=1 >>> >>> 1. There are 64 requests in progress: >>> map->word = 0xFFFFFFFFFFFFFFFF >>> 2. After all the 64 requests complete, and no more requests come: >>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>> 3. Now two tasks try to allocate requests: >>> T1: T2: >>> __blk_mq_get_tag . >>> __sbitmap_queue_get . >>> sbitmap_get . >>> sbitmap_find_bit . >>> sbitmap_find_bit_in_word . >>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>> sbitmap_deferred_clear __sbitmap_queue_get >>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>> return false; __sbitmap_get_word -> nr=-1 >>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>> atomic_long_andnot() /* map->cleared=0 */ >>> if (!(map->cleared)) >>> return false; >>> /* >>> * map->cleared is cleared by T1 >>> * T2 fail to acquire the tag >>> */ >>> >>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>> up due to the wake_batch being set at 6. If no more requests come, T1 >>> will wait here indefinitely. >>> >>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>> remove swap_lock"), which causes this issue. >> >> I'd suggest to add the following words in commit log: >> >> Check on ->cleared and update on both ->cleared and ->word need to be >> done atomically, and using spinlock could be the simplest solution. >> >> Otherwise, the patch looks fine for me. > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > like the race condition doesn't change. 
> > In sbitmap_find_bit_in_word: > > 1) __sbitmap_get_word read word; > 2) sbitmap_deferred_clear clear cleared; > 3) sbitmap_deferred_clear update word; > > 2) and 3) are done atomically while 1) can still concurrent with 3): > > t1: > sbitmap_find_bit_in_word > __sbitmap_get_word > -> read old word, return -1 > t2: > sbitmap_find_bit_in_word > __sbitmap_get_word > -> read old word, return -1 > sbitmap_deferred_clear > -> clear cleared and update word > sbitmap_deferred_clear > -> cleared is cleared, fail Yes, you are right, this patch cannot fix this issue. > > BYW, I still think it's fine to fix this problem by trying the > __sbitmap_get_word() at least one more time if __sbitmap_get_word() > failed. Err, even after trying one more time, __sbitmap_get_word() may still fail. > > Thanks, > Kuai
On Tue, Jun 04, 2024 at 02:12:22PM +0800, Yu Kuai wrote: > Hi, > > 在 2024/06/04 11:25, Ming Lei 写道: > > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: > > > > > > Configuration for sbq: > > > depth=64, wake_batch=6, shift=6, map_nr=1 > > > > > > 1. There are 64 requests in progress: > > > map->word = 0xFFFFFFFFFFFFFFFF > > > 2. After all the 64 requests complete, and no more requests come: > > > map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF > > > 3. Now two tasks try to allocate requests: > > > T1: T2: > > > __blk_mq_get_tag . > > > __sbitmap_queue_get . > > > sbitmap_get . > > > sbitmap_find_bit . > > > sbitmap_find_bit_in_word . > > > __sbitmap_get_word -> nr=-1 __blk_mq_get_tag > > > sbitmap_deferred_clear __sbitmap_queue_get > > > /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit > > > if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word > > > return false; __sbitmap_get_word -> nr=-1 > > > mask = xchg(&map->cleared, 0) sbitmap_deferred_clear > > > atomic_long_andnot() /* map->cleared=0 */ > > > if (!(map->cleared)) > > > return false; > > > /* > > > * map->cleared is cleared by T1 > > > * T2 fail to acquire the tag > > > */ > > > > > > 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken > > > up due to the wake_batch being set at 6. If no more requests come, T1 > > > will wait here indefinitely. > > > > > > To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: > > > remove swap_lock"), which causes this issue. > > > > I'd suggest to add the following words in commit log: > > > > Check on ->cleared and update on both ->cleared and ->word need to be > > done atomically, and using spinlock could be the simplest solution. > > > > Otherwise, the patch looks fine for me. > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > like the race condition doesn't change. 
> > In sbitmap_find_bit_in_word: > > 1) __sbitmap_get_word read word; > 2) sbitmap_deferred_clear clear cleared; > 3) sbitmap_deferred_clear update word; > > 2) and 3) are done atomically while 1) can still concurrent with 3): After 1) fails, sbitmap_deferred_clear() is called with spinlock, then it is pretty easy to solve the race, such as, the following patch against the revert patch. diff --git a/lib/sbitmap.c b/lib/sbitmap.c index dee02a0266a6..c015ecd8e10e 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -63,13 +63,15 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) { unsigned long mask; - bool ret = false; unsigned long flags; + bool ret; spin_lock_irqsave(&map->swap_lock, flags); - if (!map->cleared) + if (!map->cleared) { + ret = !!map->word; goto out_unlock; + } /* * First get a stable cleared mask, setting the old mask to 0. Thanks, Ming
On 2024/6/6 11:12, Ming Lei wrote: > On Tue, Jun 04, 2024 at 02:12:22PM +0800, Yu Kuai wrote: >> Hi, >> >> 在 2024/06/04 11:25, Ming Lei 写道: >>> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>>> >>>> Configuration for sbq: >>>> depth=64, wake_batch=6, shift=6, map_nr=1 >>>> >>>> 1. There are 64 requests in progress: >>>> map->word = 0xFFFFFFFFFFFFFFFF >>>> 2. After all the 64 requests complete, and no more requests come: >>>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>>> 3. Now two tasks try to allocate requests: >>>> T1: T2: >>>> __blk_mq_get_tag . >>>> __sbitmap_queue_get . >>>> sbitmap_get . >>>> sbitmap_find_bit . >>>> sbitmap_find_bit_in_word . >>>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>>> sbitmap_deferred_clear __sbitmap_queue_get >>>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>>> return false; __sbitmap_get_word -> nr=-1 >>>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>>> atomic_long_andnot() /* map->cleared=0 */ >>>> if (!(map->cleared)) >>>> return false; >>>> /* >>>> * map->cleared is cleared by T1 >>>> * T2 fail to acquire the tag >>>> */ >>>> >>>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>>> up due to the wake_batch being set at 6. If no more requests come, T1 >>>> will wait here indefinitely. >>>> >>>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>>> remove swap_lock"), which causes this issue. >>> >>> I'd suggest to add the following words in commit log: >>> >>> Check on ->cleared and update on both ->cleared and ->word need to be >>> done atomically, and using spinlock could be the simplest solution. >>> >>> Otherwise, the patch looks fine for me. >> >> Maybe I'm noob, but I'm confused how can this fix the problem, looks >> like the race condition doesn't change. 
>> >> In sbitmap_find_bit_in_word: >> >> 1) __sbitmap_get_word read word; >> 2) sbitmap_deferred_clear clear cleared; >> 3) sbitmap_deferred_clear update word; >> >> 2) and 3) are done atomically while 1) can still concurrent with 3): > > After 1) fails, sbitmap_deferred_clear() is called with spinlock, > then it is pretty easy to solve the race, such as, the following patch > against the revert patch. > > > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > index dee02a0266a6..c015ecd8e10e 100644 > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -63,13 +63,15 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, > static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) > { > unsigned long mask; > - bool ret = false; > unsigned long flags; > + bool ret; > > spin_lock_irqsave(&map->swap_lock, flags); > > - if (!map->cleared) > + if (!map->cleared) { > + ret = !!map->word; After atomic_long_andnot(mask, (atomic_long_t *)&map->word), map->word may be 0 if all requests have completed, or not 0 if some requests are still in flight. Therefore, using !!map->word to determine the availability of free tags is inaccurate. Thanks > goto out_unlock; > + } > > /* > * First get a stable cleared mask, setting the old mask to 0. > > > Thanks, > Ming >
On 2024/6/4 14:12, Yu Kuai wrote: > Hi, > > 在 2024/06/04 11:25, Ming Lei 写道: >> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>> >>> Configuration for sbq: >>> depth=64, wake_batch=6, shift=6, map_nr=1 >>> >>> 1. There are 64 requests in progress: >>> map->word = 0xFFFFFFFFFFFFFFFF >>> 2. After all the 64 requests complete, and no more requests come: >>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>> 3. Now two tasks try to allocate requests: >>> T1: T2: >>> __blk_mq_get_tag . >>> __sbitmap_queue_get . >>> sbitmap_get . >>> sbitmap_find_bit . >>> sbitmap_find_bit_in_word . >>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>> sbitmap_deferred_clear __sbitmap_queue_get >>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>> return false; __sbitmap_get_word -> nr=-1 >>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>> atomic_long_andnot() /* map->cleared=0 */ >>> if (!(map->cleared)) >>> return false; >>> /* >>> * map->cleared is cleared by T1 >>> * T2 fail to acquire the tag >>> */ >>> >>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>> up due to the wake_batch being set at 6. If no more requests come, T1 >>> will wait here indefinitely. >>> >>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>> remove swap_lock"), which causes this issue. >> >> I'd suggest to add the following words in commit log: >> >> Check on ->cleared and update on both ->cleared and ->word need to be >> done atomically, and using spinlock could be the simplest solution. >> >> Otherwise, the patch looks fine for me. > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > like the race condition doesn't change. 
> > In sbitmap_find_bit_in_word: > > 1) __sbitmap_get_word read word; > 2) sbitmap_deferred_clear clear cleared; > 3) sbitmap_deferred_clear update word; > > 2) and 3) are done atomically while 1) can still concurrent with 3): > > t1: > sbitmap_find_bit_in_word > __sbitmap_get_word > -> read old word, return -1 > t2: > sbitmap_find_bit_in_word > __sbitmap_get_word > -> read old word, return -1 > sbitmap_deferred_clear > -> clear cleared and update word > sbitmap_deferred_clear > -> cleared is cleared, fail > > BYW, I still think it's fine to fix this problem by trying the > __sbitmap_get_word() at least one more time if __sbitmap_get_word() > failed. How about this one: 1. Add extra check in sbitmap_find_bit_in_word() referenced from Yu kuai's suggestion. 2. Change from atomic_long_andnot to atomic_long_fetch_andnot_release --- include/linux/sbitmap.h | 5 +++++ lib/sbitmap.c | 23 ++++++++++++++++++----- 2 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h index d662cf136021..ec0b0e73c906 100644 --- a/include/linux/sbitmap.h +++ b/include/linux/sbitmap.h @@ -36,6 +36,11 @@ struct sbitmap_word { * @cleared: word holding cleared bits */ unsigned long cleared ____cacheline_aligned_in_smp; + + /** + * @swap_lock: Held while swapping word <-> cleared + */ + spinlock_t swap_lock; } ____cacheline_aligned_in_smp; /** diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 1e453f825c05..63dadf91e40b 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -63,9 +63,13 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) { unsigned long mask; + bool ret = false; + unsigned long flags; - if (!READ_ONCE(map->cleared)) - return false; + spin_lock_irqsave(&map->swap_lock, flags); + + if (!map->cleared) + goto out_unlock; /* * First get a stable cleared mask, setting the old mask to 0. 
@@ -75,9 +79,12 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) /* * Now clear the masked bits in our free word */ - atomic_long_andnot(mask, (atomic_long_t *)&map->word); + atomic_long_fetch_andnot_release(mask, (atomic_long_t *)&map->word); BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word)); - return true; + ret = true; +out_unlock: + spin_unlock_irqrestore(&map->swap_lock, flags); + return ret; } int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, @@ -85,6 +92,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, bool alloc_hint) { unsigned int bits_per_word; + int i; if (shift < 0) shift = sbitmap_calculate_shift(depth); @@ -116,6 +124,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, return -ENOMEM; } + for (i = 0; i < sb->map_nr; i++) + spin_lock_init(&sb->map[i].swap_lock); + return 0; } EXPORT_SYMBOL_GPL(sbitmap_init_node); @@ -175,11 +186,13 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map, int nr; do { + unsigned long cleared = READ_ONCE(map->cleared); + nr = __sbitmap_get_word(&map->word, depth, alloc_hint, wrap); if (nr != -1) break; - if (!sbitmap_deferred_clear(map)) + if (!sbitmap_deferred_clear(map) && !cleared) break; } while (1);
On Thu, Jun 06, 2024 at 03:21:38PM +0800, YangYang wrote: > On 2024/6/6 11:12, Ming Lei wrote: > > On Tue, Jun 04, 2024 at 02:12:22PM +0800, Yu Kuai wrote: > > > Hi, > > > > > > 在 2024/06/04 11:25, Ming Lei 写道: > > > > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: > > > > > > > > > > Configuration for sbq: > > > > > depth=64, wake_batch=6, shift=6, map_nr=1 > > > > > > > > > > 1. There are 64 requests in progress: > > > > > map->word = 0xFFFFFFFFFFFFFFFF > > > > > 2. After all the 64 requests complete, and no more requests come: > > > > > map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF > > > > > 3. Now two tasks try to allocate requests: > > > > > T1: T2: > > > > > __blk_mq_get_tag . > > > > > __sbitmap_queue_get . > > > > > sbitmap_get . > > > > > sbitmap_find_bit . > > > > > sbitmap_find_bit_in_word . > > > > > __sbitmap_get_word -> nr=-1 __blk_mq_get_tag > > > > > sbitmap_deferred_clear __sbitmap_queue_get > > > > > /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit > > > > > if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word > > > > > return false; __sbitmap_get_word -> nr=-1 > > > > > mask = xchg(&map->cleared, 0) sbitmap_deferred_clear > > > > > atomic_long_andnot() /* map->cleared=0 */ > > > > > if (!(map->cleared)) > > > > > return false; > > > > > /* > > > > > * map->cleared is cleared by T1 > > > > > * T2 fail to acquire the tag > > > > > */ > > > > > > > > > > 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken > > > > > up due to the wake_batch being set at 6. If no more requests come, T1 > > > > > will wait here indefinitely. > > > > > > > > > > To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: > > > > > remove swap_lock"), which causes this issue. 
> > > > > > > > I'd suggest to add the following words in commit log: > > > > > > > > Check on ->cleared and update on both ->cleared and ->word need to be > > > > done atomically, and using spinlock could be the simplest solution. > > > > > > > > Otherwise, the patch looks fine for me. > > > > > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > > > like the race condition doesn't change. > > > > > > In sbitmap_find_bit_in_word: > > > > > > 1) __sbitmap_get_word read word; > > > 2) sbitmap_deferred_clear clear cleared; > > > 3) sbitmap_deferred_clear update word; > > > > > > 2) and 3) are done atomically while 1) can still concurrent with 3): > > > > After 1) fails, sbitmap_deferred_clear() is called with spinlock, > > then it is pretty easy to solve the race, such as, the following patch > > against the revert patch. > > > > > > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > > index dee02a0266a6..c015ecd8e10e 100644 > > --- a/lib/sbitmap.c > > +++ b/lib/sbitmap.c > > @@ -63,13 +63,15 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, > > static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) > > { > > unsigned long mask; > > - bool ret = false; > > unsigned long flags; > > + bool ret; > > spin_lock_irqsave(&map->swap_lock, flags); > > - if (!map->cleared) > > + if (!map->cleared) { > > + ret = !!map->word; > > After atomic_long_andnot(mask, (atomic_long_t *)&map->word), map->word > may be 0 if all requests have completed, or not 0 if some requests are > still in flight. setting ->word is lockless, but zeroing ->word is serialized with ->swap_lock. > Therefore, using !!map->word to determine the > availability of free tags is inaccurate. The check should be changed to decide if any free bit is available in map->word instead of !!map->word, and 'shift' need to be passed in sbitmap_deferred_clear(). Just be curious, do you have reproducer for this issue? Thanks, Ming
On 2024/6/6 17:02, Ming Lei wrote: > On Thu, Jun 06, 2024 at 03:21:38PM +0800, YangYang wrote: >> On 2024/6/6 11:12, Ming Lei wrote: >>> On Tue, Jun 04, 2024 at 02:12:22PM +0800, Yu Kuai wrote: >>>> Hi, >>>> >>>> 在 2024/06/04 11:25, Ming Lei 写道: >>>>> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>>>>> >>>>>> Configuration for sbq: >>>>>> depth=64, wake_batch=6, shift=6, map_nr=1 >>>>>> >>>>>> 1. There are 64 requests in progress: >>>>>> map->word = 0xFFFFFFFFFFFFFFFF >>>>>> 2. After all the 64 requests complete, and no more requests come: >>>>>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>>>>> 3. Now two tasks try to allocate requests: >>>>>> T1: T2: >>>>>> __blk_mq_get_tag . >>>>>> __sbitmap_queue_get . >>>>>> sbitmap_get . >>>>>> sbitmap_find_bit . >>>>>> sbitmap_find_bit_in_word . >>>>>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>>>>> sbitmap_deferred_clear __sbitmap_queue_get >>>>>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>>>>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>>>>> return false; __sbitmap_get_word -> nr=-1 >>>>>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>>>>> atomic_long_andnot() /* map->cleared=0 */ >>>>>> if (!(map->cleared)) >>>>>> return false; >>>>>> /* >>>>>> * map->cleared is cleared by T1 >>>>>> * T2 fail to acquire the tag >>>>>> */ >>>>>> >>>>>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>>>>> up due to the wake_batch being set at 6. If no more requests come, T1 >>>>>> will wait here indefinitely. >>>>>> >>>>>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>>>>> remove swap_lock"), which causes this issue. >>>>> >>>>> I'd suggest to add the following words in commit log: >>>>> >>>>> Check on ->cleared and update on both ->cleared and ->word need to be >>>>> done atomically, and using spinlock could be the simplest solution. >>>>> >>>>> Otherwise, the patch looks fine for me. 
>>>> >>>> Maybe I'm noob, but I'm confused how can this fix the problem, looks >>>> like the race condition doesn't change. >>>> >>>> In sbitmap_find_bit_in_word: >>>> >>>> 1) __sbitmap_get_word read word; >>>> 2) sbitmap_deferred_clear clear cleared; >>>> 3) sbitmap_deferred_clear update word; >>>> >>>> 2) and 3) are done atomically while 1) can still concurrent with 3): >>> >>> After 1) fails, sbitmap_deferred_clear() is called with spinlock, >>> then it is pretty easy to solve the race, such as, the following patch >>> against the revert patch. >>> >>> >>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c >>> index dee02a0266a6..c015ecd8e10e 100644 >>> --- a/lib/sbitmap.c >>> +++ b/lib/sbitmap.c >>> @@ -63,13 +63,15 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, >>> static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) >>> { >>> unsigned long mask; >>> - bool ret = false; >>> unsigned long flags; >>> + bool ret; >>> spin_lock_irqsave(&map->swap_lock, flags); >>> - if (!map->cleared) >>> + if (!map->cleared) { >>> + ret = !!map->word; >> >> After atomic_long_andnot(mask, (atomic_long_t *)&map->word), map->word >> may be 0 if all requests have completed, or not 0 if some requests are >> still in flight. > > setting ->word is lockless, but zeroing ->word is serialized with ->swap_lock. > >> Therefore, using !!map->word to determine the >> availability of free tags is inaccurate. > > The check should be changed to decide if any free bit is available in > map->word instead of !!map->word, and 'shift' need to be passed in > sbitmap_deferred_clear(). > > Just be curious, do you have reproducer for this issue? It's hard to reproduce, but it's reproducible. We've encountered this issue several times. Thanks > > Thanks, > Ming >
On Thu, Jun 06, 2024 at 04:55:17PM +0800, YangYang wrote: > On 2024/6/4 14:12, Yu Kuai wrote: > > Hi, > > > > 在 2024/06/04 11:25, Ming Lei 写道: > > > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: > > > > > > > > Configuration for sbq: > > > > depth=64, wake_batch=6, shift=6, map_nr=1 > > > > > > > > 1. There are 64 requests in progress: > > > > map->word = 0xFFFFFFFFFFFFFFFF > > > > 2. After all the 64 requests complete, and no more requests come: > > > > map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF > > > > 3. Now two tasks try to allocate requests: > > > > T1: T2: > > > > __blk_mq_get_tag . > > > > __sbitmap_queue_get . > > > > sbitmap_get . > > > > sbitmap_find_bit . > > > > sbitmap_find_bit_in_word . > > > > __sbitmap_get_word -> nr=-1 __blk_mq_get_tag > > > > sbitmap_deferred_clear __sbitmap_queue_get > > > > /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit > > > > if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word > > > > return false; __sbitmap_get_word -> nr=-1 > > > > mask = xchg(&map->cleared, 0) sbitmap_deferred_clear > > > > atomic_long_andnot() /* map->cleared=0 */ > > > > if (!(map->cleared)) > > > > return false; > > > > /* > > > > * map->cleared is cleared by T1 > > > > * T2 fail to acquire the tag > > > > */ > > > > > > > > 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken > > > > up due to the wake_batch being set at 6. If no more requests come, T1 > > > > will wait here indefinitely. > > > > > > > > To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: > > > > remove swap_lock"), which causes this issue. > > > > > > I'd suggest to add the following words in commit log: > > > > > > Check on ->cleared and update on both ->cleared and ->word need to be > > > done atomically, and using spinlock could be the simplest solution. > > > > > > Otherwise, the patch looks fine for me. 
> > > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > > like the race condition doesn't change. > > > > In sbitmap_find_bit_in_word: > > > > 1) __sbitmap_get_word read word; > > 2) sbitmap_deferred_clear clear cleared; > > 3) sbitmap_deferred_clear update word; > > > > 2) and 3) are done atomically while 1) can still concurrent with 3): > > > > t1: > > sbitmap_find_bit_in_word > > __sbitmap_get_word > > -> read old word, return -1 > > t2: > > sbitmap_find_bit_in_word > > __sbitmap_get_word > > -> read old word, return -1 > > sbitmap_deferred_clear > > -> clear cleared and update word > > sbitmap_deferred_clear > > -> cleared is cleared, fail > > > > BYW, I still think it's fine to fix this problem by trying the > > __sbitmap_get_word() at least one more time if __sbitmap_get_word() > > failed. > > How about this one: > 1. Add extra check in sbitmap_find_bit_in_word() referenced from > Yu kuai's suggestion. > 2. Change from atomic_long_andnot to atomic_long_fetch_andnot_release > > --- > include/linux/sbitmap.h | 5 +++++ > lib/sbitmap.c | 23 ++++++++++++++++++----- > 2 files changed, 23 insertions(+), 5 deletions(-) > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h > index d662cf136021..ec0b0e73c906 100644 > --- a/include/linux/sbitmap.h > +++ b/include/linux/sbitmap.h > @@ -36,6 +36,11 @@ struct sbitmap_word { > * @cleared: word holding cleared bits > */ > unsigned long cleared ____cacheline_aligned_in_smp; > + > + /** > + * @swap_lock: Held while swapping word <-> cleared > + */ > + spinlock_t swap_lock; > } ____cacheline_aligned_in_smp; > > /** > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > index 1e453f825c05..63dadf91e40b 100644 > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -63,9 +63,13 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, > static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) > { > unsigned long mask; > + bool ret = false; > + unsigned long flags; > > - if 
(!READ_ONCE(map->cleared)) > - return false; > + spin_lock_irqsave(&map->swap_lock, flags); > + > + if (!map->cleared) > + goto out_unlock; > > /* > * First get a stable cleared mask, setting the old mask to 0. > @@ -75,9 +79,12 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) > /* > * Now clear the masked bits in our free word > */ > - atomic_long_andnot(mask, (atomic_long_t *)&map->word); > + atomic_long_fetch_andnot_release(mask, (atomic_long_t *)&map->word); > BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word)); > - return true; > + ret = true; > +out_unlock: > + spin_unlock_irqrestore(&map->swap_lock, flags); > + return ret; > } > > int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, > @@ -85,6 +92,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, > bool alloc_hint) > { > unsigned int bits_per_word; > + int i; > > if (shift < 0) > shift = sbitmap_calculate_shift(depth); > @@ -116,6 +124,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, > return -ENOMEM; > } > > + for (i = 0; i < sb->map_nr; i++) > + spin_lock_init(&sb->map[i].swap_lock); > + > return 0; > } > EXPORT_SYMBOL_GPL(sbitmap_init_node); > @@ -175,11 +186,13 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map, > int nr; > > do { > + unsigned long cleared = READ_ONCE(map->cleared); > + ->cleared is stored in standalone cacheline, the above line adds one extra l1 load, so I still prefer to check ->word & ->cleared in sbitmap_deferred_clear(). Thanks, Ming
On 2024/6/6 23:48, Ming Lei wrote: > On Thu, Jun 06, 2024 at 04:55:17PM +0800, YangYang wrote: >> On 2024/6/4 14:12, Yu Kuai wrote: >>> Hi, >>> >>> 在 2024/06/04 11:25, Ming Lei 写道: >>>> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>>>> >>>>> Configuration for sbq: >>>>> depth=64, wake_batch=6, shift=6, map_nr=1 >>>>> >>>>> 1. There are 64 requests in progress: >>>>> map->word = 0xFFFFFFFFFFFFFFFF >>>>> 2. After all the 64 requests complete, and no more requests come: >>>>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>>>> 3. Now two tasks try to allocate requests: >>>>> T1: T2: >>>>> __blk_mq_get_tag . >>>>> __sbitmap_queue_get . >>>>> sbitmap_get . >>>>> sbitmap_find_bit . >>>>> sbitmap_find_bit_in_word . >>>>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>>>> sbitmap_deferred_clear __sbitmap_queue_get >>>>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>>>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>>>> return false; __sbitmap_get_word -> nr=-1 >>>>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>>>> atomic_long_andnot() /* map->cleared=0 */ >>>>> if (!(map->cleared)) >>>>> return false; >>>>> /* >>>>> * map->cleared is cleared by T1 >>>>> * T2 fail to acquire the tag >>>>> */ >>>>> >>>>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>>>> up due to the wake_batch being set at 6. If no more requests come, T1 >>>>> will wait here indefinitely. >>>>> >>>>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>>>> remove swap_lock"), which causes this issue. >>>> >>>> I'd suggest to add the following words in commit log: >>>> >>>> Check on ->cleared and update on both ->cleared and ->word need to be >>>> done atomically, and using spinlock could be the simplest solution. >>>> >>>> Otherwise, the patch looks fine for me. >>> >>> Maybe I'm noob, but I'm confused how can this fix the problem, looks >>> like the race condition doesn't change. 
>>> >>> In sbitmap_find_bit_in_word: >>> >>> 1) __sbitmap_get_word read word; >>> 2) sbitmap_deferred_clear clear cleared; >>> 3) sbitmap_deferred_clear update word; >>> >>> 2) and 3) are done atomically while 1) can still concurrent with 3): >>> >>> t1: >>> sbitmap_find_bit_in_word >>> __sbitmap_get_word >>> -> read old word, return -1 >>> t2: >>> sbitmap_find_bit_in_word >>> __sbitmap_get_word >>> -> read old word, return -1 >>> sbitmap_deferred_clear >>> -> clear cleared and update word >>> sbitmap_deferred_clear >>> -> cleared is cleared, fail >>> >>> BYW, I still think it's fine to fix this problem by trying the >>> __sbitmap_get_word() at least one more time if __sbitmap_get_word() >>> failed. >> >> How about this one: >> 1. Add extra check in sbitmap_find_bit_in_word() referenced from >> Yu kuai's suggestion. >> 2. Change from atomic_long_andnot to atomic_long_fetch_andnot_release >> >> --- >> include/linux/sbitmap.h | 5 +++++ >> lib/sbitmap.c | 23 ++++++++++++++++++----- >> 2 files changed, 23 insertions(+), 5 deletions(-) >> >> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h >> index d662cf136021..ec0b0e73c906 100644 >> --- a/include/linux/sbitmap.h >> +++ b/include/linux/sbitmap.h >> @@ -36,6 +36,11 @@ struct sbitmap_word { >> * @cleared: word holding cleared bits >> */ >> unsigned long cleared ____cacheline_aligned_in_smp; >> + >> + /** >> + * @swap_lock: Held while swapping word <-> cleared >> + */ >> + spinlock_t swap_lock; >> } ____cacheline_aligned_in_smp; >> >> /** >> diff --git a/lib/sbitmap.c b/lib/sbitmap.c >> index 1e453f825c05..63dadf91e40b 100644 >> --- a/lib/sbitmap.c >> +++ b/lib/sbitmap.c >> @@ -63,9 +63,13 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb, >> static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) >> { >> unsigned long mask; >> + bool ret = false; >> + unsigned long flags; >> >> - if (!READ_ONCE(map->cleared)) >> - return false; >> + spin_lock_irqsave(&map->swap_lock, flags); 
>> + >> + if (!map->cleared) >> + goto out_unlock; >> >> /* >> * First get a stable cleared mask, setting the old mask to 0. >> @@ -75,9 +79,12 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map) >> /* >> * Now clear the masked bits in our free word >> */ >> - atomic_long_andnot(mask, (atomic_long_t *)&map->word); >> + atomic_long_fetch_andnot_release(mask, (atomic_long_t *)&map->word); >> BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word)); >> - return true; >> + ret = true; >> +out_unlock: >> + spin_unlock_irqrestore(&map->swap_lock, flags); >> + return ret; >> } >> >> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, >> @@ -85,6 +92,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, >> bool alloc_hint) >> { >> unsigned int bits_per_word; >> + int i; >> >> if (shift < 0) >> shift = sbitmap_calculate_shift(depth); >> @@ -116,6 +124,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift, >> return -ENOMEM; >> } >> >> + for (i = 0; i < sb->map_nr; i++) >> + spin_lock_init(&sb->map[i].swap_lock); >> + >> return 0; >> } >> EXPORT_SYMBOL_GPL(sbitmap_init_node); >> @@ -175,11 +186,13 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map, >> int nr; >> >> do { >> + unsigned long cleared = READ_ONCE(map->cleared); >> + > > ->cleared is stored in standalone cacheline, the above line adds one extra l1 > load, so I still prefer to check ->word & ->cleared in sbitmap_deferred_clear(). Yes, it's very reasonable. I have sent V3, which does the check in sbitmap_deferred_clear() referenced from your suggestion. + if (!map->cleared) { + ret = find_first_zero_bit(&map->word, depth) >= depth ? false : true; + goto out_unlock; + } https://lore.kernel.org/linux-block/20240606135723.2794-1-yang.yang@vivo.com/T/#u Thanks.
On 6/4/24 08:03, YangYang wrote: > On 2024/6/4 14:12, Yu Kuai wrote: >> Hi, >> >> 在 2024/06/04 11:25, Ming Lei 写道: >>> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>>> >>>> Configuration for sbq: >>>> depth=64, wake_batch=6, shift=6, map_nr=1 >>>> >>>> 1. There are 64 requests in progress: >>>> map->word = 0xFFFFFFFFFFFFFFFF >>>> 2. After all the 64 requests complete, and no more requests come: >>>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>>> 3. Now two tasks try to allocate requests: >>>> T1: T2: >>>> __blk_mq_get_tag . >>>> __sbitmap_queue_get . >>>> sbitmap_get . >>>> sbitmap_find_bit . >>>> sbitmap_find_bit_in_word . >>>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>>> sbitmap_deferred_clear __sbitmap_queue_get >>>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>>> return false; __sbitmap_get_word -> nr=-1 >>>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>>> atomic_long_andnot() /* map->cleared=0 */ >>>> if (!(map->cleared)) >>>> return false; >>>> /* >>>> * map->cleared is cleared by T1 >>>> * T2 fail to acquire the tag >>>> */ >>>> >>>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>>> up due to the wake_batch being set at 6. If no more requests come, T1 >>>> will wait here indefinitely. >>>> >>>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>>> remove swap_lock"), which causes this issue. >>> >>> I'd suggest to add the following words in commit log: >>> >>> Check on ->cleared and update on both ->cleared and ->word need to be >>> done atomically, and using spinlock could be the simplest solution. >>> >>> Otherwise, the patch looks fine for me. >> >> Maybe I'm noob, but I'm confused how can this fix the problem, looks >> like the race condition doesn't change. 
>> >> In sbitmap_find_bit_in_word: >> >> 1) __sbitmap_get_word read word; >> 2) sbitmap_deferred_clear clear cleared; >> 3) sbitmap_deferred_clear update word; >> >> 2) and 3) are done atomically while 1) can still concurrent with 3): >> >> t1: >> sbitmap_find_bit_in_word >> __sbitmap_get_word >> -> read old word, return -1 > t2: >> sbitmap_find_bit_in_word >> __sbitmap_get_word >> -> read old word, return -1 >> sbitmap_deferred_clear >> -> clear cleared and update word >> sbitmap_deferred_clear >> -> cleared is cleared, fail > > Yes, you are right, this patch cannot fix this issue. One other alternative is to kill ->cleared. It's not immediately clear how important it is. Do we have any numbers? >> BYW, I still think it's fine to fix this problem by trying the >> __sbitmap_get_word() at least one more time if __sbitmap_get_word() >> failed. > > Err, after trying one more time __sbitmap_get_word() may still fail.
On 2024/6/7 20:59, Pavel Begunkov wrote: > On 6/4/24 08:03, YangYang wrote: >> On 2024/6/4 14:12, Yu Kuai wrote: >>> Hi, >>> >>> 在 2024/06/04 11:25, Ming Lei 写道: >>>> On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: >>>>> >>>>> Configuration for sbq: >>>>> depth=64, wake_batch=6, shift=6, map_nr=1 >>>>> >>>>> 1. There are 64 requests in progress: >>>>> map->word = 0xFFFFFFFFFFFFFFFF >>>>> 2. After all the 64 requests complete, and no more requests come: >>>>> map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF >>>>> 3. Now two tasks try to allocate requests: >>>>> T1: T2: >>>>> __blk_mq_get_tag . >>>>> __sbitmap_queue_get . >>>>> sbitmap_get . >>>>> sbitmap_find_bit . >>>>> sbitmap_find_bit_in_word . >>>>> __sbitmap_get_word -> nr=-1 __blk_mq_get_tag >>>>> sbitmap_deferred_clear __sbitmap_queue_get >>>>> /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit >>>>> if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word >>>>> return false; __sbitmap_get_word -> nr=-1 >>>>> mask = xchg(&map->cleared, 0) sbitmap_deferred_clear >>>>> atomic_long_andnot() /* map->cleared=0 */ >>>>> if (!(map->cleared)) >>>>> return false; >>>>> /* >>>>> * map->cleared is cleared by T1 >>>>> * T2 fail to acquire the tag >>>>> */ >>>>> >>>>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken >>>>> up due to the wake_batch being set at 6. If no more requests come, T1 >>>>> will wait here indefinitely. >>>>> >>>>> To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: >>>>> remove swap_lock"), which causes this issue. >>>> >>>> I'd suggest to add the following words in commit log: >>>> >>>> Check on ->cleared and update on both ->cleared and ->word need to be >>>> done atomically, and using spinlock could be the simplest solution. >>>> >>>> Otherwise, the patch looks fine for me. >>> >>> Maybe I'm noob, but I'm confused how can this fix the problem, looks >>> like the race condition doesn't change. 
>>> >>> In sbitmap_find_bit_in_word: >>> >>> 1) __sbitmap_get_word read word; >>> 2) sbitmap_deferred_clear clear cleared; >>> 3) sbitmap_deferred_clear update word; >>> >>> 2) and 3) are done atomically while 1) can still concurrent with 3): >>> >>> t1: >>> sbitmap_find_bit_in_word >>> __sbitmap_get_word >>> -> read old word, return -1 > t2: >>> sbitmap_find_bit_in_word >>> __sbitmap_get_word >>> -> read old word, return -1 >>> sbitmap_deferred_clear >>> -> clear cleared and update word >>> sbitmap_deferred_clear >>> -> cleared is cleared, fail >> >> Yes, you are right, this patch cannot fix this issue. > > One other alternative is to kill ->cleared. It's not > immediately clear how important it is. Do we have any > numbers? Sorry, I can't get it. Are you suggesting to remove ->cleared from struct sbitmap_word entirely? Thanks. > >>> BYW, I still think it's fine to fix this problem by trying the >>> __sbitmap_get_word() at least one more time if __sbitmap_get_word() >>> failed. >> >> Err, after trying one more time __sbitmap_get_word() may still fail. >
On Fri, Jun 07, 2024 at 01:59:07PM +0100, Pavel Begunkov wrote: > On 6/4/24 08:03, YangYang wrote: > > On 2024/6/4 14:12, Yu Kuai wrote: > > > Hi, > > > > > > 在 2024/06/04 11:25, Ming Lei 写道: > > > > On Tue, Jun 4, 2024 at 11:12 AM Yang Yang <yang.yang@vivo.com> wrote: > > > > > > > > > > Configuration for sbq: > > > > > depth=64, wake_batch=6, shift=6, map_nr=1 > > > > > > > > > > 1. There are 64 requests in progress: > > > > > map->word = 0xFFFFFFFFFFFFFFFF > > > > > 2. After all the 64 requests complete, and no more requests come: > > > > > map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF > > > > > 3. Now two tasks try to allocate requests: > > > > > T1: T2: > > > > > __blk_mq_get_tag . > > > > > __sbitmap_queue_get . > > > > > sbitmap_get . > > > > > sbitmap_find_bit . > > > > > sbitmap_find_bit_in_word . > > > > > __sbitmap_get_word -> nr=-1 __blk_mq_get_tag > > > > > sbitmap_deferred_clear __sbitmap_queue_get > > > > > /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit > > > > > if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word > > > > > return false; __sbitmap_get_word -> nr=-1 > > > > > mask = xchg(&map->cleared, 0) sbitmap_deferred_clear > > > > > atomic_long_andnot() /* map->cleared=0 */ > > > > > if (!(map->cleared)) > > > > > return false; > > > > > /* > > > > > * map->cleared is cleared by T1 > > > > > * T2 fail to acquire the tag > > > > > */ > > > > > > > > > > 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken > > > > > up due to the wake_batch being set at 6. If no more requests come, T1 > > > > > will wait here indefinitely. > > > > > > > > > > To fix this issue, simply revert commit 661d4f55a794 ("sbitmap: > > > > > remove swap_lock"), which causes this issue. > > > > > > > > I'd suggest to add the following words in commit log: > > > > > > > > Check on ->cleared and update on both ->cleared and ->word need to be > > > > done atomically, and using spinlock could be the simplest solution. 
> > > > > > > > Otherwise, the patch looks fine for me. > > > > > > Maybe I'm noob, but I'm confused how can this fix the problem, looks > > > like the race condition doesn't change. > > > > > > In sbitmap_find_bit_in_word: > > > > > > 1) __sbitmap_get_word read word; > > > 2) sbitmap_deferred_clear clear cleared; > > > 3) sbitmap_deferred_clear update word; > > > > > > 2) and 3) are done atomically while 1) can still concurrent with 3): > > > > > > t1: > > > sbitmap_find_bit_in_word > > > __sbitmap_get_word > > > -> read old word, return -1 > t2: > > > sbitmap_find_bit_in_word > > > __sbitmap_get_word > > > -> read old word, return -1 > > > sbitmap_deferred_clear > > > -> clear cleared and update word > > > sbitmap_deferred_clear > > > -> cleared is cleared, fail > > > > Yes, you are right, this patch cannot fix this issue. > > One other alternative is to kill ->cleared. It's not > immediately clear how important it is. Do we have any > numbers? Please see commit ea86ea2cdced ("sbitmap: ammortize cost of clearing bits"). ``` In a threaded poll test case, half the overhead of getting and clearing tags is removed with this change. ``` Thanks, Ming
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d662cf136021..ec0b0e73c906 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -36,6 +36,11 @@ struct sbitmap_word {
 	 * @cleared: word holding cleared bits
 	 */
 	unsigned long cleared ____cacheline_aligned_in_smp;
+
+	/**
+	 * @swap_lock: Held while swapping word <-> cleared
+	 */
+	spinlock_t swap_lock;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 1e453f825c05..dee02a0266a6 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -63,9 +63,13 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb,
 static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
 {
 	unsigned long mask;
+	bool ret = false;
+	unsigned long flags;
 
-	if (!READ_ONCE(map->cleared))
-		return false;
+	spin_lock_irqsave(&map->swap_lock, flags);
+
+	if (!map->cleared)
+		goto out_unlock;
 
 	/*
 	 * First get a stable cleared mask, setting the old mask to 0.
@@ -77,7 +81,10 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
 	 */
 	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
 	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
-	return true;
+	ret = true;
+out_unlock:
+	spin_unlock_irqrestore(&map->swap_lock, flags);
+	return ret;
 }
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
@@ -85,6 +92,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
+	int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -116,6 +124,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
+	for (i = 0; i < sb->map_nr; i++)
+		spin_lock_init(&sb->map[i].swap_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
Configuration for sbq:
  depth=64, wake_batch=6, shift=6, map_nr=1

1. There are 64 requests in progress:
   map->word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
   map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
   T1:                                       T2:
   __blk_mq_get_tag                          .
   __sbitmap_queue_get                       .
   sbitmap_get                               .
   sbitmap_find_bit                          .
   sbitmap_find_bit_in_word                  .
   __sbitmap_get_word -> nr=-1               __blk_mq_get_tag
   sbitmap_deferred_clear                    __sbitmap_queue_get
   /* map->cleared=0xFFFFFFFFFFFFFFFF */     sbitmap_find_bit
   if (!READ_ONCE(map->cleared))             sbitmap_find_bit_in_word
       return false;                         __sbitmap_get_word -> nr=-1
   mask = xchg(&map->cleared, 0)             sbitmap_deferred_clear
   atomic_long_andnot()                      /* map->cleared=0 */
                                             if (!(map->cleared))
                                                 return false;
                                             /*
                                              * map->cleared is cleared by T1,
                                              * so T2 fails to acquire a tag.
                                              */
4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
   up because wake_batch is set to 6. If no more requests come, T2 will
   wait here indefinitely.

To fix this issue, simply revert commit 661d4f55a794 ("sbitmap:
remove swap_lock"), which introduced it.

Fixes: 661d4f55a794 ("sbitmap: remove swap_lock")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
---
Changes from v1:
- simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
---
 include/linux/sbitmap.h |  5 +++++
 lib/sbitmap.c           | 17 ++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)