Message ID: 20240606135723.2794-1-yang.yang@vivo.com
State:      New, archived
Series:     [v3] sbitmap: fix io hung due to race on sbitmap_word::cleared
On Thu, Jun 06, 2024 at 09:57:21PM +0800, Yang Yang wrote:
> Configuration for sbq:
>   depth=64, wake_batch=6, shift=6, map_nr=1
>
> 1. There are 64 requests in progress:
>    map->word = 0xFFFFFFFFFFFFFFFF
> 2. After all the 64 requests complete, and no more requests come:
>    map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
> 3. Now two tasks try to allocate requests:
>    T1:                                     T2:
>    __blk_mq_get_tag                        .
>    __sbitmap_queue_get                     .
>    sbitmap_get                             .
>    sbitmap_find_bit                        .
>    sbitmap_find_bit_in_word                .
>    __sbitmap_get_word -> nr=-1             __blk_mq_get_tag
>    sbitmap_deferred_clear                  __sbitmap_queue_get
>    /* map->cleared=0xFFFFFFFFFFFFFFFF */   sbitmap_find_bit
>    if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
>        return false;                       __sbitmap_get_word -> nr=-1
>    mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
>    atomic_long_andnot()                    /* map->cleared=0 */
>                                            if (!(map->cleared))
>                                                return false;
>                                            /*
>                                             * map->cleared is cleared by T1,
>                                             * T2 fails to acquire the tag
>                                             */
>
> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
>    up due to the wake_batch being set at 6. If no more requests come,
>    T2 will wait here indefinitely.
>
> This patch achieves two purposes:
> First:
> Check on ->cleared and update on both ->cleared and ->word need to be
> done atomically, and using a spinlock could be the simplest solution.
> So revert commit 661d4f55a794 ("sbitmap: remove swap_lock"), which
> may cause a potential race.
>
> Second:
> Add extra check in sbitmap_deferred_clear(), to identify whether
> map->cleared is cleared by another task after failing to get a tag.
>
> Fixes: 661d4f55a794 ("sbitmap: remove swap_lock")
> Signed-off-by: Yang Yang <yang.yang@vivo.com>
>
> ---
> Changes from v2:
> - Modify commit message by suggestion
> - Add extra check in sbitmap_deferred_clear() by suggestion
> Changes from v1:
> - simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
> ---
>  include/linux/sbitmap.h |  5 +++++
>  lib/sbitmap.c           | 28 +++++++++++++++++++++-------
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index d662cf136021..ec0b0e73c906 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -36,6 +36,11 @@ struct sbitmap_word {
>  	 * @cleared: word holding cleared bits
>  	 */
>  	unsigned long cleared ____cacheline_aligned_in_smp;
> +
> +	/**
> +	 * @swap_lock: Held while swapping word <-> cleared
> +	 */
> +	spinlock_t swap_lock;
>  } ____cacheline_aligned_in_smp;
>
>  /**
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 1e453f825c05..06b837311e03 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -60,12 +60,19 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb,
>  /*
>   * See if we have deferred clears that we can batch move
>   */
> -static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
> +static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
> +		unsigned int depth)
>  {
>  	unsigned long mask;
> +	unsigned long flags;
> +	bool ret;
>
> -	if (!READ_ONCE(map->cleared))
> -		return false;
> +	spin_lock_irqsave(&map->swap_lock, flags);
> +
> +	if (!map->cleared) {
> +		ret = find_first_zero_bit(&map->word, depth) >= depth ? false : true;
> +		goto out_unlock;
> +	}

Direct check over map->word should be more efficient than find_first_zero_bit():

	if (READ_ONCE(map->word) == (1UL << depth) - 1)
		ret = false;
	else
		ret = true;

>
>  	/*
>  	 * First get a stable cleared mask, setting the old mask to 0.
> @@ -77,7 +84,10 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
>  	 */
>  	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
>  	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
> -	return true;
> +	ret = true;
> +out_unlock:
> +	spin_unlock_irqrestore(&map->swap_lock, flags);
> +	return ret;
>  }
>
>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> @@ -85,6 +95,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>  		      bool alloc_hint)
>  {
>  	unsigned int bits_per_word;
> +	int i;
>
>  	if (shift < 0)
>  		shift = sbitmap_calculate_shift(depth);
> @@ -116,6 +127,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>  		return -ENOMEM;
>  	}
>
> +	for (i = 0; i < sb->map_nr; i++)
> +		spin_lock_init(&sb->map[i].swap_lock);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(sbitmap_init_node);
> @@ -126,7 +140,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>  	unsigned int i;
>
>  	for (i = 0; i < sb->map_nr; i++)
> -		sbitmap_deferred_clear(&sb->map[i]);
> +		sbitmap_deferred_clear(&sb->map[i], depth);

The above 'depth' is the whole sbitmap depth, and you need to figure out
the word depth from __map_depth().


Thanks,
Ming
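[Editorial note: a minimal sketch, not the posted v3 patch, of how the early-exit branch of sbitmap_deferred_clear() could read with the direct comparison suggested above. It assumes the v3 swap_lock structure and that 'depth' is the per-word depth and stays below BITS_PER_LONG, so the shift cannot overflow; the follow-up below drops that assumption.]

	/*
	 * Sketch only: direct word comparison in place of find_first_zero_bit().
	 */
	static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
						  unsigned int depth)
	{
		unsigned long mask;
		unsigned long flags;
		bool ret;

		spin_lock_irqsave(&map->swap_lock, flags);

		if (!map->cleared) {
			/* Nothing deferred: succeed only if the word still has a free bit. */
			ret = READ_ONCE(map->word) != (1UL << depth) - 1;
			goto out_unlock;
		}

		/* First get a stable cleared mask, setting the old mask to 0. */
		mask = xchg(&map->cleared, 0);

		/* Now clear the masked bits in our free word. */
		atomic_long_andnot(mask, (atomic_long_t *)&map->word);
		BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
		ret = true;
	out_unlock:
		spin_unlock_irqrestore(&map->swap_lock, flags);
		return ret;
	}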
On 2024/6/7 17:44, Ming Lei wrote:
> On Thu, Jun 06, 2024 at 09:57:21PM +0800, Yang Yang wrote:
>> Configuration for sbq:
>>   depth=64, wake_batch=6, shift=6, map_nr=1
>>
>> 1. There are 64 requests in progress:
>>    map->word = 0xFFFFFFFFFFFFFFFF
>> 2. After all the 64 requests complete, and no more requests come:
>>    map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
>> 3. Now two tasks try to allocate requests:
>>    T1:                                     T2:
>>    __blk_mq_get_tag                        .
>>    __sbitmap_queue_get                     .
>>    sbitmap_get                             .
>>    sbitmap_find_bit                        .
>>    sbitmap_find_bit_in_word                .
>>    __sbitmap_get_word -> nr=-1             __blk_mq_get_tag
>>    sbitmap_deferred_clear                  __sbitmap_queue_get
>>    /* map->cleared=0xFFFFFFFFFFFFFFFF */   sbitmap_find_bit
>>    if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
>>        return false;                       __sbitmap_get_word -> nr=-1
>>    mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
>>    atomic_long_andnot()                    /* map->cleared=0 */
>>                                            if (!(map->cleared))
>>                                                return false;
>>                                            /*
>>                                             * map->cleared is cleared by T1,
>>                                             * T2 fails to acquire the tag
>>                                             */
>>
>> 4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
>>    up due to the wake_batch being set at 6. If no more requests come,
>>    T2 will wait here indefinitely.
>>
>> This patch achieves two purposes:
>> First:
>> Check on ->cleared and update on both ->cleared and ->word need to be
>> done atomically, and using a spinlock could be the simplest solution.
>> So revert commit 661d4f55a794 ("sbitmap: remove swap_lock"), which
>> may cause a potential race.
>>
>> Second:
>> Add extra check in sbitmap_deferred_clear(), to identify whether
>> map->cleared is cleared by another task after failing to get a tag.
>>
>> Fixes: 661d4f55a794 ("sbitmap: remove swap_lock")
>> Signed-off-by: Yang Yang <yang.yang@vivo.com>
>>
>> ---
>> Changes from v2:
>> - Modify commit message by suggestion
>> - Add extra check in sbitmap_deferred_clear() by suggestion
>> Changes from v1:
>> - simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
>> ---
>>  include/linux/sbitmap.h |  5 +++++
>>  lib/sbitmap.c           | 28 +++++++++++++++++++++-------
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index d662cf136021..ec0b0e73c906 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -36,6 +36,11 @@ struct sbitmap_word {
>>  	 * @cleared: word holding cleared bits
>>  	 */
>>  	unsigned long cleared ____cacheline_aligned_in_smp;
>> +
>> +	/**
>> +	 * @swap_lock: Held while swapping word <-> cleared
>> +	 */
>> +	spinlock_t swap_lock;
>>  } ____cacheline_aligned_in_smp;
>>
>>  /**
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 1e453f825c05..06b837311e03 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -60,12 +60,19 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb,
>>  /*
>>   * See if we have deferred clears that we can batch move
>>   */
>> -static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
>> +static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
>> +		unsigned int depth)
>>  {
>>  	unsigned long mask;
>> +	unsigned long flags;
>> +	bool ret;
>>
>> -	if (!READ_ONCE(map->cleared))
>> -		return false;
>> +	spin_lock_irqsave(&map->swap_lock, flags);
>> +
>> +	if (!map->cleared) {
>> +		ret = find_first_zero_bit(&map->word, depth) >= depth ? false : true;
>> +		goto out_unlock;
>> +	}
>
> Direct check over map->word should be more efficient than find_first_zero_bit():
>
> 	if (READ_ONCE(map->word) == (1UL << depth) - 1)
> 		ret = false;
> 	else
> 		ret = true;

Got it. Slightly modified as follows, because:
1. The depth may be shallow_depth, which is less than __map_depth()
2. (1UL << depth) may overflow

#if BITS_PER_LONG == 64
	unsigned long word_mask = U64_MAX >> (BITS_PER_LONG - depth);
#else
	unsigned long word_mask = U32_MAX >> (BITS_PER_LONG - depth);
#endif

	if ((READ_ONCE(map->word) & word_mask) == word_mask)
		ret = false;
	else
		ret = true;

>>
>>  	/*
>>  	 * First get a stable cleared mask, setting the old mask to 0.
>> @@ -77,7 +84,10 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
>>  	 */
>>  	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
>>  	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
>> -	return true;
>> +	ret = true;
>> +out_unlock:
>> +	spin_unlock_irqrestore(&map->swap_lock, flags);
>> +	return ret;
>>  }
>>
>>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>> @@ -85,6 +95,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>  		      bool alloc_hint)
>>  {
>>  	unsigned int bits_per_word;
>> +	int i;
>>
>>  	if (shift < 0)
>>  		shift = sbitmap_calculate_shift(depth);
>> @@ -116,6 +127,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>  		return -ENOMEM;
>>  	}
>>
>> +	for (i = 0; i < sb->map_nr; i++)
>> +		spin_lock_init(&sb->map[i].swap_lock);
>> +
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(sbitmap_init_node);
>> @@ -126,7 +140,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>>  	unsigned int i;
>>
>>  	for (i = 0; i < sb->map_nr; i++)
>> -		sbitmap_deferred_clear(&sb->map[i]);
>> +		sbitmap_deferred_clear(&sb->map[i], depth);
>
> The above 'depth' is the whole sbitmap depth, and you need to figure out
> the word depth from __map_depth().

Got it. Thanks.
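[Editorial note: putting the second review point into code as well, a sketch, again not the posted patch, of how sbitmap_resize() might pass each word's own depth using the existing __map_depth() helper from include/linux/sbitmap.h rather than the whole-bitmap depth.]

	/*
	 * Sketch only: clear each word against its own depth via __map_depth(),
	 * which returns the number of bits backed by word i.
	 */
	void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
	{
		unsigned int bits_per_word = 1U << sb->shift;
		unsigned int i;

		for (i = 0; i < sb->map_nr; i++)
			sbitmap_deferred_clear(&sb->map[i], __map_depth(sb, i));

		sb->depth = depth;
		sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
	}
	EXPORT_SYMBOL_GPL(sbitmap_resize);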
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d662cf136021..ec0b0e73c906 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -36,6 +36,11 @@ struct sbitmap_word {
 	 * @cleared: word holding cleared bits
 	 */
 	unsigned long cleared ____cacheline_aligned_in_smp;
+
+	/**
+	 * @swap_lock: Held while swapping word <-> cleared
+	 */
+	spinlock_t swap_lock;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 1e453f825c05..06b837311e03 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -60,12 +60,19 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb,
 /*
  * See if we have deferred clears that we can batch move
  */
-static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
+static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
+		unsigned int depth)
 {
 	unsigned long mask;
+	unsigned long flags;
+	bool ret;
 
-	if (!READ_ONCE(map->cleared))
-		return false;
+	spin_lock_irqsave(&map->swap_lock, flags);
+
+	if (!map->cleared) {
+		ret = find_first_zero_bit(&map->word, depth) >= depth ? false : true;
+		goto out_unlock;
+	}
 
 	/*
 	 * First get a stable cleared mask, setting the old mask to 0.
@@ -77,7 +84,10 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
 	 */
 	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
 	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
-	return true;
+	ret = true;
+out_unlock:
+	spin_unlock_irqrestore(&map->swap_lock, flags);
+	return ret;
 }
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
@@ -85,6 +95,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
+	int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -116,6 +127,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
+	for (i = 0; i < sb->map_nr; i++)
+		spin_lock_init(&sb->map[i].swap_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -126,7 +140,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++)
-		sbitmap_deferred_clear(&sb->map[i]);
+		sbitmap_deferred_clear(&sb->map[i], depth);
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -179,7 +193,7 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
 					alloc_hint, wrap);
 		if (nr != -1)
 			break;
-		if (!sbitmap_deferred_clear(map))
+		if (!sbitmap_deferred_clear(map, depth))
 			break;
 	} while (1);
 
@@ -496,7 +510,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 		unsigned int map_depth = __map_depth(sb, index);
 		unsigned long val;
 
-		sbitmap_deferred_clear(map);
+		sbitmap_deferred_clear(map, map_depth);
 		val = READ_ONCE(map->word);
 		if (val == (1UL << (map_depth - 1)) - 1)
 			goto next;
Configuration for sbq:
  depth=64, wake_batch=6, shift=6, map_nr=1

1. There are 64 requests in progress:
   map->word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
   map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
   T1:                                     T2:
   __blk_mq_get_tag                        .
   __sbitmap_queue_get                     .
   sbitmap_get                             .
   sbitmap_find_bit                        .
   sbitmap_find_bit_in_word                .
   __sbitmap_get_word -> nr=-1             __blk_mq_get_tag
   sbitmap_deferred_clear                  __sbitmap_queue_get
   /* map->cleared=0xFFFFFFFFFFFFFFFF */   sbitmap_find_bit
   if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
       return false;                       __sbitmap_get_word -> nr=-1
   mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
   atomic_long_andnot()                    /* map->cleared=0 */
                                           if (!(map->cleared))
                                               return false;
                                           /*
                                            * map->cleared is cleared by T1,
                                            * T2 fails to acquire the tag
                                            */

4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
   up due to the wake_batch being set at 6. If no more requests come,
   T2 will wait here indefinitely.

This patch achieves two purposes:
First:
Check on ->cleared and update on both ->cleared and ->word need to be
done atomically, and using a spinlock could be the simplest solution.
So revert commit 661d4f55a794 ("sbitmap: remove swap_lock"), which
may cause a potential race.

Second:
Add extra check in sbitmap_deferred_clear(), to identify whether
map->cleared is cleared by another task after failing to get a tag.

Fixes: 661d4f55a794 ("sbitmap: remove swap_lock")
Signed-off-by: Yang Yang <yang.yang@vivo.com>

---
Changes from v2:
- Modify commit message by suggestion
- Add extra check in sbitmap_deferred_clear() by suggestion
Changes from v1:
- simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
---
 include/linux/sbitmap.h |  5 +++++
 lib/sbitmap.c           | 28 +++++++++++++++++++++-------
 2 files changed, 26 insertions(+), 7 deletions(-)
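[Editorial note: for context on the "extra check", below is the allocation retry loop from sbitmap_find_bit_in_word() in lib/sbitmap.c as it looks with this patch applied, with comments added here to show why reporting true while ->word still has free bits lets the blocked waiter retry instead of sleeping forever.]

	do {
		nr = __sbitmap_get_word(&map->word, depth,
					alloc_hint, wrap);
		if (nr != -1)
			break;
		/*
		 * With the extra check, this also returns true when ->cleared
		 * is empty but ->word still has a free bit (e.g. another task
		 * already moved the deferred bits over), so T2 loops back and
		 * picks up the bits T1 just freed instead of giving up.
		 */
		if (!sbitmap_deferred_clear(map, depth))
			break;	/* word is truly full: give up and wait */
	} while (1);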