Message ID | 6f136c90-faa9-4bc0-b02f-3a112b4d8360@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Hello, Joseph.

On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
> writeback kworker
> blkcg_bio_issue_check
>   rcu_read_lock
>   blkg_lookup
>   <<< *race window*
>   blk_throtl_bio
>     spin_lock_irq(q->queue_lock)
>     spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
>
> cgroup_rmdir
>   cgroup_destroy_locked
>     kill_css
>       css_killed_ref_fn
>       css_killed_work_fn
>         offline_css
>           blkcg_css_offline
>             spin_trylock(q->queue_lock)
>             blkg_destroy
>             spin_unlock(q->queue_lock)

Ah, right.  Thanks for spotting the bug.

> Since rcu can only prevent the blkg from being released while it is in
> use, blkg->refcnt can drop to 0 during blkg_destroy, which schedules
> the blkg release.
> Then the blkg_get in blk_throtl_bio triggers the WARNING, and the
> corresponding blkg_put schedules the blkg release again, which results
> in a double free.
> This race was introduced by commit ae1188963611 ("blkcg: consolidate
> blkg creation in blkcg_bio_issue_check()"). Before that commit, the
> code looked up the blkg first and then looked it up/created it again
> under queue_lock. So revive this logic to fix the race.

The change seems a bit drastic to me.  Can't we do something like the
following instead?

	blk_throtl_bio()
	{
		... non throttled cases ...

		/* out-of-limit, queue to @tg */

		/*
		 * We can look up and retry but the race window is tiny here.
		 * Just letting it through should be good enough.
		 */
		if (!css_tryget(blkcg->css))
			goto out;

		... actual queueing ...
		css_put(blkcg->css);
		...
	}

Thanks.
Hi Tejun,
Thanks very much for reviewing this patch.

On 18/2/8 05:38, Tejun Heo wrote:
> Hello, Joseph.
>
> On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
>> writeback kworker
>> blkcg_bio_issue_check
>>   rcu_read_lock
>>   blkg_lookup
>>   <<< *race window*
>>   blk_throtl_bio
>>     spin_lock_irq(q->queue_lock)
>>     spin_unlock_irq(q->queue_lock)
>>   rcu_read_unlock
>>
>> cgroup_rmdir
>>   cgroup_destroy_locked
>>     kill_css
>>       css_killed_ref_fn
>>       css_killed_work_fn
>>         offline_css
>>           blkcg_css_offline
>>             spin_trylock(q->queue_lock)
>>             blkg_destroy
>>             spin_unlock(q->queue_lock)
>
> Ah, right.  Thanks for spotting the bug.
>
>> Since rcu can only prevent the blkg from being released while it is in
>> use, blkg->refcnt can drop to 0 during blkg_destroy, which schedules
>> the blkg release.
>> Then the blkg_get in blk_throtl_bio triggers the WARNING, and the
>> corresponding blkg_put schedules the blkg release again, which results
>> in a double free.
>> This race was introduced by commit ae1188963611 ("blkcg: consolidate
>> blkg creation in blkcg_bio_issue_check()"). Before that commit, the
>> code looked up the blkg first and then looked it up/created it again
>> under queue_lock. So revive this logic to fix the race.
>
> The change seems a bit drastic to me.  Can't we do something like the
> following instead?
>
> 	blk_throtl_bio()
> 	{
> 		... non throttled cases ...
>
> 		/* out-of-limit, queue to @tg */
>
> 		/*
> 		 * We can look up and retry but the race window is tiny here.
> 		 * Just letting it through should be good enough.
> 		 */
> 		if (!css_tryget(blkcg->css))
> 			goto out;
>
> 		... actual queueing ...
> 		css_put(blkcg->css);
> 		...
> 	}

So you mean checking css->refcnt to prevent the further use of
blkg_get? I think it makes sense.
IMO, we should use css_tryget_online instead, and right after taking
queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
in the future. Actually it already has two now. One is in
blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
What do you think of this?

Thanks,
Joseph
Hello, Joseph.

On Thu, Feb 08, 2018 at 10:29:43AM +0800, Joseph Qi wrote:
> So you mean checking css->refcnt to prevent the further use of
> blkg_get? I think it makes sense.

Yes.

> IMO, we should use css_tryget_online instead, and right after taking

Not really.  An offline css still can have a vast amount of IO to
drain and write out.

> queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
> in the future. Actually it already has two now. One is in
> blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
> What do you think of this?

As long as we avoid making the bypass paths expensive, whatever makes
the code easier to read and maintain would be better.  css ref ops are
extremely cheap anyway.

Thanks.
Hi Tejun,

On 18/2/8 23:23, Tejun Heo wrote:
> Hello, Joseph.
>
> On Thu, Feb 08, 2018 at 10:29:43AM +0800, Joseph Qi wrote:
>> So you mean checking css->refcnt to prevent the further use of
>> blkg_get? I think it makes sense.
>
> Yes.
>
>> IMO, we should use css_tryget_online instead, and right after taking
>
> Not really.  An offline css still can have a vast amount of IO to
> drain and write out.
>
IIUC, we have to identify that blkcg_css_offline, which does the
blkg_put, is now in progress. Since percpu_ref_kill_and_confirm in
kill_css sets the __PERCPU_REF_DEAD flag, we can use this to avoid the
race. IOW, if __PERCPU_REF_DEAD is already set, we know the blkcg css
is going offline and continuing to access the blkg may risk a double
free. Thus we choose to skip these ios.
I don't get how css_tryget works here since it doesn't check the
__PERCPU_REF_DEAD flag. Also css_tryget can't prevent the blkcg css
from going offline, since the race happens while blkcg_css_offline is
in progress.
Am I missing something here?

Thanks,
Joseph

>> queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
>> in the future. Actually it already has two now. One is in
>> blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
>> What do you think of this?
>
> As long as we avoid making the bypass paths expensive, whatever makes
> the code easier to read and maintain would be better.  css ref ops are
> extremely cheap anyway.
>
> Thanks.
>
Hello, Joseph.

On Fri, Feb 09, 2018 at 10:15:19AM +0800, Joseph Qi wrote:
> IIUC, we have to identify that blkcg_css_offline, which does the
> blkg_put, is now in progress. Since percpu_ref_kill_and_confirm in
> kill_css sets the __PERCPU_REF_DEAD flag, we can use this to avoid the
> race. IOW, if __PERCPU_REF_DEAD is already set, we know the blkcg css
> is going offline and continuing to access the blkg may risk a double
> free. Thus we choose to skip these ios.
> I don't get how css_tryget works here since it doesn't check the
> __PERCPU_REF_DEAD flag. Also css_tryget can't prevent the blkcg css
> from going offline, since the race happens while blkcg_css_offline is
> in progress.
> Am I missing something here?

Once marked dead, the ref is in atomic mode and css_tryget() would hit
the atomic counter.  Here, we don't care about the offlining and
draining.  A draining memcg can still have a lot of memory to be
written back attached to it and we don't want to punt all of them to
the root cgroup.

Thanks.
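For reference, the difference the two sides keep circling here comes down to which percpu_ref helper the css wrapper calls. A paraphrased sketch of the two helpers from include/linux/cgroup.h of that era (simplified, not part of the patch under discussion):

/*
 * Paraphrased sketch -- see include/linux/cgroup.h for the real
 * definitions.
 */
static inline bool css_tryget(struct cgroup_subsys_state *css)
{
	/*
	 * percpu_ref_tryget() only checks that the count is non-zero,
	 * so it can still succeed after kill_css() has switched the
	 * ref to atomic mode.
	 */
	if (!(css->flags & CSS_NO_REF))
		return percpu_ref_tryget(&css->refcnt);
	return true;
}

static inline bool css_tryget_online(struct cgroup_subsys_state *css)
{
	/*
	 * percpu_ref_tryget_live() additionally fails once the ref has
	 * been marked dead (__PERCPU_REF_DEAD), i.e. once
	 * percpu_ref_kill_and_confirm() has run.
	 */
	if (!(css->flags & CSS_NO_REF))
		return percpu_ref_tryget_live(&css->refcnt);
	return true;
}

This is why css_tryget() can still return true while blkcg_css_offline() is running, as Joseph points out below, whereas css_tryget_online() would refuse such bios instead of throttling them.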
Hi Tejun,
Sorry for the delayed reply.

On 18/2/13 01:11, Tejun Heo wrote:
> Hello, Joseph.
>
> On Fri, Feb 09, 2018 at 10:15:19AM +0800, Joseph Qi wrote:
>> IIUC, we have to identify that blkcg_css_offline, which does the
>> blkg_put, is now in progress. Since percpu_ref_kill_and_confirm in
>> kill_css sets the __PERCPU_REF_DEAD flag, we can use this to avoid the
>> race. IOW, if __PERCPU_REF_DEAD is already set, we know the blkcg css
>> is going offline and continuing to access the blkg may risk a double
>> free. Thus we choose to skip these ios.
>> I don't get how css_tryget works here since it doesn't check the
>> __PERCPU_REF_DEAD flag. Also css_tryget can't prevent the blkcg css
>> from going offline, since the race happens while blkcg_css_offline is
>> in progress.
>> Am I missing something here?
>
> Once marked dead, the ref is in atomic mode and css_tryget() would hit
> the atomic counter.  Here, we don't care about the offlining and
> draining.  A draining memcg can still have a lot of memory to be
> written back attached to it and we don't want to punt all of them to
> the root cgroup.

I still don't get how css_tryget can work here.

The race happens when:
1) the writeback kworker has found the blkg with rcu;
2) the blkcg is being offlined and blkg_destroy() has already been called.
Then, the writeback kworker will take the queue lock and access the
blkg with refcount 0.

So, I think we should take the queue lock when looking up the blkg if
we want to throttle the ios during offlining (the way this patch
tries), or use css_tryget_online() to skip further use of the risky
blkg, which means these ios won't be throttled either.

Thanks,
Joseph
Hello,

On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
> I still don't get how css_tryget can work here.
>
> The race happens when:
> 1) the writeback kworker has found the blkg with rcu;
> 2) the blkcg is being offlined and blkg_destroy() has already been called.
> Then, the writeback kworker will take the queue lock and access the
> blkg with refcount 0.

Yeah, then tryget would fail and it should go through the root.

Thanks.
Hi Tejun,

On 2018/2/22 11:18 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>> I still don't get how css_tryget can work here.
>>
>> The race happens when:
>> 1) the writeback kworker has found the blkg with rcu;
>> 2) the blkcg is being offlined and blkg_destroy() has already been called.
>> Then, the writeback kworker will take the queue lock and access the
>> blkg with refcount 0.
>
> Yeah, then tryget would fail and it should go through the root.
>
In this race, the refcount of the blkg becomes zero and it is
destroyed. However, the css may still have a refcount, and css_tryget
can return success before other callers put their references.
So I don't get how css_tryget can fix this race. Or I wonder if we
can add another function, blkg_tryget?

Thanks,
Jiufei

> Thanks.
>
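A blkg_tryget() along the lines Jiufei suggests could look roughly like the sketch below, assuming the atomic_t blkg->refcnt that blkg_get()/blkg_put() used at the time (hypothetical, not code proposed in this thread):

/* Hypothetical sketch of the suggested helper. */
static inline bool blkg_tryget(struct blkcg_gq *blkg)
{
	/*
	 * Unlike blkg_get(), this fails once blkg_destroy() has dropped
	 * the last reference, so a racing reader never revives a blkg
	 * that is already headed for __blkg_release_rcu().
	 */
	return atomic_inc_not_zero(&blkg->refcnt);
}

The rcu read lock held in blkcg_bio_issue_check() keeps the blkg memory valid even at refcount 0, which is what makes the atomic_inc_not_zero() safe to attempt.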
Hello,

On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
> > On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
> >> I still don't get how css_tryget can work here.
> >>
> >> The race happens when:
> >> 1) the writeback kworker has found the blkg with rcu;
> >> 2) the blkcg is being offlined and blkg_destroy() has already been called.
> >> Then, the writeback kworker will take the queue lock and access the
> >> blkg with refcount 0.
> >
> > Yeah, then tryget would fail and it should go through the root.
> >
> In this race, the refcount of the blkg becomes zero and it is
> destroyed. However, the css may still have a refcount, and css_tryget
> can return success before other callers put their references.
> So I don't get how css_tryget can fix this race. Or I wonder if we
> can add another function, blkg_tryget?

IIRC, as long as the blkcg and the device are there, the blkgs aren't
gonna be destroyed.  So, if you have a ref to the blkcg through
tryget, the blkg shouldn't go away.

Thanks.
Hi Tejun,

On 18/2/23 22:23, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>> I still don't get how css_tryget can work here.
>>>>
>>>> The race happens when:
>>>> 1) the writeback kworker has found the blkg with rcu;
>>>> 2) the blkcg is being offlined and blkg_destroy() has already been called.
>>>> Then, the writeback kworker will take the queue lock and access the
>>>> blkg with refcount 0.
>>>
>>> Yeah, then tryget would fail and it should go through the root.
>>>
>> In this race, the refcount of the blkg becomes zero and it is
>> destroyed. However, the css may still have a refcount, and css_tryget
>> can return success before other callers put their references.
>> So I don't get how css_tryget can fix this race. Or I wonder if we
>> can add another function, blkg_tryget?
>
> IIRC, as long as the blkcg and the device are there, the blkgs aren't
> gonna be destroyed.  So, if you have a ref to the blkcg through
> tryget, the blkg shouldn't go away.
>

Maybe we have a misunderstanding here.

In this case, the blkg doesn't go away as we have rcu protection, but
blkg_destroy() can be called, in which blkg_put() will put the last
refcnt and then schedule __blkg_release_rcu().

The css refcnt can't prevent the blkcg css from going offline; it is
the css online_cnt that does.

css_tryget() will only get a refcnt on the blkcg css, but it can't be
guaranteed to fail when the css is confirmed to be killed.

The race sequence:
writeback kworker               cgroup_rmdir
                                  cgroup_destroy_locked
                                    kill_css
                                      css_killed_ref_fn
                                      css_killed_work_fn
                                        offline_css
                                          blkcg_css_offline
blkcg_bio_issue_check
  rcu_read_lock
  blkg_lookup
                                            spin_trylock(q->queue_lock)
                                            blkg_destroy
                                            spin_unlock(q->queue_lock)
  blk_throtl_bio
    spin_lock_irq(q->queue_lock)
    spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Thanks,
Joseph
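For context, the refcounting helpers Joseph is walking through looked roughly like this at the time (paraphrased from include/linux/blk-cgroup.h; simplified):

static inline void blkg_get(struct blkcg_gq *blkg)
{
	/*
	 * This is the WARNING the reporters hit: by the time
	 * blk_throtl_bio() gets here, blkg_destroy() has already
	 * dropped the refcnt to 0.
	 */
	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
	atomic_inc(&blkg->refcnt);
}

static inline void blkg_put(struct blkcg_gq *blkg)
{
	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
	/*
	 * 0 -> 1 -> 0 again: the matching put schedules
	 * __blkg_release_rcu() a second time, hence the double free
	 * described in the changelog.
	 */
	if (atomic_dec_and_test(&blkg->refcnt))
		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
}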
Hi Tejun,
Could you please help take a look at this and give a confirmation?

Thanks,
Joseph

On 18/2/24 09:45, Joseph Qi wrote:
> Hi Tejun,
>
> On 18/2/23 22:23, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>>> I still don't get how css_tryget can work here.
>>>>>
>>>>> The race happens when:
>>>>> 1) the writeback kworker has found the blkg with rcu;
>>>>> 2) the blkcg is being offlined and blkg_destroy() has already been called.
>>>>> Then, the writeback kworker will take the queue lock and access the
>>>>> blkg with refcount 0.
>>>>
>>>> Yeah, then tryget would fail and it should go through the root.
>>>>
>>> In this race, the refcount of the blkg becomes zero and it is
>>> destroyed. However, the css may still have a refcount, and css_tryget
>>> can return success before other callers put their references.
>>> So I don't get how css_tryget can fix this race. Or I wonder if we
>>> can add another function, blkg_tryget?
>>
>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>> gonna be destroyed.  So, if you have a ref to the blkcg through
>> tryget, the blkg shouldn't go away.
>>
>
> Maybe we have a misunderstanding here.
>
> In this case, the blkg doesn't go away as we have rcu protection, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
>
> The css refcnt can't prevent the blkcg css from going offline; it is
> the css online_cnt that does.
>
> css_tryget() will only get a refcnt on the blkcg css, but it can't be
> guaranteed to fail when the css is confirmed to be killed.
>
> The race sequence:
> writeback kworker               cgroup_rmdir
>                                   cgroup_destroy_locked
>                                     kill_css
>                                       css_killed_ref_fn
>                                       css_killed_work_fn
>                                         offline_css
>                                           blkcg_css_offline
> blkcg_bio_issue_check
>   rcu_read_lock
>   blkg_lookup
>                                             spin_trylock(q->queue_lock)
>                                             blkg_destroy
>                                             spin_unlock(q->queue_lock)
>   blk_throtl_bio
>     spin_lock_irq(q->queue_lock)
>     spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
>
> Thanks,
> Joseph
>
Hello, Joseph.

On Sat, Feb 24, 2018 at 09:45:49AM +0800, Joseph Qi wrote:
> > IIRC, as long as the blkcg and the device are there, the blkgs aren't
> > gonna be destroyed.  So, if you have a ref to the blkcg through
> > tryget, the blkg shouldn't go away.
> >
>
> Maybe we have a misunderstanding here.
>
> In this case, the blkg doesn't go away as we have rcu protection, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
>
> The css refcnt can't prevent the blkcg css from going offline; it is
> the css online_cnt that does.
>
> css_tryget() will only get a refcnt on the blkcg css, but it can't be
> guaranteed to fail when the css is confirmed to be killed.

Ah, you're right.  I was thinking we only destroy blkgs from the blkcg
release path.  Given that we primarily use blkcg refcnting to pin
them, I believe that's what we should do - ie. only call
pd_offline_fn() from the blkcg_css_offline() path and do the rest of
the destruction from blkcg_css_free().  What do you think?

Thanks.
Hi Tejun,

On 18/2/28 02:33, Tejun Heo wrote:
> Hello, Joseph.
>
> On Sat, Feb 24, 2018 at 09:45:49AM +0800, Joseph Qi wrote:
>>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>>> gonna be destroyed.  So, if you have a ref to the blkcg through
>>> tryget, the blkg shouldn't go away.
>>>
>>
>> Maybe we have a misunderstanding here.
>>
>> In this case, the blkg doesn't go away as we have rcu protection, but
>> blkg_destroy() can be called, in which blkg_put() will put the last
>> refcnt and then schedule __blkg_release_rcu().
>>
>> The css refcnt can't prevent the blkcg css from going offline; it is
>> the css online_cnt that does.
>>
>> css_tryget() will only get a refcnt on the blkcg css, but it can't be
>> guaranteed to fail when the css is confirmed to be killed.
>
> Ah, you're right.  I was thinking we only destroy blkgs from the blkcg
> release path.  Given that we primarily use blkcg refcnting to pin
> them, I believe that's what we should do - ie. only call
> pd_offline_fn() from the blkcg_css_offline() path and do the rest of
> the destruction from blkcg_css_free().  What do you think?
>

In the current code, I'm afraid pd_offline_fn() as well as the rest of
the destruction has to be called together under the same blkcg->lock
and q->queue_lock.
For example, if we split pd_offline_fn() and radix_tree_delete() into
2 phases, it may introduce a race between blkcg_deactivate_policy()
on queue exit and blkcg_css_free(), which will result in
pd_offline_fn() being called twice.

Thanks,
Joseph
Hello, Joseph.

Sorry about the late reply.

On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
> In the current code, I'm afraid pd_offline_fn() as well as the rest of
> the destruction has to be called together under the same blkcg->lock
> and q->queue_lock.
> For example, if we split pd_offline_fn() and radix_tree_delete() into
> 2 phases, it may introduce a race between blkcg_deactivate_policy()
> on queue exit and blkcg_css_free(), which will result in
> pd_offline_fn() being called twice.

So, yeah, the sync scheme around blkg is pretty brittle and we'd need
some restructuring to separate out blkg offlining and release, but it
looks like that'd be the right thing to do, no?

Thanks.
On 18/3/5 04:23, Tejun Heo wrote:
> Hello, Joseph.
>
> Sorry about the late reply.
>
> On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
>> In the current code, I'm afraid pd_offline_fn() as well as the rest of
>> the destruction has to be called together under the same blkcg->lock
>> and q->queue_lock.
>> For example, if we split pd_offline_fn() and radix_tree_delete() into
>> 2 phases, it may introduce a race between blkcg_deactivate_policy()
>> on queue exit and blkcg_css_free(), which will result in
>> pd_offline_fn() being called twice.
>
> So, yeah, the sync scheme around blkg is pretty brittle and we'd need
> some restructuring to separate out blkg offlining and release, but it
> looks like that'd be the right thing to do, no?
>

Agreed. Except for the restriction above, I haven't found any others
so far. I'll try to fix it in the way you suggested and post v3.

Thanks,
Joseph
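A rough outline of the split agreed on here (pd_offline_fn() from the css-offline path, the rest of the teardown from blkcg_css_free()) might look like the following. This is only a hypothetical sketch: the helper names blkg_pd_offline() and blkg_destroy_all_offlined() are made up, and the real code also has to take q->queue_lock and guard against blkcg_deactivate_policy() offlining the same pd twice, as Joseph notes above.

/* Hypothetical outline only -- locking and the double-offline guard
 * are omitted/simplified. */
static void blkcg_css_offline(struct cgroup_subsys_state *css)
{
	struct blkcg *blkcg = css_to_blkcg(css);
	struct blkcg_gq *blkg;

	spin_lock_irq(&blkcg->lock);
	/* Tell the policies to stop, but keep the blkgs (and their
	 * refcounts) alive so concurrent rcu-protected lookups in
	 * blkcg_bio_issue_check() stay safe. */
	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node)
		blkg_pd_offline(blkg);		/* made-up helper */
	spin_unlock_irq(&blkcg->lock);
}

static void blkcg_css_free(struct cgroup_subsys_state *css)
{
	struct blkcg *blkcg = css_to_blkcg(css);

	/* No new lookups can happen once the css is being freed, so it
	 * is now safe to actually tear the blkgs down. */
	blkg_destroy_all_offlined(blkcg);	/* made-up helper */
	kfree(blkcg);
}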
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524..b1d22e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -162,7 +162,6 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 
 	return NULL;
 }
-EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);
 
 /*
  * If @new_blkg is %NULL, this function tries to allocate a new one as
@@ -306,6 +305,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 		return blkg;
 	}
 }
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..decdd76 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2143,26 +2143,41 @@ static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 #endif
 }
 
-bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
 		    struct bio *bio)
 {
+	struct throtl_data *td = q->td;
 	struct throtl_qnode *qn = NULL;
-	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+	struct throtl_grp *tg;
+	struct blkcg_gq *blkg;
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
-	struct throtl_data *td = tg->td;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
+	/*
+	 * If a group has no rules, just update the dispatch stats in lockless
+	 * manner and return.
+	 */
+	blkg = blkg_lookup(blkcg, q);
+	tg = blkg_to_tg(blkg);
+	if (tg && !tg->has_rules[rw])
+		goto out;
+
 	/* see throtl_charge_bio() */
-	if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
+	if (bio_flagged(bio, BIO_THROTTLED))
 		goto out;
 
 	spin_lock_irq(q->queue_lock);
 
 	throtl_update_latency_buckets(td);
 
+	blkg = blkg_lookup_create(blkcg, q);
+	if (IS_ERR(blkg))
+		blkg = q->root_blkg;
+	tg = blkg_to_tg(blkg);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
@@ -2253,6 +2268,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (throttled || !td->track_bio_latency)
 		bio->bi_issue_stat.stat |= SKIP_LATENCY;
 #endif
+	if (!throttled) {
+		blkg = blkg ?: q->root_blkg;
+		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
+				bio->bi_iter.bi_size);
+		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+	}
+
 	return throttled;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..60f53b5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1674,15 +1674,28 @@ static void cfq_pd_reset_stats(struct blkg_policy_data *pd)
 	cfqg_stats_reset(&cfqg->stats);
 }
 
-static struct cfq_group *cfq_lookup_cfqg(struct cfq_data *cfqd,
-					 struct blkcg *blkcg)
+/*
+ * Search for the cfq group current task belongs to. request_queue lock must
+ * be held.
+ */
+static struct cfq_group *cfq_lookup_create_cfqg(struct cfq_data *cfqd,
+						struct blkcg *blkcg)
 {
-	struct blkcg_gq *blkg;
+	struct request_queue *q = cfqd->queue;
+	struct cfq_group *cfqg = NULL;
 
-	blkg = blkg_lookup(blkcg, cfqd->queue);
-	if (likely(blkg))
-		return blkg_to_cfqg(blkg);
-	return NULL;
+	/* avoid lookup for the common case where there's no blkcg */
+	if (blkcg == &blkcg_root) {
+		cfqg = cfqd->root_group;
+	} else {
+		struct blkcg_gq *blkg;
+
+		blkg = blkg_lookup_create(blkcg, q);
+		if (!IS_ERR(blkg))
+			cfqg = blkg_to_cfqg(blkg);
+	}
+
+	return cfqg;
 }
 
 static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
@@ -2178,8 +2191,8 @@ static ssize_t cfq_set_weight_on_dfl(struct kernfs_open_file *of,
 };
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_lookup_cfqg(struct cfq_data *cfqd,
-					 struct blkcg *blkcg)
+static struct cfq_group *cfq_lookup_create_cfqg(struct cfq_data *cfqd,
+						struct blkcg *blkcg)
 {
 	return cfqd->root_group;
 }
@@ -3814,7 +3827,7 @@ static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	struct cfq_group *cfqg;
 
 	rcu_read_lock();
-	cfqg = cfq_lookup_cfqg(cfqd, bio_blkcg(bio));
+	cfqg = cfq_lookup_create_cfqg(cfqd, bio_blkcg(bio));
 	if (!cfqg) {
 		cfqq = &cfqd->oom_cfqq;
 		goto out;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 69bea82..e667841 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -428,8 +428,8 @@ static inline struct request_list *blk_get_rl(struct request_queue *q,
 	 * or if either the blkcg or queue is going away.  Fall back to
 	 * root_rl in such cases.
 	 */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(!blkg))
+	blkg = blkg_lookup_create(blkcg, q);
+	if (unlikely(IS_ERR(blkg)))
 		goto root_rl;
 
 	blkg_get(blkg);
@@ -672,10 +672,10 @@ static inline void blkg_rwstat_add_aux(struct blkg_rwstat *to,
 }
 
 #ifdef CONFIG_BLK_DEV_THROTTLING
-extern bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+extern bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
 			   struct bio *bio);
 #else
-static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
 				  struct bio *bio) { return false; }
 #endif
 
@@ -683,7 +683,6 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio)
 {
 	struct blkcg *blkcg;
-	struct blkcg_gq *blkg;
 	bool throtl = false;
 
 	rcu_read_lock();
@@ -692,23 +691,7 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	/* associate blkcg if bio hasn't attached one */
 	bio_associate_blkcg(bio, &blkcg->css);
 
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(!blkg)) {
-		spin_lock_irq(q->queue_lock);
-		blkg = blkg_lookup_create(blkcg, q);
-		if (IS_ERR(blkg))
-			blkg = NULL;
-		spin_unlock_irq(q->queue_lock);
-	}
-
-	throtl = blk_throtl_bio(q, blkg, bio);
-
-	if (!throtl) {
-		blkg = blkg ?: q->root_blkg;
-		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
-				bio->bi_iter.bi_size);
-		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
-	}
+	throtl = blk_throtl_bio(q, blkcg, bio);
 
 	rcu_read_unlock();
 	return !throtl;
We've triggered a WARNING in blk_throtl_bio when throttling writeback
io: it complains that blkg->refcnt is already 0 when calling blkg_get,
and then the kernel crashes with an invalid page request.
After investigating this issue, we've found there is a race between
blkcg_bio_issue_check and cgroup_rmdir. The race is described below.

writeback kworker
blkcg_bio_issue_check
  rcu_read_lock
  blkg_lookup
  <<< *race window*
  blk_throtl_bio
    spin_lock_irq(q->queue_lock)
    spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

cgroup_rmdir
  cgroup_destroy_locked
    kill_css
      css_killed_ref_fn
      css_killed_work_fn
        offline_css
          blkcg_css_offline
            spin_trylock(q->queue_lock)
            blkg_destroy
            spin_unlock(q->queue_lock)

Since rcu can only prevent the blkg from being released while it is in
use, blkg->refcnt can drop to 0 during blkg_destroy, which schedules
the blkg release.
Then the blkg_get in blk_throtl_bio triggers the WARNING, and the
corresponding blkg_put schedules the blkg release again, which results
in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate
blkg creation in blkcg_bio_issue_check()"). Before that commit, the
code looked up the blkg first and then looked it up/created it again
under queue_lock. So revive this logic to fix the race.

v2: fix a potential NULL pointer dereference when updating stats

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@vger.kernel.org
---
 block/blk-cgroup.c         |  2 +-
 block/blk-throttle.c       | 30 ++++++++++++++++++++++++++----
 block/cfq-iosched.c        | 33 +++++++++++++++++++++++----------
 include/linux/blk-cgroup.h | 27 +++++----------------------
 4 files changed, 55 insertions(+), 37 deletions(-)