Message ID | 20210610173944.1203706-4-schatzberg.dan@gmail.com
State      | New, archived
Series     | Charge loop device i/o to issuing cgroup
Hi.

On Thu, Jun 10, 2021 at 10:39:44AM -0700, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> The current code only associates with the existing blkcg when aio is
> used to access the backing file. This patch covers all types of i/o to
> the backing file and also associates the memcg so if the backing file is
> on tmpfs, memory is charged appropriately.
>
> This patch also exports cgroup_get_e_css and int_active_memcg so it
> can be used by the loop module.

Wouldn't it be clearer to export (not explicitly inlined anymore)
set_active_memcg() instead of int_active_memcg, which is rather an
implementation detail?

> @@ -2111,13 +2112,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> 	}
>
> 	/* always use the first bio's css */
> +	cmd->blkcg_css = NULL;
> +	cmd->memcg_css = NULL;
>  #ifdef CONFIG_BLK_CGROUP
> -	if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
> -		cmd->css = &bio_blkcg(rq->bio)->css;
> -		css_get(cmd->css);
> -	} else
> +	if (rq->bio && rq->bio->bi_blkg) {
> +		cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
> +#ifdef CONFIG_MEMCG
> +		cmd->memcg_css =
> +			cgroup_get_e_css(cmd->blkcg_css->cgroup,
> +					 &memory_cgrp_subsys);
> +#endif
> +	}
>  #endif
> -	cmd->css = NULL;
> 	loop_queue_work(lo, cmd);

I see you dropped the cmd->blkcg_css reference (while rq is handled). Is
it intentional?

Thanks,
Michal
Hi Michal,

On Fri, Jun 25, 2021 at 05:01:03PM +0200, Michal Koutný wrote:
> Hi.
>
> On Thu, Jun 10, 2021 at 10:39:44AM -0700, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> > The current code only associates with the existing blkcg when aio is
> > used to access the backing file. This patch covers all types of i/o to
> > the backing file and also associates the memcg so if the backing file is
> > on tmpfs, memory is charged appropriately.
> >
> > This patch also exports cgroup_get_e_css and int_active_memcg so it
> > can be used by the loop module.
>
> Wouldn't it be clearer to export (not explicitly inlined anymore)
> set_active_memcg() instead of int_active_memcg, which is rather an
> implementation detail?

Agreed that exporting int_active_memcg is an implementation detail,
but would this prevent set_active_memcg from being inlined? Is that
desirable?

> > @@ -2111,13 +2112,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> > 	}
> >
> > 	/* always use the first bio's css */
> > +	cmd->blkcg_css = NULL;
> > +	cmd->memcg_css = NULL;
> >  #ifdef CONFIG_BLK_CGROUP
> > -	if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
> > -		cmd->css = &bio_blkcg(rq->bio)->css;
> > -		css_get(cmd->css);
> > -	} else
> > +	if (rq->bio && rq->bio->bi_blkg) {
> > +		cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
> > +#ifdef CONFIG_MEMCG
> > +		cmd->memcg_css =
> > +			cgroup_get_e_css(cmd->blkcg_css->cgroup,
> > +					 &memory_cgrp_subsys);
> > +#endif
> > +	}
> >  #endif
> > -	cmd->css = NULL;
> > 	loop_queue_work(lo, cmd);
>
> I see you dropped the cmd->blkcg_css reference (while rq is handled). Is
> it intentional?

Yes, it is intentional. All requests (not just aio) go through the loop
worker, which grabs the blkcg reference in loop_queue_work() on
construction. So I believe grabbing a reference per request is
unnecessary.
On Mon, Jun 28, 2021 at 10:17:18AM -0400, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> Agreed that exporting int_active_memcg is an implementation detail,
> but would this prevent set_active_memcg from being inlined?

Non-inlining in the loop module doesn't seem like a big trouble. OTOH,
other callers may be more sensitive and would need to rely on inlining.

I can't currently think of a nice way to have both the exported and the
explicitly inlined variant at once. It seems to come down to a trade-off
between API cleanliness and performance, and both concerns are uncertain
here, so I guess the current approach is fine in the end.

> Yes, it is intentional. All requests (not just aio) go through the loop
> worker, which grabs the blkcg reference in loop_queue_work() on
> construction. So I believe grabbing a reference per request is
> unnecessary.

Isn't there a window without the reference between loop_queue_rq and
loop_queue_work? I don't know, you seem to know better, so I'd suggest
dropping a comment line into the code explaining this.

Thanks,
Michal
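For reference, set_active_memcg() at the time of this series is a header-only
static inline that switches between a per-CPU slot and current->active_memcg.
The sketch below is paraphrased rather than copied from the tree, but it shows
why exporting the per-CPU int_active_memcg variable is enough for modules to
keep using the inlined helper:

	/* Paraphrased sketch, not the exact upstream definition. */
	static inline struct mem_cgroup *
	set_active_memcg(struct mem_cgroup *memcg)
	{
		struct mem_cgroup *old;

		if (in_interrupt()) {
			/* this per-CPU variable must be visible to modules */
			old = this_cpu_read(int_active_memcg);
			this_cpu_write(int_active_memcg, memcg);
		} else {
			old = current->active_memcg;
			current->active_memcg = memcg;
		}
		return old;
	}

Exporting set_active_memcg() as a regular function instead would avoid
exposing int_active_memcg, at the cost of an out-of-line call for every
caller, which is the trade-off discussed above.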
> Non-inlining in the loop module doesn't seem like a big trouble. OTOH,
> other callers may be more sensitive and would need to rely on inlining.

Yes, this is my concern as well.

> I can't currently think of a nice way to have both the exported and the
> explicitly inlined variant at once. It seems to come down to a trade-off
> between API cleanliness and performance, and both concerns are uncertain
> here, so I guess the current approach is fine in the end.
>
> > Yes, it is intentional. All requests (not just aio) go through the loop
> > worker, which grabs the blkcg reference in loop_queue_work() on
> > construction. So I believe grabbing a reference per request is
> > unnecessary.
>
> Isn't there a window without the reference between loop_queue_rq and
> loop_queue_work?

Hmm, perhaps I'm not understanding how the reference counting works,
but my understanding is that we enter loop_queue_rq with some earlier
code presumably holding a reference to the blkcg, so we only need to
acquire a reference sometime before returning from loop_queue_rq. The
"window" between loop_queue_rq and loop_queue_work is all
straight-line code, so there's no possibility for the earlier code to
get control back and drop the reference.

> I don't know, you seem to know better, so I'd suggest
> dropping a comment line into the code explaining this.

I wouldn't be so sure that I know any better here :D - I'm fairly
inexperienced in this domain. Where would you suggest putting such a
comment? The change in question removed a particular case where we
explicitly grab a reference to the blkcg, because now we do it
uniformly in one place. Would you like a comment explaining why we
acquire a reference for all loop workers, or one explaining
specifically why we don't need to acquire one for aio?
On Tue, Jun 29, 2021 at 10:03:33AM -0400, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> Hmm, perhaps I'm not understanding how the reference counting works,
> but my understanding is that we enter loop_queue_rq with some earlier
> code presumably holding a reference to the blkcg, so we only need to
> acquire a reference sometime before returning from loop_queue_rq. The
> "window" between loop_queue_rq and loop_queue_work is all
> straight-line code, so there's no possibility for the earlier code to
> get control back and drop the reference.

I'm not saying the current implementation is wrong, it just looked
suspicious to me that the css address is copied without taking a
reference. The straight path is clear; I'm less sure about the later
invocations through loop_workfn, where the blkcg_css is accessed via
cmd->blkcg_css.

> Where would you suggest putting such a comment?

This is how I understand it:

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -996,6 +996,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 	rb_insert_color(&worker->rb_node, &lo->worker_tree);
 queue_work:
 	if (worker) {
+		WARN_ON_ONCE(worker->blkcg_css != cmd->blkcg_css);
 		/*
 		 * We need to remove from the idle list here while
 		 * holding the lock so that the idle timer doesn't
@@ -2106,6 +2107,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
 	if (rq->bio && rq->bio->bi_blkg) {
+		/* reference to blkcg_css will be held by loop_worker (outlives
+		 * cmd) or it is the eternal root css */
 		cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
 #ifdef CONFIG_MEMCG
 		cmd->memcg_css =

(On further thought, maybe the blkcg_css reference isn't needed even
in the loop_worker, if it can be reasoned that blkcg_css won't go away
while there's an outstanding rq.)

HTH,
Michal
> This is how I understand it:
>
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -996,6 +996,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
> 	rb_insert_color(&worker->rb_node, &lo->worker_tree);
> queue_work:
> 	if (worker) {
> +		WARN_ON_ONCE(worker->blkcg_css != cmd->blkcg_css);

Yes, this is correct. Though the check here seems a bit obvious to me -
it must be correct because we assign worker above:

	if (cur_worker->blkcg_css == cmd->blkcg_css) {
		worker = cur_worker;
		break;

or when we construct the worker:

	worker->blkcg_css = cmd->blkcg_css;

I think this WARN_ON_ONCE check might be more interesting in
loop_process_work, which invokes loop_handle_cmd and actually uses
cmd->blkcg_css. In any event, your understanding is correct here.

> 		/*
> 		 * We need to remove from the idle list here while
> 		 * holding the lock so that the idle timer doesn't
> @@ -2106,6 +2107,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> 	cmd->memcg_css = NULL;
> #ifdef CONFIG_BLK_CGROUP
> 	if (rq->bio && rq->bio->bi_blkg) {
> +		/* reference to blkcg_css will be held by loop_worker (outlives
> +		 * cmd) or it is the eternal root css */

Yes, this is correct. Feel free to add my Acked-by to such a patch.
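To summarize the refcounting scheme the thread converged on, here is a
condensed paraphrase of the relevant pieces of the patch below (not a
complete or compilable excerpt; locking and error paths elided):

	/* loop_queue_rq(): only copy the pointer - the submitting context
	 * still holds a blkcg reference for the outstanding rq here. */
	cmd->blkcg_css = &bio_blkcg(rq->bio)->css;

	/* loop_queue_work(): the first cmd for a given blkcg creates a worker
	 * that takes the long-lived reference; the worker outlives every cmd
	 * queued to it. */
	worker->blkcg_css = cmd->blkcg_css;
	css_get(worker->blkcg_css);

	/* __loop_clr_fd() / loop_free_idle_workers(): the reference is only
	 * dropped when the worker itself is torn down. */
	css_put(worker->blkcg_css);
	kfree(worker);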
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index fc4a0186d381..5198d8ad181c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
 #include <linux/uio.h>
 #include <linux/ioprio.h>
 #include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
 
 #include "loop.h"
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
 	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-	if (cmd->css)
-		css_put(cmd->css);
 	cmd->ret = ret;
 	lo_rw_aio_do_completion(cmd);
 }
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 	cmd->iocb.ki_complete = lo_rw_aio_complete;
 	cmd->iocb.ki_flags = IOCB_DIRECT;
 	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-	if (cmd->css)
-		kthread_associate_blkcg(cmd->css);
 
 	if (rw == WRITE)
 		ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -587,7 +584,6 @@
 		ret = call_read_iter(file, &cmd->iocb, &iter);
 
 	lo_rw_aio_do_completion(cmd);
-	kthread_associate_blkcg(NULL);
 
 	if (ret != -EIOCBQUEUED)
 		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
 	struct list_head cmd_list;
 	struct list_head idle_list;
 	struct loop_device *lo;
-	struct cgroup_subsys_state *css;
+	struct cgroup_subsys_state *blkcg_css;
 	unsigned long last_ran_at;
 };
@@ -957,7 +953,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 
 	spin_lock_irq(&lo->lo_work_lock);
 
-	if (queue_on_root_worker(cmd->css))
+	if (queue_on_root_worker(cmd->blkcg_css))
 		goto queue_work;
 
 	node = &lo->worker_tree.rb_node;
@@ -965,10 +961,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 	while (*node) {
 		parent = *node;
 		cur_worker = container_of(*node, struct loop_worker, rb_node);
-		if (cur_worker->css == cmd->css) {
+		if (cur_worker->blkcg_css == cmd->blkcg_css) {
 			worker = cur_worker;
 			break;
-		} else if ((long)cur_worker->css < (long)cmd->css) {
+		} else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
 			node = &(*node)->rb_left;
 		} else {
 			node = &(*node)->rb_right;
@@ -980,13 +976,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 		worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
 		/*
 		 * In the event we cannot allocate a worker, just queue on the
-		 * rootcg worker
+		 * rootcg worker and issue the I/O as the rootcg
 		 */
-		if (!worker)
+		if (!worker) {
+			cmd->blkcg_css = NULL;
+			if (cmd->memcg_css)
+				css_put(cmd->memcg_css);
+			cmd->memcg_css = NULL;
 			goto queue_work;
+		}
 
-		worker->css = cmd->css;
-		css_get(worker->css);
+		worker->blkcg_css = cmd->blkcg_css;
+		css_get(worker->blkcg_css);
 		INIT_WORK(&worker->work, loop_workfn);
 		INIT_LIST_HEAD(&worker->cmd_list);
 		INIT_LIST_HEAD(&worker->idle_list);
@@ -1306,7 +1307,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 				idle_list) {
 		list_del(&worker->idle_list);
 		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->css);
+		css_put(worker->blkcg_css);
 		kfree(worker);
 	}
 	spin_unlock_irq(&lo->lo_work_lock);
@@ -2111,13 +2112,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 
 	/* always use the first bio's css */
+	cmd->blkcg_css = NULL;
+	cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-	if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-		cmd->css = &bio_blkcg(rq->bio)->css;
-		css_get(cmd->css);
-	} else
+	if (rq->bio && rq->bio->bi_blkg) {
+		cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+		cmd->memcg_css =
+			cgroup_get_e_css(cmd->blkcg_css->cgroup,
+					 &memory_cgrp_subsys);
+#endif
+	}
 #endif
-	cmd->css = NULL;
 	loop_queue_work(lo, cmd);
 
 	return BLK_STS_OK;
@@ -2129,13 +2135,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	const bool write = op_is_write(req_op(rq));
 	struct loop_device *lo = rq->q->queuedata;
 	int ret = 0;
+	struct mem_cgroup *old_memcg = NULL;
 
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
 		ret = -EIO;
 		goto failed;
 	}
 
+	if (cmd->blkcg_css)
+		kthread_associate_blkcg(cmd->blkcg_css);
+	if (cmd->memcg_css)
+		old_memcg = set_active_memcg(
+			mem_cgroup_from_css(cmd->memcg_css));
+
 	ret = do_req_filebacked(lo, rq);
+
+	if (cmd->blkcg_css)
+		kthread_associate_blkcg(NULL);
+
+	if (cmd->memcg_css) {
+		set_active_memcg(old_memcg);
+		css_put(cmd->memcg_css);
+	}
 failed:
 	/* complete non-aio request */
 	if (!cmd->use_aio || ret) {
@@ -2214,7 +2235,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
 			break;
 		list_del(&worker->idle_list);
 		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->css);
+		css_put(worker->blkcg_css);
 		kfree(worker);
 	}
 	if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
 	long ret;
 	struct kiocb iocb;
 	struct bio_vec *bvec;
-	struct cgroup_subsys_state *css;
+	struct cgroup_subsys_state *blkcg_css;
+	struct cgroup_subsys_state *memcg_css;
 };
 
 /* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bd0644d3a6df..360e61de53d7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1230,6 +1230,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return NULL;
 }
 
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 74e3cc801615..90329cfff48d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
 	rcu_read_unlock();
 	return css;
 }
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
 
 static void cgroup_get_live(struct cgroup *cgrp)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 919736ee656b..ae1f5d0cb581 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
 
 /* Socket memory accounting disabled? */
 static bool cgroup_memory_nosocket __ro_after_init;
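Taken together, the two exports above (cgroup_get_e_css and int_active_memcg)
let a module attribute deferred work to the issuing cgroup on both the block
and memory side. A minimal sketch of the pattern, modeled on loop_handle_cmd()
in the diff; the wrapper name and callback are hypothetical, not part of the
patch:

	/* Hypothetical wrapper illustrating the attribution pattern. */
	static void issue_as_origin_cgroup(struct cgroup_subsys_state *blkcg_css,
					   struct cgroup_subsys_state *memcg_css,
					   void (*do_io)(void *), void *arg)
	{
		struct mem_cgroup *old_memcg = NULL;

		if (blkcg_css)
			kthread_associate_blkcg(blkcg_css);	/* block i/o accounting */
		if (memcg_css)
			old_memcg = set_active_memcg(mem_cgroup_from_css(memcg_css));

		do_io(arg);	/* e.g. reads/writes on a tmpfs-backed file */

		if (memcg_css)
			set_active_memcg(old_memcg);
		if (blkcg_css)
			kthread_associate_blkcg(NULL);
	}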