Message ID: 20210402191638.3249835-1-schatzberg.dan@gmail.com
Series: Charge loop device i/o to issuing cgroup
Hi Hillf, thanks for the review.

On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> On Fri, 2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> > +queue_work:
> > +	if (worker) {
> > +		/*
> > +		 * We need to remove from the idle list here while
> > +		 * holding the lock so that the idle timer doesn't
> > +		 * free the worker
> > +		 */
> > +		if (!list_empty(&worker->idle_list))
> > +			list_del_init(&worker->idle_list);
>
> Nit, only queue work if the worker is inactive - otherwise it is taking
> care of the cmd_list.

By "worker is inactive", you mean the worker is on the idle_list? Yes,
I think you're right that queue_work() is unnecessary in that case,
since each worker checks for an empty cmd_list and then adds itself to
the idle_list under the lock.

> > +		work = &worker->work;
> > +		cmd_list = &worker->cmd_list;
> > +	} else {
> > +		work = &lo->rootcg_work;
> > +		cmd_list = &lo->rootcg_cmd_list;
> > +	}
> > +	list_add_tail(&cmd->list_entry, cmd_list);
> > +	queue_work(lo->workqueue, work);
> > +	spin_unlock_irq(&lo->lo_work_lock);
> >  }
> [...]
> > +	/*
> > +	 * We only add to the idle list if there are no pending cmds
> > +	 * *and* the worker will not run again which ensures that it
> > +	 * is safe to free any worker on the idle list
> > +	 */
> > +	if (worker && !work_pending(&worker->work)) {
>
> The empty cmd_list is a good enough reason for worker to become idle.

This is only true with the above change to avoid a gratuitous
queue_work(), right? Otherwise we run the risk of freeing a worker
concurrently with loop_process_work() being invoked.
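For reference, the locking protocol under discussion can be condensed
as follows. This is a minimal sketch based only on the hunks quoted
above, not the posted patch itself; last_ran_at and idle_worker_list
are assumed names for the idle-timer bookkeeping. Both sides take
lo->lo_work_lock, which is what makes the idle-timer interaction safe:

	/*
	 * Queue side (loop_queue_work): pull the worker off the idle
	 * list under the lock so the idle timer cannot free it, then
	 * hand it work.
	 */
	spin_lock_irq(&lo->lo_work_lock);
	if (!list_empty(&worker->idle_list))
		list_del_init(&worker->idle_list);
	list_add_tail(&cmd->list_entry, &worker->cmd_list);
	queue_work(lo->workqueue, &worker->work);
	spin_unlock_irq(&lo->lo_work_lock);

	/*
	 * Worker side (tail of loop_process_work): go idle only once
	 * cmd_list is empty *and* no further invocation is pending, so
	 * any worker found on the idle list will not run again and is
	 * safe to free.
	 */
	spin_lock_irq(&lo->lo_work_lock);
	if (list_empty(&worker->cmd_list) &&
	    !work_pending(&worker->work)) {
		worker->last_ran_at = jiffies;	/* assumed field */
		list_add_tail(&worker->idle_list, &lo->idle_worker_list);
	}
	spin_unlock_irq(&lo->lo_work_lock);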
On Wed, Apr 07, 2021 at 02:53:00PM +0800, Hillf Danton wrote:
> On Tue, 6 Apr 2021 Dan Schatzberg wrote:
> >On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> >> On Fri, 2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> >> > +queue_work:
> >> > +	if (worker) {
> >> > +		/*
> >> > +		 * We need to remove from the idle list here while
> >> > +		 * holding the lock so that the idle timer doesn't
> >> > +		 * free the worker
> >> > +		 */
> >> > +		if (!list_empty(&worker->idle_list))
> >> > +			list_del_init(&worker->idle_list);
> >>
> >> Nit, only queue work if the worker is inactive - otherwise it is taking
> >> care of the cmd_list.
> >
> >By worker is inactive, you mean worker is on the idle_list? Yes, I
> >think you're right that queue_work() is unnecessary in that case since
> >each worker checks empty cmd_list then adds itself to idle_list under
> >the lock.

A couple of other corner cases: when a worker is just allocated, it
needs a queue_work(), and rootcg always needs a queue_work() since it
never sits on the idle_list. It does add to the logic a bit rather
than just unconditionally invoking queue_work().

> >
> >>
> >> > +		work = &worker->work;
> >> > +		cmd_list = &worker->cmd_list;
> >> > +	} else {
> >> > +		work = &lo->rootcg_work;
> >> > +		cmd_list = &lo->rootcg_cmd_list;
> >> > +	}
> >> > +	list_add_tail(&cmd->list_entry, cmd_list);
> >> > +	queue_work(lo->workqueue, work);
> >> > +	spin_unlock_irq(&lo->lo_work_lock);
> >> >  }
> >> [...]
> >> > +	/*
> >> > +	 * We only add to the idle list if there are no pending cmds
> >> > +	 * *and* the worker will not run again which ensures that it
> >> > +	 * is safe to free any worker on the idle list
> >> > +	 */
> >> > +	if (worker && !work_pending(&worker->work)) {
> >>
> >> The empty cmd_list is a good enough reason for worker to become idle.
> >
> >This is only true with the above change to avoid a gratuitous
> >queue_work(), right?
>
> It is always true because of the empty cmd_list - the idle_list is the
> only place for the worker to go at this point.
>
> >Otherwise we run the risk of freeing a worker
> >concurrently with loop_process_work() being invoked.
>
> My suggestion is a minor optimization at most without any change to
> removing worker off the idle_list on queuing work - that cuts the risk
> for you.

If I just change this line from

	if (worker && !work_pending(&worker->work)) {

to

	if (worker) {

then the following sequence of events is possible:

1) loop_queue_work runs, adds a command to the worker list
2) loop_process_work runs, processes a single command and then drops
   the lock and reschedules
3) loop_queue_work runs again, acquires the lock, adds to the list and
   invokes queue_work() again
4) loop_process_work resumes, acquires the lock, processes work,
   notices the list is empty and adds itself to the idle_list
5) idle timer fires and frees the worker
6) loop_process_work runs again (because of the queue_work in 3) and
   accesses freed memory

The !work_pending() check prevents 4) from adding itself to the
idle_list, so this is not possible. I believe we can only make this
change if we also make the other change you suggested to avoid the
gratuitous queue_work().
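Putting the two changes together - skipping the gratuitous
queue_work() and relaxing the idle check to plain "if (worker)" - the
queue path might look roughly like the sketch below. This is an
illustration of the discussion, not the posted patch; worker_is_new is
a hypothetical flag marking a freshly allocated worker:

	spin_lock_irq(&lo->lo_work_lock);
	if (worker) {
		/*
		 * Kick the workqueue only if the worker is not already
		 * running, i.e. it was idle or just allocated; a
		 * running worker drains its cmd_list by itself.
		 */
		bool need_kick = worker_is_new ||
				 !list_empty(&worker->idle_list);

		list_del_init(&worker->idle_list); /* no-op if not idle */
		list_add_tail(&cmd->list_entry, &worker->cmd_list);
		if (need_kick)
			queue_work(lo->workqueue, &worker->work);
	} else {
		/* rootcg never sits on the idle list, so always kick */
		list_add_tail(&cmd->list_entry, &lo->rootcg_cmd_list);
		queue_work(lo->workqueue, &lo->rootcg_work);
	}
	spin_unlock_irq(&lo->lo_work_lock);

With no extra queue_work() issued in step 3 of the race above, a
worker that finds its cmd_list empty can park itself on the idle_list
without the work_pending() check.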
It looks like all feedback has been addressed and there hasn't been
any new activity on it in a while.

As per the suggestion last time [1], Andrew, Jens, could this go
through the -mm tree to deal with the memcg conflicts?

[1] https://lore.kernel.org/lkml/CALvZod6FMQQC17Zsu9xoKs=dFWaJdMC2Qk3YiDPUUQHx8teLYg@mail.gmail.com/

On Fri, Apr 02, 2021 at 12:16:31PM -0700, Dan Schatzberg wrote:
> No major changes, rebased on top of latest mm tree
>
> Changes since V12:
>
> * Small change to get_mem_cgroup_from_mm to avoid needing
>   get_active_memcg
>
> Changes since V11:
>
> * Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
>   can be driven by writeback, but this was causing a warning in xfs
>   and likely other filesystems aren't equipped to be driven by
>   reclaim at the VFS layer.
> * Included a small fix from Colin Ian King.
> * Reworked get_mem_cgroup_from_mm to institute the necessary charge
>   priority.
>
> Changes since V10:
>
> * Added page-cache charging to mm: Charge active memcg when no mm is set
>
> Changes since V9:
>
> * Rebased against linus's branch which now includes Roman Gushchin's
>   patch this series is based off of
>
> Changes since V8:
>
> * Rebased on top of Roman Gushchin's patch
>   (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
>   support for setting active memcg. Dropped the patch from this
>   series that did the same thing.
>
> Changes since V7:
>
> * Rebased against linus's branch
>
> Changes since V6:
>
> * Added separate spinlock for worker synchronization
> * Minor style changes
>
> Changes since V5:
>
> * Fixed a missing css_put when failing to allocate a worker
> * Minor style changes
>
> Changes since V4:
>
> Only patches 1 and 2 have changed.
>
> * Fixed irq lock ordering bug
> * Simplified loop detach
> * Added support for nesting memalloc_use_memcg
>
> Changes since V3:
>
> * Fix race on loop device destruction and deferred worker cleanup
> * Ensure charge on shmem_swapin_page works just like getpage
> * Minor style changes
>
> Changes since V2:
>
> * Deferred destruction of workqueue items so in the common case there
>   is no allocation needed
>
> Changes since V1:
>
> * Split out and reordered patches so cgroup charging changes are
>   separate from kworker -> workqueue change
> * Add mem_css to struct loop_cmd to simplify logic
>
> The loop device runs all i/o to the backing file on a separate
> kworker thread which results in all i/o being charged to the root
> cgroup. This allows a loop device to be used to trivially bypass
> resource limits and other policy. This patch series fixes this gap in
> accounting.
>
> A simple script to demonstrate this behavior on a cgroupv2 machine:
>
> '''
> #!/bin/bash
> set -e
>
> CGROUP=/sys/fs/cgroup/test.slice
> LOOP_DEV=/dev/loop0
>
> if [[ ! -d $CGROUP ]]
> then
>     sudo mkdir $CGROUP
> fi
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit to tmpfs -> OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit through loopback
> # device -> no OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
> losetup -D $LOOP_DEV" || true
>
> grep oom_kill $CGROUP/memory.events
> '''
>
> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device. This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
>
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.
>
> Dan Schatzberg (3):
>   loop: Use worker per cgroup instead of kworker
>   mm: Charge active memcg when no mm is set
>   loop: Charge i/o to mem and blk cg
>
>  drivers/block/loop.c       | 244 ++++++++++++++++++++++++++++++-------
>  drivers/block/loop.h       |  15 ++-
>  include/linux/memcontrol.h |   6 +
>  kernel/cgroup/cgroup.c     |   1 +
>  mm/filemap.c               |   2 +-
>  mm/memcontrol.c            |  49 +++++---
>  mm/shmem.c                 |   4 +-
>  7 files changed, 253 insertions(+), 68 deletions(-)
>
> --
> 2.30.2
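The core of the charging approach in the series is that each
per-cgroup worker performs its backing-file i/o while associated with
the issuing cgroups. A condensed sketch of that idea, assuming the
worker struct carries the issuing css pointers (the struct and field
names here are illustrative, not necessarily those in the patch):

	#include <linux/kthread.h>
	#include <linux/memcontrol.h>

	/*
	 * Illustrative worker state; the real per-cgroup worker in
	 * the patch carries more than this.
	 */
	struct loop_worker_sketch {
		struct cgroup_subsys_state *blkcg_css;	/* issuing blk cgroup */
		struct mem_cgroup *memcg;		/* issuing mem cgroup */
	};

	static void loop_process_work_sketch(struct loop_worker_sketch *worker)
	{
		struct mem_cgroup *old_memcg;

		/* Attribute block i/o from this kworker to the issuing blkcg */
		kthread_associate_blkcg(worker->blkcg_css);
		/* Charge memory allocated during the i/o to the issuing memcg */
		old_memcg = set_active_memcg(worker->memcg);

		/*
		 * ... pop cmds off the worker's cmd_list and perform
		 * the backing-file reads/writes here ...
		 */

		set_active_memcg(old_memcg);
		kthread_associate_blkcg(NULL);
	}

This keeps the kworker threads themselves shared while making each
command's memory charges and block i/o land in the cgroup that issued
it, which is what lets the demo script above trigger the expected OOM
kill.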
On 4/12/21 9:45 AM, Johannes Weiner wrote:
> It looks like all feedback has been addressed and there hasn't been
> any new activity on it in a while.
>
> As per the suggestion last time [1], Andrew, Jens, could this go
> through the -mm tree to deal with the memcg conflicts?

Yep, I think that would make it the most painless for everyone.
Dan/Andrew, you can add my Acked-by to the series.