[0/4] Charge loop device i/o to issuing cgroup

Message ID 20200420223936.6773-1-schatzberg.dan@gmail.com

Message

Dan Schatzberg April 20, 2020, 10:39 p.m. UTC
Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
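One way to see where a given thread is accounted is to read its cgroup
membership from /proc. The following is a minimal sketch, not part of the
series; the function name cgroup_v2_path is hypothetical, and it assumes a
cgroup v2 hierarchy, where the membership line has the form "0::<path>".
Kernel worker threads normally sit in the root cgroup, which shows up as "/".

```shell
#!/bin/bash
# Hypothetical helper (not from the patch series): print the cgroup v2
# path recorded in a /proc/<pid>/cgroup style file.
cgroup_v2_path() {
    # The cgroup v2 entry is the line starting with "0::"; strip that
    # prefix and print the remainder, which is the cgroup path.
    sed -n 's/^0:://p' "$1"
}

# Usage sketch (PID is a placeholder for the thread of interest):
#   cgroup_v2_path /proc/$PID/cgroup
```

A thread doing the backing-file I/O that reports "/" here is outside any
limit set on test.slice.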

A simple script to demonstrate this behavior on a cgroup v2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
    sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -d $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread when multiple cgroups are reading from or
writing to the same loop device. This patch series makes minor
modifications to the loop driver so that each cgroup can make forward
progress independently, avoiding this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
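To check this without comparing grep output by eye, the oom_kill counter
can be read before and after each step. This is a minimal sketch, not part
of the series; the function name oom_kill_count is hypothetical. It matches
the key exactly, so other keys in memory.events that merely contain
"oom_kill" are not picked up.

```shell
#!/bin/bash
# Hypothetical helper (not from the patch series): print the value of the
# oom_kill key from a cgroup v2 memory.events file.
oom_kill_count() {
    # memory.events holds one "<key> <value>" pair per line; compare the
    # first field exactly and print the second.
    awk '$1 == "oom_kill" { print $2 }' "$1"
}

# Usage sketch, reusing CGROUP from the demonstration script:
#   before=$(oom_kill_count "$CGROUP/memory.events")
#   ... run the loop device write ...
#   after=$(oom_kill_count "$CGROUP/memory.events")
#   [ "$after" -gt "$before" ] && echo "OOM kill recorded"
```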

Dan Schatzberg (4):
  loop: Use worker per cgroup instead of kworker
  mm: support nesting memalloc_use_memcg()
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c                 | 246 ++++++++++++++++++++++-----
 drivers/block/loop.h                 |  14 +-
 fs/buffer.c                          |   6 +-
 fs/notify/fanotify/fanotify.c        |   5 +-
 fs/notify/inotify/inotify_fsnotify.c |   5 +-
 include/linux/memcontrol.h           |   6 +
 include/linux/sched/mm.h             |  28 +--
 kernel/cgroup/cgroup.c               |   1 +
 mm/memcontrol.c                      |  11 +-
 mm/shmem.c                           |   4 +-
 10 files changed, 246 insertions(+), 80 deletions(-)

Comments

Dan Schatzberg April 21, 2020, 1:55 p.m. UTC | #1
On Tue, Apr 21, 2020 at 10:48:45AM +0800, Hillf Danton wrote:
> 
> On Mon, 20 Apr 2020 18:39:29 -0400 Dan Schatzberg wrote:
> > 
> > @@ -1140,8 +1215,17 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> >  	blk_mq_freeze_queue(lo->lo_queue);
> >  
> >  	spin_lock_irq(&lo->lo_lock);
> > +	destroy_workqueue(lo->workqueue);
> 
> Destruct it out of atomic context.

I may as well do this, but it doesn't matter, does it? The
blk_mq_freeze_queue above should drain all I/O so the workqueue will
be idle.

Dan Schatzberg April 21, 2020, 1:57 p.m. UTC | #2
On Tue, Apr 21, 2020 at 11:33:37AM +0800, Hillf Danton wrote:
> 
> On Mon, 20 Apr 2020 18:39:32 -0400 Dan Schatzberg wrote:
> > 
> > @@ -964,13 +960,16 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
> >  	worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
> >  	/*
> >  	 * In the event we cannot allocate a worker, just queue on the
> > -	 * rootcg worker
> > +	 * rootcg worker and issue the I/O as the rootcg
> >  	 */
> > -	if (!worker)
> > +	if (!worker) {
> > +		cmd->blkcg_css = NULL;
> > +		cmd->memcg_css = NULL;
> 
> Dunno if 	css_put(cmd->memcg_css);

Good catch. Need to drop the reference here.