Message ID: 20210402191638.3249835-1-schatzberg.dan@gmail.com
Series: Charge loop device i/o to issuing cgroup
Hi Hillf, thanks for the review.

On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> On Fri, 2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> > +queue_work:
> > +	if (worker) {
> > +		/*
> > +		 * We need to remove from the idle list here while
> > +		 * holding the lock so that the idle timer doesn't
> > +		 * free the worker
> > +		 */
> > +		if (!list_empty(&worker->idle_list))
> > +			list_del_init(&worker->idle_list);
>
> Nit, only queue work if the worker is inactive - otherwise it is taking
> care of the cmd_list.

By "worker is inactive", you mean the worker is on the idle_list? Yes,
I think you're right that queue_work() is unnecessary in that case,
since each worker checks for an empty cmd_list and then adds itself to
the idle_list under the lock.

> > +		work = &worker->work;
> > +		cmd_list = &worker->cmd_list;
> > +	} else {
> > +		work = &lo->rootcg_work;
> > +		cmd_list = &lo->rootcg_cmd_list;
> > +	}
> > +	list_add_tail(&cmd->list_entry, cmd_list);
> > +	queue_work(lo->workqueue, work);
> > +	spin_unlock_irq(&lo->lo_work_lock);
> >  }
> [...]
> > +	/*
> > +	 * We only add to the idle list if there are no pending cmds
> > +	 * *and* the worker will not run again which ensures that it
> > +	 * is safe to free any worker on the idle list
> > +	 */
> > +	if (worker && !work_pending(&worker->work)) {
>
> The empty cmd_list is a good enough reason for worker to become idle.

This is only true with the above change to avoid a gratuitous
queue_work(), right? Otherwise we run the risk of freeing a worker
concurrently with loop_process_work() being invoked.
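For reference, the locking protocol under discussion can be condensed
as follows. This is a minimal sketch based only on the hunks quoted
above, not the posted patch itself; last_ran_at and idle_worker_list
are assumed names for the idle-timer bookkeeping. Both sides take
lo->lo_work_lock, which is what makes the idle-timer interaction safe:

	/*
	 * Queue side (loop_queue_work): pull the worker off the idle
	 * list under the lock so the idle timer cannot free it, then
	 * hand it work.
	 */
	spin_lock_irq(&lo->lo_work_lock);
	if (!list_empty(&worker->idle_list))
		list_del_init(&worker->idle_list);
	list_add_tail(&cmd->list_entry, &worker->cmd_list);
	queue_work(lo->workqueue, &worker->work);
	spin_unlock_irq(&lo->lo_work_lock);

	/*
	 * Worker side (tail of loop_process_work): go idle only once
	 * cmd_list is empty *and* no further invocation is pending, so
	 * any worker found on the idle list will not run again and is
	 * safe to free.
	 */
	spin_lock_irq(&lo->lo_work_lock);
	if (list_empty(&worker->cmd_list) &&
	    !work_pending(&worker->work)) {
		worker->last_ran_at = jiffies;	/* assumed field */
		list_add_tail(&worker->idle_list, &lo->idle_worker_list);
	}
	spin_unlock_irq(&lo->lo_work_lock);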
On Wed, Apr 07, 2021 at 02:53:00PM +0800, Hillf Danton wrote:
> On Tue, 6 Apr 2021 Dan Schatzberg wrote:
> >On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> >> On Fri, 2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> >> > +queue_work:
> >> > +	if (worker) {
> >> > +		/*
> >> > +		 * We need to remove from the idle list here while
> >> > +		 * holding the lock so that the idle timer doesn't
> >> > +		 * free the worker
> >> > +		 */
> >> > +		if (!list_empty(&worker->idle_list))
> >> > +			list_del_init(&worker->idle_list);
> >>
> >> Nit, only queue work if the worker is inactive - otherwise it is taking
> >> care of the cmd_list.
> >
> >By worker is inactive, you mean worker is on the idle_list? Yes, I
> >think you're right that queue_work() is unnecessary in that case since
> >each worker checks empty cmd_list then adds itself to idle_list under
> >the lock.

A couple of other corner cases: when a worker is just allocated, it
needs a queue_work(), and rootcg always needs a queue_work() since it
never sits on the idle_list. It does add to the logic a bit rather
than just unconditionally invoking queue_work().

> >
> >>
> >> > +		work = &worker->work;
> >> > +		cmd_list = &worker->cmd_list;
> >> > +	} else {
> >> > +		work = &lo->rootcg_work;
> >> > +		cmd_list = &lo->rootcg_cmd_list;
> >> > +	}
> >> > +	list_add_tail(&cmd->list_entry, cmd_list);
> >> > +	queue_work(lo->workqueue, work);
> >> > +	spin_unlock_irq(&lo->lo_work_lock);
> >> >  }
> >> [...]
> >> > +	/*
> >> > +	 * We only add to the idle list if there are no pending cmds
> >> > +	 * *and* the worker will not run again which ensures that it
> >> > +	 * is safe to free any worker on the idle list
> >> > +	 */
> >> > +	if (worker && !work_pending(&worker->work)) {
> >>
> >> The empty cmd_list is a good enough reason for worker to become idle.
> >
> >This is only true with the above change to avoid a gratuitous
> >queue_work(), right?
>
> It is always true because of the empty cmd_list - the idle_list is the
> only place for the worker to go at this point.
>
> >Otherwise we run the risk of freeing a worker
> >concurrently with loop_process_work() being invoked.
>
> My suggestion is a minor optimization at most without any change to
> removing worker off the idle_list on queuing work - that cuts the risk
> for you.

If I just change this line from

	if (worker && !work_pending(&worker->work)) {

to

	if (worker) {

then the following sequence of events is possible:

1) loop_queue_work runs, adds a command to the worker list
2) loop_process_work runs, processes a single command and then drops
   the lock and reschedules
3) loop_queue_work runs again, acquires the lock, adds to the list and
   invokes queue_work() again
4) loop_process_work resumes, acquires the lock, processes work,
   notices the list is empty and adds itself to the idle_list
5) idle timer fires and frees the worker
6) loop_process_work runs again (because of the queue_work in 3) and
   accesses freed memory

The !work_pending() check prevents 4) from adding itself to the
idle_list, so this is not possible. I believe we can only make this
change if we also make the other change you suggested to avoid the
gratuitous queue_work().
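Putting the two changes together - skipping the gratuitous
queue_work() and relaxing the idle check to plain "if (worker)" - the
queue path might look roughly like the sketch below. This is an
illustration of the discussion, not the posted patch; worker_is_new is
a hypothetical flag marking a freshly allocated worker:

	spin_lock_irq(&lo->lo_work_lock);
	if (worker) {
		/*
		 * Kick the workqueue only if the worker is not already
		 * running, i.e. it was idle or just allocated; a
		 * running worker drains its cmd_list by itself.
		 */
		bool need_kick = worker_is_new ||
				 !list_empty(&worker->idle_list);

		list_del_init(&worker->idle_list); /* no-op if not idle */
		list_add_tail(&cmd->list_entry, &worker->cmd_list);
		if (need_kick)
			queue_work(lo->workqueue, &worker->work);
	} else {
		/* rootcg never sits on the idle list, so always kick */
		list_add_tail(&cmd->list_entry, &lo->rootcg_cmd_list);
		queue_work(lo->workqueue, &lo->rootcg_work);
	}
	spin_unlock_irq(&lo->lo_work_lock);

With no extra queue_work() issued in step 3 of the race above, a
worker that finds its cmd_list empty can park itself on the idle_list
without the work_pending() check.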
It looks like all feedback has been addressed and there hasn't been
any new activity on it in a while.

As per the suggestion last time [1], Andrew, Jens, could this go
through the -mm tree to deal with the memcg conflicts?

[1] https://lore.kernel.org/lkml/CALvZod6FMQQC17Zsu9xoKs=dFWaJdMC2Qk3YiDPUUQHx8teLYg@mail.gmail.com/

On Fri, Apr 02, 2021 at 12:16:31PM -0700, Dan Schatzberg wrote:
> No major changes, rebased on top of latest mm tree
>
> Changes since V12:
>
> * Small change to get_mem_cgroup_from_mm to avoid needing
>   get_active_memcg
>
> Changes since V11:
>
> * Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
>   can be driven by writeback, but this was causing a warning in xfs
>   and likely other filesystems aren't equipped to be driven by
>   reclaim at the VFS layer.
> * Included a small fix from Colin Ian King.
> * Reworked get_mem_cgroup_from_mm to institute the necessary charge
>   priority.
>
> Changes since V10:
>
> * Added page-cache charging to mm: Charge active memcg when no mm is set
>
> Changes since V9:
>
> * Rebased against linus's branch which now includes Roman Gushchin's
>   patch this series is based off of
>
> Changes since V8:
>
> * Rebased on top of Roman Gushchin's patch
>   (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
>   support for setting active memcg. Dropped the patch from this
>   series that did the same thing.
>
> Changes since V7:
>
> * Rebased against linus's branch
>
> Changes since V6:
>
> * Added separate spinlock for worker synchronization
> * Minor style changes
>
> Changes since V5:
>
> * Fixed a missing css_put when failing to allocate a worker
> * Minor style changes
>
> Changes since V4:
>
> Only patches 1 and 2 have changed.
>
> * Fixed irq lock ordering bug
> * Simplified loop detach
> * Added support for nesting memalloc_use_memcg
>
> Changes since V3:
>
> * Fix race on loop device destruction and deferred worker cleanup
> * Ensure charge on shmem_swapin_page works just like getpage
> * Minor style changes
>
> Changes since V2:
>
> * Deferred destruction of workqueue items so in the common case there
>   is no allocation needed
>
> Changes since V1:
>
> * Split out and reordered patches so cgroup charging changes are
>   separate from kworker -> workqueue change
> * Add mem_css to struct loop_cmd to simplify logic
>
> The loop device runs all i/o to the backing file on a separate
> kworker thread which results in all i/o being charged to the root
> cgroup. This allows a loop device to be used to trivially bypass
> resource limits and other policy. This patch series fixes this gap in
> accounting.
>
> A simple script to demonstrate this behavior on a cgroupv2 machine:
>
> '''
> #!/bin/bash
> set -e
>
> CGROUP=/sys/fs/cgroup/test.slice
> LOOP_DEV=/dev/loop0
>
> if [[ ! -d $CGROUP ]]
> then
>     sudo mkdir $CGROUP
> fi
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit to tmpfs -> OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit through loopback
> # device -> no OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
> losetup -D $LOOP_DEV" || true
>
> grep oom_kill $CGROUP/memory.events
> '''
>
> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device. This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
>
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.
>
> Dan Schatzberg (3):
>   loop: Use worker per cgroup instead of kworker
>   mm: Charge active memcg when no mm is set
>   loop: Charge i/o to mem and blk cg
>
>  drivers/block/loop.c       | 244 ++++++++++++++++++++++++++++++-------
>  drivers/block/loop.h       |  15 ++-
>  include/linux/memcontrol.h |   6 +
>  kernel/cgroup/cgroup.c     |   1 +
>  mm/filemap.c               |   2 +-
>  mm/memcontrol.c            |  49 +++++---
>  mm/shmem.c                 |   4 +-
>  7 files changed, 253 insertions(+), 68 deletions(-)
>
> --
> 2.30.2
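The core of the charging approach in the series is that each
per-cgroup worker performs its backing-file i/o while associated with
the issuing cgroups. A condensed sketch of that idea, assuming the
worker struct carries the issuing css pointers (the struct and field
names here are illustrative, not necessarily those in the patch):

	#include <linux/kthread.h>
	#include <linux/memcontrol.h>

	/*
	 * Illustrative worker state; the real per-cgroup worker in
	 * the patch carries more than this.
	 */
	struct loop_worker_sketch {
		struct cgroup_subsys_state *blkcg_css;	/* issuing blk cgroup */
		struct mem_cgroup *memcg;		/* issuing mem cgroup */
	};

	static void loop_process_work_sketch(struct loop_worker_sketch *worker)
	{
		struct mem_cgroup *old_memcg;

		/* Attribute block i/o from this kworker to the issuing blkcg */
		kthread_associate_blkcg(worker->blkcg_css);
		/* Charge memory allocated during the i/o to the issuing memcg */
		old_memcg = set_active_memcg(worker->memcg);

		/*
		 * ... pop cmds off the worker's cmd_list and perform
		 * the backing-file reads/writes here ...
		 */

		set_active_memcg(old_memcg);
		kthread_associate_blkcg(NULL);
	}

This keeps the kworker threads themselves shared while making each
command's memory charges and block i/o land in the cgroup that issued
it, which is what lets the demo script above trigger the expected OOM
kill.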
On 4/12/21 9:45 AM, Johannes Weiner wrote:
> It looks like all feedback has been addressed and there hasn't been
> any new activity on it in a while.
>
> As per the suggestion last time [1], Andrew, Jens, could this go
> through the -mm tree to deal with the memcg conflicts?

Yep, I think that would make it the most painless for everyone.
Dan/Andrew, you can add my Acked-by to the series.