
[v2,0/3] blkcg: sync() isolation

Message ID 20190307180834.22008-1-andrea.righi@canonical.com

Andrea Righi March 7, 2019, 6:08 p.m. UTC
= Problem =

When sync() is executed from a high-priority cgroup, the process is forced to
wait for the completion of all the outstanding writeback I/O, including I/O
that was originally generated by low-priority cgroups.

This may cause massive latencies for unrelated processes (even those running in
the root cgroup) that shouldn't be I/O-throttled at all, similar to a classic
priority inversion problem.

This topic has been previously discussed here:
https://patchwork.kernel.org/patch/10804489/

[ Thanks to Josef for the suggestions ]

= Solution =

Here's a slightly more detailed description of the solution, as suggested by
Josef and Tejun (let me know if I misunderstood or missed anything):

 - track the submitter of writeback work (when sync() is issued) and the cgroup
   that originally dirtied each inode, then use this information to determine
   the proper "sync() domain" and decide whether the I/O speed needs to be
   boosted in order to prevent priority-inversion problems

 - by default when sync() is issued, all the outstanding writeback I/O is
   boosted to maximum speed to prevent priority inversion problems

 - if sync() is issued by the same throttled cgroup that generated the dirty
   pages, the corresponding writeback I/O is still throttled normally

 - add a new per-cgroup flag (io.sync_isolation) that makes sync()'ers in that
   cgroup write out only the dirty pages that belong to their own cgroup (a
   sketch of this decision logic follows the list)
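
For clarity, here is a stand-alone, user-space C model of the decision logic
above. It is only a sketch: all names below (cgroup_model, sync_wb_action, the
WB_* values) are invented for illustration and do not correspond to identifiers
used in the actual patches.

/*
 * Stand-alone user-space model of the writeback policy described above.
 * All identifiers are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct cgroup_model {
	const char *name;
	bool sync_isolation;		/* models the io.sync_isolation flag */
};

enum wb_action {
	WB_BOOST,	/* write back at full speed */
	WB_THROTTLE,	/* write back at the cgroup's configured limit */
	WB_SKIP,	/* leave the inode to its owning cgroup */
};

/*
 * Decide how sync() writes back an inode, given the cgroup that issued
 * sync() and the cgroup that originally dirtied the inode:
 *  - same cgroup: keep the normal throttling
 *  - different cgroup, isolation enabled: skip foreign dirty pages
 *  - different cgroup, isolation disabled: boost to avoid priority inversion
 */
static enum wb_action sync_wb_action(const struct cgroup_model *syncer,
				     const struct cgroup_model *dirtier)
{
	if (syncer == dirtier)
		return WB_THROTTLE;
	if (syncer->sync_isolation)
		return WB_SKIP;
	return WB_BOOST;
}

int main(void)
{
	struct cgroup_model cg1 = { .name = "cg1", .sync_isolation = false };
	struct cgroup_model cg2 = { .name = "cg2", .sync_isolation = false };

	/* cg2 syncs pages dirtied by the throttled cg1: boost (default) */
	printf("%d\n", sync_wb_action(&cg2, &cg1));	/* WB_BOOST */

	/* with io.sync_isolation set on cg2, foreign pages are skipped */
	cg2.sync_isolation = true;
	printf("%d\n", sync_wb_action(&cg2, &cg1));	/* WB_SKIP */

	/* cg1 syncing its own dirty pages stays throttled */
	printf("%d\n", sync_wb_action(&cg1, &cg1));	/* WB_THROTTLE */

	return 0;
}

The three return values map to the last three bullets: boost by default, keep
the normal throttling when the sync()'er also owns the dirty pages, and skip
foreign inodes when io.sync_isolation is enabled.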

= Test =

Here's a trivial example to trigger the problem:

 - create 2 cgroups: cg1 and cg2

 # mkdir /sys/fs/cgroup/unified/cg1
 # mkdir /sys/fs/cgroup/unified/cg2

 - set an I/O limit of 1MB/s on cg1/io.max:

 # echo "8:0 rbps=1048576 wbps=1048576" > /sys/fs/cgroup/unified/cg1/io.max

 - run a write-intensive workload in cg1

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30

 - run sync in cg2 and measure time

== Vanilla kernel ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real	9m32,618s
 user	0m0,000s
 sys	0m0,018s

Ideally "sync" should complete almost immediately, because cg2 is unlimited and
it's not doing any I/O at all. Instead, the entire system is totally sluggish,
waiting for the throttled writeback I/O to complete, and it also triggers many
hung task timeout warnings.

== With this patch set applied and io.sync_isolation=0 (default) ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real	0m2,044s
 user	0m0,009s
 sys	0m0,000s

[ Across multiple runs the measured time ranges from 2s to 4s ]

== With this patch set applied and io.sync_isolation=1 ==
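
 [ The cover letter does not show the step that enables the flag; presumably it
   is turned on for cg2 (the cgroup issuing sync()) with something like the
   following, by analogy with the io.max setup above: ]

 # echo 1 > /sys/fs/cgroup/unified/cg2/io.sync_isolation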

 # cat /proc/self/cgroup
 0::/cg2

 # time sync

 real	0m0,768s
 user	0m0,001s
 sys	0m0,008s

[ Across multiple runs the measured time ranges from 0.7s to 1.6s ]

Changes in v2:
 - fix: properly keep track of sync waiters when a blkcg is writing to
   many block devices at the same time

Andrea Righi (3):
  blkcg: prevent priority inversion problem during sync()
  blkcg: introduce io.sync_isolation
  blkcg: implement sync() isolation

 Documentation/admin-guide/cgroup-v2.rst |   9 +++
 block/blk-cgroup.c                      | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-throttle.c                    |  48 ++++++++++++++-
 fs/fs-writeback.c                       |  57 +++++++++++++++++-
 fs/inode.c                              |   1 +
 fs/sync.c                               |   8 ++-
 include/linux/backing-dev-defs.h        |   2 +
 include/linux/blk-cgroup.h              |  52 +++++++++++++++++
 include/linux/fs.h                      |   4 ++
 mm/backing-dev.c                        |   2 +
 mm/page-writeback.c                     |   1 +
 11 files changed, 355 insertions(+), 7 deletions(-)

Comments

Josef Bacik March 8, 2019, 5:22 p.m. UTC | #1
On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> = Problem =
> 
> When sync() is executed from a high-priority cgroup, the process is forced to
> wait the completion of the entire outstanding writeback I/O, even the I/O that
> was originally generated by low-priority cgroups potentially.
> 
> This may cause massive latencies to random processes (even those running in the
> root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> priority inversion problem.
> 
> This topic has been previously discussed here:
> https://patchwork.kernel.org/patch/10804489/
> 

Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
this some more offline.

We don't want cgroup to become the arbiter of correctness/behavior here.  We
just want it to be isolating things.

For you that means you can drop the per-cgroup flag stuff, and only do the
priority boosting for multiple sync(2) waiters.  That is a real priority
inversion that needs to be fixed.  io.latency and io.max are capable of noticing
that a low priority group is going above their configured limits and putting
pressure elsewhere accordingly.

Tejun said he'd rather see the sync(2) isolation be done at the namespace level.
That way if you have fs namespacing you are already isolated to your namespace.
If you feel like tackling that then hooray, but that's a separate dragon to slay
so don't feel like you have to right now.

This way we keep cgroup doing its job, controlling resources.  Then we allow
namespacing to do its thing, isolating resources.  Thanks,

Josef
Andrea Righi March 8, 2019, 5:32 p.m. UTC | #2
On Fri, Mar 08, 2019 at 12:22:20PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> > = Problem =
> > 
> > When sync() is executed from a high-priority cgroup, the process is forced to
> > wait the completion of the entire outstanding writeback I/O, even the I/O that
> > was originally generated by low-priority cgroups potentially.
> > 
> > This may cause massive latencies to random processes (even those running in the
> > root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> > priority inversion problem.
> > 
> > This topic has been previously discussed here:
> > https://patchwork.kernel.org/patch/10804489/
> > 
> 
> Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
> this some more offline.
> 
> We don't want cgroup to become the arbiter of correctness/behavior here.  We
> just want it to be isolating things.
> 
> For you that means you can drop the per-cgroup flag stuff, and only do the
> priority boosting for multiple sync(2) waiters.  That is a real priority
> inversion that needs to be fixed.  io.latency and io.max are capable of noticing
> that a low priority group is going above their configured limits and putting
> pressure elsewhere accordingly.

Alright, so IIUC that means we just need patch 1/3 for now (with the
per-bdi lock instead of the global lock). If that's the case I'll focus
on that patch then.

> 
> Tejun said he'd rather see the sync(2) isolation be done at the namespace level.
> That way if you have fs namespacing you are already isolated to your namespace.
> If you feel like tackling that then hooray, but that's a separate dragon to slay
> so don't feel like you have to right now.

Makes sense. I can take a look and see what I can do after posting the
new patch with the priority inversion fix only.

> 
> This way we keep cgroup doing its job, controlling resources.  Then we allow
> namespacing to do its thing, isolating resources.  Thanks,
> 
> Josef

Looks like a good plan to me. Thanks for the update.

-Andrea