Message ID | 871staffus.fsf@notabene.neil.brown.name (mailing list archive)
---|---
State | New, archived
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
But if you actually care about performance in any way I'd suggest
to use the loop device in direct I/O mode..
On Mon 03-04-17 11:18:51, NeilBrown wrote:
>
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file. This double-throttling can trigger
> positive feedback loops that create significant delays. The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
>
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server. It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
>
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
> time taken.
>
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable. 52-460 seconds. Half below 100 seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
>
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.

Yes, this makes sense to me.

> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below.

> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>  	struct loop_cmd *cmd =
>  		container_of(work, struct loop_cmd, work);
> +	int oldflags = current->flags & PF_LESS_THROTTLE;
>
> +	current->flags |= PF_LESS_THROTTLE;
>  	loop_handle_cmd(cmd);
> +	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;

We have a helper for this, tsk_restore_flags(). It is not used consistently,
and maybe we want a dedicated API like the one we have for the scoped
NOIO/NOFS, but that is a separate thing. I would find tsk_restore_flags()
easier to read.
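For reference, a minimal sketch of how the hunk might look with that helper,
assuming the tsk_restore_flags() definition from include/linux/sched.h of
this era (it clears the masked bits and then restores them from the saved
value); this is only an illustration of the suggestion, not part of the
posted patch:

static void loop_queue_work(struct kthread_work *work)
{
	struct loop_cmd *cmd =
		container_of(work, struct loop_cmd, work);
	unsigned int oldflags = current->flags & PF_LESS_THROTTLE;

	current->flags |= PF_LESS_THROTTLE;
	loop_handle_cmd(cmd);
	/* clear PF_LESS_THROTTLE, then put back whatever was set before */
	tsk_restore_flags(current, oldflags, PF_LESS_THROTTLE);
}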
On Mon, Apr 3, 2017 at 9:18 AM, NeilBrown <neilb@suse.com> wrote:
>
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file. This double-throttling can trigger
> positive feedback loops that create significant delays. The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
>
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server. It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
>
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
> time taken.
>
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable. 52-460 seconds. Half below 100 seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
>
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>  	struct loop_cmd *cmd =
>  		container_of(work, struct loop_cmd, work);
> +	int oldflags = current->flags & PF_LESS_THROTTLE;
>
> +	current->flags |= PF_LESS_THROTTLE;
>  	loop_handle_cmd(cmd);
> +	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
> }

You can do it against 'lo->worker_task' instead of doing it in each
loop_queue_work(), and the flag needn't be restored because the kernel
thread is dedicated to loop.

Thanks,
Ming Lei
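A sketch of what that alternative might look like: mark the dedicated worker
once when it is created rather than toggling the flag per command. This
assumes the worker is set up in loop_prepare_queue(), as in the loop.c of
this period; the exact hook may differ in other trees:

static int loop_prepare_queue(struct loop_device *lo)
{
	kthread_init_worker(&lo->worker);
	lo->worker_task = kthread_run(kthread_worker_fn,
			&lo->worker, "loop%d", lo->lo_number);
	if (IS_ERR(lo->worker_task))
		return -ENOMEM;
	set_user_nice(lo->worker_task, MIN_NICE);
	/*
	 * The kthread only ever services loop requests, so the flag
	 * never needs to be cleared or restored.
	 */
	lo->worker_task->flags |= PF_LESS_THROTTLE;
	return 0;
}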
On Tue, Apr 04 2017, Christoph Hellwig wrote:

> Looks fine,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> But if you actually care about performance in any way I'd suggest
> to use the loop device in direct I/O mode..

The losetup on my test VM is too old to support that :-(
I guess it might be time to upgrade.

It seems that there is no "mount -o direct_loop" or similar, so you
have to do the losetup and the mount separately. Any thoughts on
whether that should be changed?

Thanks,
NeilBrown
On Tue, Apr 04 2017, Ming Lei wrote:

> On Mon, Apr 3, 2017 at 9:18 AM, NeilBrown <neilb@suse.com> wrote:
>>
>> When a filesystem is mounted from a loop device, writes are
>> throttled by balance_dirty_pages() twice: once when writing
>> to the filesystem and once when the loop_handle_cmd() writes
>> to the backing file. This double-throttling can trigger
>> positive feedback loops that create significant delays. The
>> throttling at the lower level is seen by the upper level as
>> a slow device, so it throttles extra hard.
>>
>> The PF_LESS_THROTTLE flag was created to handle exactly this
>> circumstance, though with an NFS filesystem mounted from a
>> local NFS server. It reduces the throttling on the lower
>> layer so that it can proceed largely unthrottled.
>>
>> To demonstrate this, create a filesystem on a loop device
>> and write (e.g. with dd) several large files which combine
>> to consume significantly more than the limit set by
>> /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
>> time taken.
>>
>> When I do this directly on a device (no loop device) the
>> total time for several runs (mkfs, mount, write 200 files,
>> umount) is fairly stable: 28-35 seconds.
>> When I do this over a loop device the times are much worse
>> and less stable. 52-460 seconds. Half below 100 seconds,
>> half above.
>> When I apply this patch, the times become stable again,
>> though not as fast as the no-loop-back case: 53-72 seconds.
>>
>> There may be room for further improvement as the total overhead still
>> seems too high, but this is a big improvement.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/block/loop.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>> index 0ecb6461ed81..a7e1dd215fc2 100644
>> --- a/drivers/block/loop.c
>> +++ b/drivers/block/loop.c
>> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>>  {
>>  	struct loop_cmd *cmd =
>>  		container_of(work, struct loop_cmd, work);
>> +	int oldflags = current->flags & PF_LESS_THROTTLE;
>>
>> +	current->flags |= PF_LESS_THROTTLE;
>>  	loop_handle_cmd(cmd);
>> +	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
>> }
>
> You can do it against 'lo->worker_task' instead of doing it in each
> loop_queue_work(), and the flag needn't be restored because the kernel
> thread is dedicated to loop.

Good point. I'll do that.

Thanks,
NeilBrown
On Wed, Apr 5, 2017 at 12:27 PM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Apr 04 2017, Christoph Hellwig wrote:
>
>> Looks fine,
>>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>
>> But if you actually care about performance in any way I'd suggest
>> to use the loop device in direct I/O mode..
>
> The losetup on my test VM is too old to support that :-(
> I guess it might be time to upgrade.
>
> It seems that there is no "mount -o direct_loop" or similar, so you
> have to do the losetup and the mount separately. Any thoughts on

I guess 'direct_loop' could be added to 'mount' directly, but I am not
familiar with the mount utility.

> whether that should be changed?

There was a sysfs interface for controlling direct I/O in the initial
submission, but it looks like it was reviewed out. :-)

Thanks,
Ming Lei
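For readers following along: the direct I/O mode Christoph mentions can be
toggled on an already-bound loop device with the LOOP_SET_DIRECT_IO ioctl on
kernels that support the loop dio path (newer losetup versions expose this as
a --direct-io option). A minimal userspace sketch, with error handling
reduced to the essentials and the device path left to the caller:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/loopN\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 1 = enable direct I/O against the backing file, 0 = disable */
	if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0) {
		perror("LOOP_SET_DIRECT_IO");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}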
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..a7e1dd215fc2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
 {
 	struct loop_cmd *cmd =
 		container_of(work, struct loop_cmd, work);
+	int oldflags = current->flags & PF_LESS_THROTTLE;
 
+	current->flags |= PF_LESS_THROTTLE;
 	loop_handle_cmd(cmd);
+	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
 }
 
 static int loop_init_request(void *data, struct request *rq,
When a filesystem is mounted from a loop device, writes are throttled
by balance_dirty_pages() twice: once when writing to the filesystem and
once when the loop_handle_cmd() writes to the backing file. This
double-throttling can trigger positive feedback loops that create
significant delays. The throttling at the lower level is seen by the
upper level as a slow device, so it throttles extra hard.

The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a local NFS
server. It reduces the throttling on the lower layer so that it can
proceed largely unthrottled.

To demonstrate this, create a filesystem on a loop device and write
(e.g. with dd) several large files which combine to consume
significantly more than the limit set by /proc/sys/vm/dirty_ratio or
dirty_bytes. Measure the total time taken.

When I do this directly on a device (no loop device) the total time for
several runs (mkfs, mount, write 200 files, umount) is fairly stable:
28-35 seconds.
When I do this over a loop device the times are much worse and less
stable. 52-460 seconds. Half below 100 seconds, half above.
When I apply this patch, the times become stable again, though not as
fast as the no-loop-back case: 53-72 seconds.

There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/loop.c | 3 +++
 1 file changed, 3 insertions(+)
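To make the effect described above concrete, here is a deliberately
simplified, self-contained illustration. It is not the kernel's
mm/page-writeback.c code and the real limit calculation is more involved;
the point is only that a task carrying a less-throttle style flag gets
extra headroom above the normal dirty threshold, so the lower layer of a
loop (or loopback-NFS) stack keeps draining while ordinary writers are
still held at the normal limit:

/*
 * Simplified illustration only -- not the actual kernel code.
 * A "less throttle" task is given ~25% extra headroom over the normal
 * dirty limit, so it is throttled later than ordinary writers.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long task_dirty_limit(unsigned long limit_pages,
				      bool less_throttle)
{
	return less_throttle ? limit_pages + limit_pages / 4 : limit_pages;
}

int main(void)
{
	unsigned long limit = 100000;	/* pages allowed to be dirty (example) */

	printf("ordinary writer limit:       %lu pages\n",
	       task_dirty_limit(limit, false));
	printf("less-throttled writer limit: %lu pages\n",
	       task_dirty_limit(limit, true));
	return 0;
}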