Message ID | CA+55aFz2f8969nQAR6=TUTQOYA7T9vmcYJKGCyejYWYZ9a5S_w@mail.gmail.com (mailing list archive)
---|---
State | New, archived
On 09/11/2015 04:37 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> How about we instead:
>>
>>  (a) revert that commit d353d7587 as broken (because it clearly is)
>>
>>  (b) add a big honking comment about the fact that we hold 'list_lock'
>>      in writeback_sb_inodes()
>>
>>  (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>>      _outside_ the spinlock.
>
> Ok, I've done (a) and (b) in my tree. And attached is the totally
> untested patch for (c). It looks ObviouslyCorrect(tm), but since this
> is a performance issue, I'm not going to commit it without some more
> ACKs from people.
>
> I obviously think this is a *much* better approach than dropping and
> retaking the lock, but there might be something silly I'm missing.
>
> For example, maybe we want to unplug and replug around the
> "inode_sleep_on_writeback()" in wb_writeback()? So while the revert
> was a no-brainer, this one I really want people to think about.

So we talked about this when we were trying to figure out a solution.
The problem with this approach is that we now have a plug that covers
multiple super blocks (__writeback_inodes_wb() loops through the sb's
and starts writeback), which is likely to give us crappier performance
than no plug at all.

Thanks,

Josef
On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So we talked about this when we were trying to figure out a solution.
> The problem with this approach is that we now have a plug that covers
> multiple super blocks (__writeback_inodes_wb() loops through the sb's
> and starts writeback), which is likely to give us crappier performance
> than no plug at all.

Why would that be?

Either they are on separate disks, and the IO is all independent
anyway, and at most it got started by some (small) CPU-amount later.
Actual throughput should be the same. No?

Or the filesystems are on the same disk, in which case it should
presumably be a win to submit the IO together.

Of course, actual numbers would be the deciding factor if this really
is noticeable. But "cleaner code and saner locking" is definitely an
issue at least for me.

                 Linus
On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Anyway, I'll hold off pushing out the revert too for a while, in the
hope that we'll have actual numbers for or against whatever the
particular solution should be.

I do get the feeling that the whole wb->list_lock needs more loving.
For example, there's that locked_inode_to_wb_and_lock_list() thing
(which might be better off trying to just do "trylock" on it, but
that's a separate issue) showing lock inversion worries.

Maybe we should *not* get that wb->list_lock early at all, and nest it
inside the inode spinlocks, and just move the list_lock locking down a
lot (ie not even try to hold it over big functions that are then
forced to release it anyway).

For example, realistically, it looks like the longest we ever *really*
hold that lock is at the top of the loop of writeback_sb_inodes() -
maybe we could just explicitly have a function that does "find the
first inode that matches this sb and needs writeout activity", and
literally only hold the lock over that function. And *not* take the
lock in the caller at all?

The callers seem to want that lock mainly because they do that
"list_empty(&wb->b_io)" test. But the way our doubly linked lists
work, "list_empty()" is actually something that can be done racily
without the lock held...

That said, I doubt anybody really wants to touch this, so at least for
now we're stuck with either the "plug outside the lock" or the "drop
and retake lock" options. It really would be lovely to have numbers
either way.

<insert Princess Leia gif> "Dave, you're our only hope"

                 Linus
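[For context, the "plug outside the lock" shape being discussed would look roughly like the sketch below - simplified from the 4.2-era fs/fs-writeback.c, not the actual posted patch. The plug is taken before wb->list_lock and released only after the lock is dropped, so any unplug (and the scheduling it may do) happens outside the spinlock.]

    /* Sketch only: not the real patch, internals trimmed to the relevant calls. */
    static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
                                    enum wb_reason reason)
    {
            struct wb_writeback_work work = {
                    .nr_pages       = nr_pages,
                    .sync_mode      = WB_SYNC_NONE,
                    .range_cyclic   = 1,
                    .reason         = reason,
            };
            struct blk_plug plug;

            blk_start_plug(&plug);          /* plug before taking the lock */
            spin_lock(&wb->list_lock);
            if (list_empty(&wb->b_io))
                    queue_io(wb, &work);
            __writeback_inodes_wb(wb, &work);
            spin_unlock(&wb->list_lock);
            blk_finish_plug(&plug);         /* unplug only after dropping it */

            return nr_pages - work.nr_pages;
    }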
On Fri, Sep 11, 2015 at 02:04:13PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
> >
> > So we talked about this when we were trying to figure out a solution.
> > The problem with this approach is that we now have a plug that covers
> > multiple super blocks (__writeback_inodes_wb() loops through the sb's
> > and starts writeback), which is likely to give us crappier performance
> > than no plug at all.
>
> Why would that be? Either they are on separate disks, and the IO is
> all independent anyway, and at most it got started by some (small)
> CPU-amount later. Actual throughput should be the same. No?
>
> Or the filesystems are on the same disk, in which case it should
> presumably be a win to submit the IO together.
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Originally I was worried about the latency impact of holding the plugs
over more than one super with high end flash. I just didn't want to
hold onto the IO for longer than we had to.

But, since this isn't really latency sensitive anyway, if we find we're
not keeping the flash pipelines full the right answer is to short
circuit the plugging in general. I'd agree actual throughput should be
the same.

But benchmarking is the best choice, and I'll be able to reproduce
Dave's original results without too much trouble.

Our thinking for this duct-tape patch was that the lock wasn't very hot
and this was the best immediate compromise between the bug and the perf
improvement. Happy to sign up for pushing the lock higher if that's
what people would rather see.

-chris
On Fri, Sep 11, 2015 at 4:06 PM, Chris Mason <clm@fb.com> wrote:
>
> Originally I was worried about the latency impact of holding the plugs
> over more than one super with high end flash. I just didn't want to
> hold onto the IO for longer than we had to.
>
> But, since this isn't really latency sensitive anyway, if we find we're
> not keeping the flash pipelines full the right answer is to short
> circuit the plugging in general. I'd agree actual throughput should be
> the same.

Yeah, this only triggers for system-wide writeback, so I don't see
that it should be latency-sensitive at that level, afaik.

But hey, it's IO, and I've been surprised by magic pattern
sensitivities before...

                 Linus
On Fri, Sep 11, 2015 at 03:06:18PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Of course, actual numbers would be the deciding factor if this really
> > is noticeable. But "cleaner code and saner locking" is definitely an
> > issue at least for me.
>
> Anyway, I'll hold off pushing out the revert too for a while, in the
> hope that we'll have actual numbers for or against whatever the
> particular solution should be.
>
> I do get the feeling that the whole wb->list_lock needs more loving.
> For example, there's that locked_inode_to_wb_and_lock_list() thing
> (which might be better off trying to just do "trylock" on it, but
> that's a separate issue) showing lock inversion worries.

At the very least, it's kind of sad how many of us were surprised to
find the lock held when Dave's patch was unplugging. If this bug
slipped through, more are going to.

It's also true that N-1 of those people were really surprised about
unplug functions scheduling, so maybe we can't put all the blame on
wb->list_lock.

> Maybe we should *not* get that wb->list_lock early at all, and nest it
> inside the inode spinlocks, and just move the list_lock locking down a
> lot (ie not even try to hold it over big functions that are then
> forced to release it anyway).
>
> For example, realistically, it looks like the longest we ever *really*
> hold that lock is at the top of the loop of writeback_sb_inodes() -
> maybe we could just explicitly have a function that does "find the
> first inode that matches this sb and needs writeout activity", and
> literally only hold the lock over that function. And *not* take the
> lock in the caller at all?
>
> The callers seem to want that lock mainly because they do that
> "list_empty(&wb->b_io)" test. But the way our doubly linked lists
> work, "list_empty()" is actually something that can be done racily
> without the lock held...
>
> That said, I doubt anybody really wants to touch this, so at least for
> now we're stuck with either the "plug outside the lock" or the "drop
> and retake lock" options. It really would be lovely to have numbers
> either way.

For 4.3 timeframes, what runs do you want to see numbers for:

1) revert
2) my hack
3) plug over multiple sbs (on different devices)
4) ?

-chris
On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
>
> For 4.3 timeframes, what runs do you want to see numbers for:
>
> 1) revert
> 2) my hack
> 3) plug over multiple sbs (on different devices)
> 4) ?

Just 2 or 3.

I don't think the plain revert is all that interesting, and I think
the "anything else" is far too late for this merge window.

So we'll go with either (2) your patch (which I obviously don't
_like_, but apart from the ugliness I don't think there's anything
technically wrong with it), or with (3) the "plug across a bigger
area".

So the only issue with (3) is whether that's just "revert plus the
patch I sent out", or whether we should unplug/replug over the "wait
synchronously for an inode" case (iow, the "inode_sleep_on_writeback()").

The existing plug code (that has the spinlock issue) already has a
"wait on inode" case, and did *not* unplug over that call, but
broadening the plugging further now ends up having two of those "wait
synchronously on inode" cases.

Are we really ok with waiting synchronously for an inode while holding
the plug? No chance of deadlock (waiting for IO that we've plugged)?
That issue is true even of the current code, though, and I have _not_
really thought that through, it's just a worry.

                 Linus
On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are we really ok with waiting synchronously for an inode while holding
> the plug? No chance of deadlock (waiting for IO that we've plugged)?
> That issue is true even of the current code, though, and I have _not_
> really thought that through, it's just a worry.

Never mind. We still flush the plug on explicit scheduling events. I
wonder why I thought we got rid of that. Some kind of "senior moment",
I guess.

                 Linus
On Fri, Sep 11, 2015 at 05:52:27PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Are we really ok with waiting synchronously for an inode while holding
> > the plug? No chance of deadlock (waiting for IO that we've plugged)?
> > That issue is true even of the current code, though, and I have _not_
> > really thought that through, it's just a worry.
>
> Never mind. We still flush the plug on explicit scheduling events. I
> wonder why I thought we got rid of that. Some kind of "senior moment",
> I guess.

But flushing on schedule is a little different. It ends up calling
blk_schedule_flush_plug(), which will hand off the work to kblockd
through blk_run_queue_async().

Not a huge deal, but if we're scheduling to wait for that IO, we should
really run the plug ourselves so that we're not waiting for kblockd too.

-chris
On Fri, Sep 11, 2015 at 7:15 PM, Chris Mason <clm@fb.com> wrote:
>
> But flushing on schedule is a little different. It ends up calling
> blk_schedule_flush_plug(), which will hand off the work to kblockd
> through blk_run_queue_async().

I was more worried about some actual deadlock from the changes.

And blk_schedule_flush_plug() is fine in that it doesn't actually
remove the plug, it just schedules the currently plugged pending IO.
So the IO will start from waiting on the inode, but the plug will
still remain for the rest of the writeback, and it all looks like it
should be fine.

And the reason we use kblockd is simple: stack usage. The reschedule
can happen pretty deep on the stack, and we don't actually want to then
cause much more stack use through things like md/raid allocating new
requests etc.

So it all looks fine to me.

Btw, very tangentially related: grepping a bit shows that
"blk_flush_plug()" isn't actually used anywhere any more. Can we get
rid of that?

                 Linus
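[For reference, the two unplug paths being contrasted here look roughly like this in the 4.2/4.3-era block layer - a simplified sketch of block/blk-core.c and include/linux/blkdev.h, trimmed to the relevant calls:]

    /* Explicit unplug: the plugging task submits its pending IO itself. */
    void blk_finish_plug(struct blk_plug *plug)
    {
            if (plug != current->plug)
                    return;
            blk_flush_plug_list(plug, false);  /* from_schedule == false:
                                                * queues are run synchronously */
            current->plug = NULL;
    }

    /*
     * Implicit unplug when the plugging task blocks: the scheduler's
     * submit-work path calls this, and from_schedule == true means the
     * queues are kicked asynchronously (blk_run_queue_async()/kblockd)
     * rather than run directly from a possibly deep schedule() chain.
     */
    static inline void blk_schedule_flush_plug(struct task_struct *tsk)
    {
            struct blk_plug *plug = tsk->plug;

            if (plug)
                    blk_flush_plug_list(plug, true);
    }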
On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> >
> > For 4.3 timeframes, what runs do you want to see numbers for:
> >
> > 1) revert
> > 2) my hack
> > 3) plug over multiple sbs (on different devices)
> > 4) ?
>
> Just 2 or 3.
>
> I don't think the plain revert is all that interesting, and I think
> the "anything else" is far too late for this merge window.

I did the plain revert as well, just to have a baseline. This box is a
little different from Dave's: bare metal, two-socket (E5-2660 v2 @
2.20GHz) with 144GB of RAM. I have two PCIe flash devices, one NVMe and
one Fusion-io, and I put one FS on each device (two mounts total).

The test created 1.6M files, 4K each. I used Dave's fs_mark command
line, spread out over 16 directories from each mounted filesystem. In
btrfs they are spread over subvolumes to cut down lock contention.

I need to change around the dirty ratios more to smooth out the IO, and
I had trouble with both XFS and btrfs getting runs that were not CPU
bound. I included the time to run sync at the end of the run because
the results were not very consistent without it.

The XFS runs generally had one CPU pegged at 100%, and I think this is
throwing off the results. On Monday I'll redo them with two (four?)
filesystems per flash device; hopefully that'll break things up.

The btrfs runs generally had all the CPUs pegged at 100%. I switched to
mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
CPU utilization.

                    wall time   fs_mark files/sec   bytes written/sec

XFS:
  baseline v4.2:    5m6s        118,578             272MB/s
  Dave's patch:     4m46s       151,421             294MB/s
  my hack:          5m5s        150,714             275MB/s
  Linus plug:       5m15s       147,735             266MB/s

Btrfs (nodatasum):
  baseline v4.2:    4m39s       242,643             313MB/s
  Dave's patch:     3m46s       252,452             389MB/s
  my hack:          3m48s       257,924             379MB/s
  Linus plug:       3m58s       247,528             369MB/s

Bottom line, not as conclusive as I'd like. My hack doesn't seem to
hurt, but FS internals are consuming enough CPU that this lock just
isn't showing up.

Linus' plug patch is consistently slower, and I don't have a great
explanation. My guesses: not keeping the flash pipelines full, or the
imbalance between the different-speed flash averaging the overall
result down, or it's my kblockd vs explicit unplug handwaving from
yesterday.

So, next step is either more runs on flash or grab a box with a bunch
of spindles. I'd rather do the spindle runs; I agree with Dave that his
patch should help much more on actual drives.

-chris
On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
>
> I did the plain revert as well, just to have a baseline.

Ahh, I ended up not expecting you to get this done until after rc1 was
out, so in the meantime I just merged my fix instead rather than leave
the expected scheduling-while-atomic problem.

And just as well that you did a baseline, since apparently the numbers
are all over the map. I don't see how your hack and Dave's original
can _possibly_ differ that much, but they clearly did on your xfs
test. So there's probably huge variance that depends on random
details.

I'll leave things as they are until we have something that looks a bit
more believable ;)

                 Linus
On Sat, Sep 12, 2015 at 04:29:06PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
> >
> > I did the plain revert as well, just to have a baseline.
>
> Ahh, I ended up not expecting you to get this done until after rc1 was
> out, so in the meantime I just merged my fix instead rather than leave
> the expected scheduling-while-atomic problem.

Yeah, I wasn't sure I'd be able to do the runs, but it was a rainy
afternoon and this was more fun than cleaning. Really glad something
got in for rc1 either way.

> And just as well that you did a baseline, since apparently the numbers
> are all over the map. I don't see how your hack and Dave's original
> can _possibly_ differ that much, but they clearly did on your xfs
> test. So there's probably huge variance that depends on random
> details.

I don't think the XFS numbers can be trusted too much since it was
basically bottlenecked behind that single pegged CPU. It was bouncing
around and I couldn't quite track it down to a process name (or perf
profile).

The btrfs numbers were much more consistent, but your patch is still a
win over plain 4.2.

> I'll leave things as they are until we have something that looks a bit
> more believable ;)

We can build from here, thanks Linus.

-chris
On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> I don't think the XFS numbers can be trusted too much since it was
> basically bottlenecked behind that single pegged CPU. It was bouncing
> around and I couldn't quite track it down to a process name (or perf
> profile).

I'll do more runs Monday, but I was able to grab a perf profile of the
pegged XFS CPU. It was just the writeback worker thread, and it hit
btrfs differently because we defer more of this stuff to endio workers,
effectively spreading it out over more CPUs.

With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
Here's one of the profiles, but it bounced around a lot so I wouldn't
use this to actually tune anything:

  11.42%  kworker/u82:61  [kernel.kallsyms]  [k] _raw_spin_lock
          |
          ---_raw_spin_lock
             |
             |--83.43%-- xfs_extent_busy_trim
             |           xfs_alloc_compute_aligned
             |           |
             |           |--99.92%-- xfs_alloc_ag_vextent_near
             |           |           xfs_alloc_ag_vextent
             |           |           xfs_alloc_vextent
             |           |           xfs_bmap_btalloc
             |           |           xfs_bmap_alloc
             |           |           xfs_bmapi_write
             |           |           xfs_iomap_write_allocate
             |           |           xfs_map_blocks
             |           |           xfs_vm_writepage
             |           |           __writepage
             |           |           write_cache_pages
             |           |           generic_writepages
             |           |           xfs_vm_writepages
             |           |           do_writepages
             |           |           __writeback_single_inode
             |           |           writeback_sb_inodes
             |           |           __writeback_inodes_wb
             |           |           wb_writeback
             |           |           wb_do_writeback
             |           |           wb_workfn
             |           |           process_one_work
             |           |           worker_thread
             |           |           kthread
             |           |           ret_from_fork
             |            --0.08%-- [...]
             |

-chris
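[A profile of this shape can be captured with generic perf usage along these lines - not necessarily the exact invocation Chris used:]

    # system-wide call-graph samples for ~30 seconds of the fs_mark run
    perf record -a -g -- sleep 30

    # text report with the recorded call chains
    perf report --stdio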
On Sun, Sep 13, 2015 at 09:12:44AM -0400, Chris Mason wrote:
> On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> > I don't think the XFS numbers can be trusted too much since it was
> > basically bottlenecked behind that single pegged CPU. It was bouncing
> > around and I couldn't quite track it down to a process name (or perf
> > profile).
>
> I'll do more runs Monday, but I was able to grab a perf profile of the
> pegged XFS CPU. It was just the writeback worker thread, and it hit
> btrfs differently because we defer more of this stuff to endio workers,
> effectively spreading it out over more CPUs.
>
> With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
> Here's one of the profiles, but it bounced around a lot so I wouldn't
> use this to actually tune anything:

mkfs.xfs -d agcount=64 ....

Cheers,

Dave.
On Sat, Sep 12, 2015 at 07:00:27PM -0400, Chris Mason wrote:
> On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > For 4.3 timeframes, what runs do you want to see numbers for:
> > >
> > > 1) revert
> > > 2) my hack
> > > 3) plug over multiple sbs (on different devices)
> > > 4) ?
> >
> > Just 2 or 3.
> >
> > I don't think the plain revert is all that interesting, and I think
> > the "anything else" is far too late for this merge window.
>
> I did the plain revert as well, just to have a baseline. This box is a
> little different from Dave's: bare metal, two-socket (E5-2660 v2 @
> 2.20GHz) with 144GB of RAM. I have two PCIe flash devices, one NVMe and
> one Fusion-io, and I put one FS on each device (two mounts total).
>
> The test created 1.6M files, 4K each. I used Dave's fs_mark command
> line, spread out over 16 directories from each mounted filesystem. In
> btrfs they are spread over subvolumes to cut down lock contention.
>
> I need to change around the dirty ratios more to smooth out the IO, and
> I had trouble with both XFS and btrfs getting runs that were not CPU
> bound. I included the time to run sync at the end of the run because
> the results were not very consistent without it.
>
> The XFS runs generally had one CPU pegged at 100%, and I think this is
> throwing off the results. On Monday I'll redo them with two (four?)
> filesystems per flash device; hopefully that'll break things up.
>
> The btrfs runs generally had all the CPUs pegged at 100%. I switched to
> mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
> CPU utilization.
>
>                     wall time   fs_mark files/sec   bytes written/sec
>
> XFS:
>   baseline v4.2:    5m6s        118,578             272MB/s
>   Dave's patch:     4m46s       151,421             294MB/s
>   my hack:          5m5s        150,714             275MB/s
>   Linus plug:       5m15s       147,735             266MB/s
>
> Btrfs (nodatasum):
>   baseline v4.2:    4m39s       242,643             313MB/s
>   Dave's patch:     3m46s       252,452             389MB/s
>   my hack:          3m48s       257,924             379MB/s
>   Linus plug:       3m58s       247,528             369MB/s

Really need to run these numbers on slower disks where block layer
merging makes a difference to performance. The high level plugging
improves performance on spinning disks by a huge amount on XFS because
the merging reduces the number of IOs issued to disk by two orders of
magnitude. Plugging makes comparatively little difference on devices
that can sustain extremely high IOPS and hence sink the tens to
hundreds of thousands of individual 4k IOs that this workload generates
through writeback.

i.e. while throughput increases, those aren't the numbers that matter
here - it's the change in write IO behaviour that needs to be
categorised and measured for the different patches...

(I'm on holidays, so I'm not going to get to this any time soon)

Cheers,

Dave.
On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Really need to run these numbers on slower disks where block layer
> merging makes a difference to performance.

Yeah. We've seen plugging and io schedulers not make much difference
for high-performance flash (although I think the people who argued
that noop should generally be used for non-rotating media were wrong -
the elevator ends up still being critical to merging, and while
merging isn't a life-or-death situation, it tends to still help).

For rotating rust with nasty seek times, the plugging is likely to
make the biggest difference.

                 Linus
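[The elevator being discussed is a per-device, runtime-switchable setting on kernels of this era; for a device on the legacy request path it can be inspected and changed via sysfs - the device name here is a placeholder:]

    # the bracketed entry is the active scheduler
    cat /sys/block/sda/queue/scheduler

    # switch the device to noop (or back to deadline/cfq)
    echo noop > /sys/block/sda/queue/scheduler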
On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Really need to run these numbers on slower disks where block layer
> > merging makes a difference to performance.
>
> Yeah. We've seen plugging and io schedulers not make much difference
> for high-performance flash (although I think the people who argued
> that noop should generally be used for non-rotating media were wrong -
> the elevator ends up still being critical to merging, and while
> merging isn't a life-or-death situation, it tends to still help).

Yeah, my big concern was that holding the plug longer would result in
lower overall perf because we weren't keeping the flash busy. So I
started with the flash boxes to make sure we weren't regressing past
4.2 levels at least.

I'm still worried about that, but this probably isn't the right
benchmark to show it. And if it's really a problem, it'll happen
everywhere we plug and not just here.

> For rotating rust with nasty seek times, the plugging is likely to
> make the biggest difference.

For rotating storage, I grabbed a big box and did the fs_mark run
against 8 spindles. These are all behind a megaraid card as JBODs, so I
flipped the card's cache to write-through.

I changed around the run a bit, making enough files for fs_mark to run
for ~10 minutes, and I took out the sync. I ran only XFS to cut down on
the iterations, and after the fs_mark run I did a short 30 second run
with blktrace in the background to capture the IO sizes.

v4.2:     178K files/sec
Chinner:  192K files/sec
Mason:    192K files/sec
Linus:    193K files/sec

I added support to iowatcher to graph IO size, and attached the graph.

Short version, Linus' patch still gives bigger IOs and similar perf to
Dave's original. I should have done the blktrace runs for 60 seconds
instead of 30; I suspect that would even out the average sizes between
the three patches.

-chris
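[The IO-size data comes from blktrace plus iowatcher; a capture-and-graph workflow roughly like the following would reproduce that kind of graph - device and file names are placeholders, and the IO-size graphing assumes Chris' iowatcher change:]

    # trace one of the spindles for 60 seconds while fs_mark is running
    blktrace -d /dev/sdb -w 60 -o fsmark-trace

    # turn the trace into an SVG graph
    iowatcher -t fsmark-trace -o fsmark-trace.svg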
On Wed 16-09-15 11:16:21, Chris Mason wrote:
> On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> > On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Really need to run these numbers on slower disks where block layer
> > > merging makes a difference to performance.
> >
> > Yeah. We've seen plugging and io schedulers not make much difference
> > for high-performance flash (although I think the people who argued
> > that noop should generally be used for non-rotating media were wrong -
> > the elevator ends up still being critical to merging, and while
> > merging isn't a life-or-death situation, it tends to still help).
>
> Yeah, my big concern was that holding the plug longer would result in
> lower overall perf because we weren't keeping the flash busy. So I
> started with the flash boxes to make sure we weren't regressing past
> 4.2 levels at least.
>
> I'm still worried about that, but this probably isn't the right
> benchmark to show it. And if it's really a problem, it'll happen
> everywhere we plug and not just here.
>
> > For rotating rust with nasty seek times, the plugging is likely to
> > make the biggest difference.
>
> For rotating storage, I grabbed a big box and did the fs_mark run
> against 8 spindles. These are all behind a megaraid card as JBODs, so I
> flipped the card's cache to write-through.
>
> I changed around the run a bit, making enough files for fs_mark to run
> for ~10 minutes, and I took out the sync. I ran only XFS to cut down on
> the iterations, and after the fs_mark run I did a short 30 second run
> with blktrace in the background to capture the IO sizes.
>
> v4.2:     178K files/sec
> Chinner:  192K files/sec
> Mason:    192K files/sec
> Linus:    193K files/sec
>
> I added support to iowatcher to graph IO size, and attached the graph.
>
> Short version, Linus' patch still gives bigger IOs and similar perf to
> Dave's original. I should have done the blktrace runs for 60 seconds
> instead of 30; I suspect that would even out the average sizes between
> the three patches.

Thanks for the data Chris. So I guess we are fine with what's currently
in, right?

                                Honza
On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > Short version, Linus' patch still gives bigger IOs and similar perf to
> > Dave's original. I should have done the blktrace runs for 60 seconds
> > instead of 30; I suspect that would even out the average sizes between
> > the three patches.
>
> Thanks for the data Chris. So I guess we are fine with what's currently
> in, right?

Looks like it works well to me.

-chris
On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > instead of 30; I suspect that would even out the average sizes between
> > > the three patches.
> >
> > Thanks for the data Chris. So I guess we are fine with what's currently
> > in, right?
>
> Looks like it works well to me.

Graph looks good, though I'll confirm it on my test rig once I get out
from under the pile of email and other stuff that is queued up after
being away for a week...

Cheers,

Dave.
[cc Tejun]

On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > instead of 30; I suspect that would even out the average sizes between
> > > > the three patches.
> > >
> > > Thanks for the data Chris. So I guess we are fine with what's currently
> > > in, right?
> >
> > Looks like it works well to me.
>
> Graph looks good, though I'll confirm it on my test rig once I get
> out from under the pile of email and other stuff that is queued up
> after being away for a week...

I ran some tests in the background while reading other email.....

TL;DR: Results look really bad - not only is the plugging
problematic, baseline writeback performance has regressed
significantly. We need to revert the plugging changes until the
underlying writeback performance regressions are sorted out.

In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:

http://permalink.gmane.org/gmane.linux.kernel/1768786

The test:

$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

#  ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead 0 80000 4096 106938.0 543310 0 160000 4096 102922.7 476362 0 240000 4096 107182.9 538206 0 320000 4096 107871.7 619821 0 400000 4096 99255.6 622021 0 480000 4096 103217.8 609943 0 560000 4096 96544.2 640988 0 640000 4096 100347.3 676237 0 720000 4096 87534.8 483495 0 800000 4096 72577.5 2556920 0 880000 4096 97569.0 646996 <RAM fills here, sustained performance is now dependent on writeback> 0 960000 4096 80147.0 515679 0 1040000 4096 100394.2 816979 0 1120000 4096 91466.5 739009 0 1200000 4096 85868.1 977506 0 1280000 4096 89691.5 715207 0 1360000 4096 52547.5 712810 0 1440000 4096 47999.1 685282 0 1520000 4096 47894.3 697261 0 1600000 4096 47549.4 789977 0 1680000 4096 40029.2 677885 0 1760000 4096 16637.4 12804557 0 1840000 4096 16883.6 24295975 0 1920000 4096 18852.1 22264137 0 2000000 4096 27107.3 10310408 0 2080000 4096 23654.1 19823567 0 2160000 4096 24588.9 19769484 0 2240000 4096 13959.3 40168163 0 2320000 4096 26239.2 17806813 0 2400000 4096 17858.5 24596715 0 2480000 4096 12072.9 38177650 0 2560000 4096 9989.7 43413729 <This is Real Bad. It gets significantly worse as the test goes on:> 0 6880000 4096 11773.4 16470400 0 6960000 4096 5556.5 27325343 0 7040000 4096 13816.6 22440991 0 7120000 4096 7322.4 42086686 0 7200000 4096 8033.0 31182907 0 7280000 4096 10589.4 30876729 0 7360000 4096 15844.5 19903105 0 7440000 4096 9187.7 35041718 0 7520000 4096 11800.9 12254920 0 7600000 4096 20049.9 13052020 0 7680000 4096 17444.9 12055262 0 7760000 4096 25900.6 10633885 0 7840000 4096 6398.8 12014901 0 7920000 4096 12147.4 14655678 <...> 0 9360000 4096 19406.0 5874193 0 9440000 4096 7261.3 14542806 0 9520000 4096 14978.7 21921186 0 9600000 4096 12656.7 10539894 real 10m9.240s user 0m51.817s sys 13m23.874s A quick reading of the IO behaviour shows an average of about 1000 IOPS and 100MB/s, which matches roughly with the average files/s created. There's no obvious peaks and gaps in the IO throughput, but CPU usage has regular gaps where it falls to nothing and the only "usage" is IO wait time. Reverting just Linus' patch, so no plugging at all in the code (i.e. the baseline numbers): FSUse% Count Size Files/sec App Overhead 0 80000 4096 109017.5 586268 0 160000 4096 93548.1 553333 0 240000 4096 104141.7 817182 0 320000 4096 112846.2 597494 0 400000 4096 98628.6 641524 0 480000 4096 86586.1 1274195 0 560000 4096 95965.0 776961 0 640000 4096 84585.9 583330 0 720000 4096 93787.1 695658 0 800000 4096 94112.7 645925 0 880000 4096 92259.2 792950 <RAM fills here, sustained performance is now dependent on writeback> 0 960000 4096 62936.9 763642 0 1040000 4096 22484.8 737471 0 1120000 4096 22038.6 706791 0 1200000 4096 24096.5 680878 0 1280000 4096 23973.1 839826 0 1360000 4096 24761.9 779486 0 1440000 4096 25467.7 976022 0 1520000 4096 25338.7 935552 0 1600000 4096 27484.8 683351 0 1680000 4096 24383.2 3039781 0 1760000 4096 24857.0 2657008 <maintains 13-26k files/s all the way to the end .... > 0 9040000 4096 21438.6 4869317 0 9120000 4096 25387.1 642198 0 9200000 4096 26815.9 727763 0 9280000 4096 25175.5 4458811 0 9360000 4096 19832.8 982122 0 9440000 4096 17643.8 805599 0 9520000 4096 13923.4 7819152 0 9600000 4096 18101.3 1071163 real 7m7.926s user 0m57.976s sys 14m14.620s The baseline of no plugging is a full 3 minutes faster than the plugging behaviour of Linus' patch. The IO behaviour demonstrates that, sustaining between 25-30,000 IOPS and throughput of 130-150MB/s. 
Hence, while Linus' patch does change the IO patterns, it does not result in a performance improvement like the original plugging patch did. So I went back and had a look at my original patch, which I've been using locally for a couple of years and was similar to the original commit. It has this description from when I last updated the perf numbers from testing done on 3.17: | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem, | metadata CRCs enabled. | | Test: | | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d | /mnt/scratch/6 -d /mnt/scratch/7 | | Result: | wall sys create rate Physical write IO | time CPU (avg files/s) IOPS Bandwidth | ----- ----- ------------- ------ --------- | unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s | patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s | improvement -43.8% -13.3% +62.7% -94.6% +86.6% IOWs, what we are seeing here is that the baseline writeback performance has regressed quite significantly since I took these numbers back on 3.17. I'm running on exactly the same test setup; the only difference is the kernel and so the current kernel baseline is ~20% slower than the baseline numbers I have in my patch. So, let's put the original [broken] plugging back in: ..... 0 8960000 4096 28420.0 2129129 0 9040000 4096 26863.0 6893094 0 9120000 4096 46721.4 705927 0 9200000 4096 46384.0 1081627 0 9280000 4096 39692.3 2148938 0 9360000 4096 41976.0 3540489 0 9440000 4096 28025.1 10364440 0 9520000 4096 38580.2 2457019 0 9600000 4096 33526.1 7661316 real 5m12.114s user 0m50.294s sys 12m54.425s While we get back that regression and then some more, it's still much slower than the high level plugging on slightly older kernels. And the performance is, again, highly variable through the midpoint of the test run. e.g: .... 0 4720000 4096 39224.4 9401401 0 4800000 4096 25342.9 18889665 0 4880000 4096 14841.7 36668633 0 4960000 4096 27624.6 17731873 0 5040000 4096 37874.8 11390831 0 5120000 4096 29009.2 12331965 0 5200000 4096 49568.0 4346746 0 5280000 4096 83074.5 877142 0 5360000 4096 32498.8 5401470 0 5440000 4096 55468.1 3188117 0 5520000 4096 19156.0 4112120 0 5600000 4096 44894.0 3186559 .... We have a low of ~15k files/s and a high of 83k files/s whilst averaging about 38k files/s. IOWs, ignoring the plugging, the recent changes to the writeback code have really screwed up writeback performance and changed the way it behaves from a throughput consistency point of view. I'd suggest we should start by reverting all the plugging changes for now, because there are other problems these plugging changes have exposed in the current writeback code that need to be fixed first... Cheers Dave. PS: I was going to run this on an ext4 filesystem to check it wasn't just XFS seeing this, but that reminded me why I don't run this test on ext4: FSUse% Count Size Files/sec App Overhead 5 80000 4096 211668.3 534504 5 160000 4096 191731.8 532701 5 240000 4096 196251.2 489772 5 320000 4096 207597.1 464032 5 400000 4096 121616.0 601156 5 480000 4096 197729.9 573925 <Really, really fast in memory, but background writeback starts here> 5 560000 4096 128698.4 546362 5 640000 4096 137904.3 543718 5 720000 4096 63680.9 498110 5 800000 4096 25825.9 1538213 5 880000 4096 32708.7 632235 <dirty page threshold (20% RAM) is hit here. 
performance sucks from here on in (still 10GB of free RAM) because
the writeback IO pattern is a random 4k write pattern>

 5     960000         4096       6940.2        7250666
 5    1040000         4096       7474.8         711872
 5    1120000         4096       4127.5       14034167
 5    1200000         4096       4822.1         752398
....

Hence it takes around an hour to do a single run of this test on ext4,
because its delayed allocation algorithm does not linearise writes to
small files with similar inode locality like XFS does, and that's the
behaviour that high level plugging really helps.

IOWs, this specific workload does not demonstrate any significant
performance difference on ext4 with the different plugging algorithms
because it does random IO rather than sequential IO.

I also ran it on btrfs, but it's pushing peaks of 15-25,000 IOPS every
30s regardless of whether there is plugging at a high level or not, so
again btrfs behaviour is pretty much unaffected by the high level
plugging on this workload....

Cheers,

Dave.
On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly.

Dave, if you're testing my current -git, the other performance issue
might still be the spinlock thing. I haven't gotten the -tip pull for
the spinlock regression yet, and since you were the only one who
noticed, I didn't consider it critical enough to bypass the normal
schedule, so -rc1 still has the crazy-bad test-and-set loop for
spinlocks.

So it's waiting in the locking fix branch (and is cc'd for stable),
but hasn't reached me yet. If you want to test that theory out, you
can make virt_spin_lock() just unconditionally return false to disable
the bad cmpxchg loop for now.

The plugging IO pauses are interesting, though. Plugging really
*shouldn't* cause that kind of pauses, _regardless_ of what level it
happens on, so I wonder if the patch ends up just exposing some really
basic problem that just normally goes hidden.

Can you match up the IO wait times with just *where* it is waiting? Is
it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?

But it does sound like we should just revert the whole plugging for
now, if only because "it has odd effects".

                 Linus
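[The diagnostic hack Linus is suggesting - making virt_spin_lock() unconditionally return false so the normal qspinlock slowpath is always used instead of the hypervisor test-and-set fallback - would look roughly like this against the x86 qspinlock header of that vintage; a debugging aid only, not a fix:]

    /* arch/x86/include/asm/qspinlock.h -- diagnostic hack, not for merging */
    static inline bool virt_spin_lock(struct qspinlock *lock)
    {
            /* never fall back to the test-and-set loop under a hypervisor */
            return false;
    }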
On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > TL;DR: Results look really bad - not only is the plugging
> > problematic, baseline writeback performance has regressed
> > significantly.
>
> Dave, if you're testing my current -git, the other performance issue
> might still be the spinlock thing.

I have the fix as the first commit in my local tree - it'll remain
there until I get a conflict after an update. :)

> The plugging IO pauses are interesting, though. Plugging really
> *shouldn't* cause that kind of pauses, _regardless_ of what level it
> happens on, so I wonder if the patch ends up just exposing some really
> basic problem that just normally goes hidden.

Right, that's what I suspect - it didn't happen on older kernels, but
we've just completely reworked the writeback code for the control group
awareness since I last looked really closely at this...

> Can you match up the IO wait times with just *where* it is waiting? Is
> it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?

I'll do some more investigation.

> But it does sound like we should just revert the whole plugging for
> now, if only because "it has odd effects".

Yup - we can add it again next merge window once we get to the bottom
of whatever is going on and have had time to test the new code
properly.

Cheers,

Dave.
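[One low-effort way to answer the "where is it waiting" question is to watch the writeback tracepoints and dump the stacks of blocked tasks while a stall is happening - generic tooling, not a specific recipe from the thread:]

    # follow the writeback tracepoints for a stretch of the test
    trace-cmd record -e writeback sleep 30
    trace-cmd report | less

    # dump stacks of all tasks in uninterruptible sleep to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200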
On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > [cc Tejun] > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote: > > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote: > > > > On Wed 16-09-15 11:16:21, Chris Mason wrote: > > > > > Short version, Linus' patch still gives bigger IOs and similar perf to > > > > > Dave's original. I should have done the blktrace runs for 60 seconds > > > > > instead of 30, I suspect that would even out the average sizes between > > > > > the three patches. > > > > > > > > Thanks for the data Chris. So I guess we are fine with what's currently in, > > > > right? > > > > > > Looks like it works well to me. > > > > Graph looks good, though I'll confirm it on my test rig once I get > > out from under the pile of email and other stuff that is queued up > > after being away for a week... > > I ran some tests in the background while reading other email..... > > TL;DR: Results look really bad - not only is the plugging > problematic, baseline writeback performance has regressed > significantly. We need to revert the plugging changes until the > underlying writeback performance regressions are sorted out. > > In more detail, these tests were run on my usual 16p/16GB RAM > performance test VM with storage set up as described here: > > https://urldefense.proofpoint.com/v1/url?u=http://permalink.gmane.org/gmane.linux.kernel/1768786&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=6%2FL0lzzDhu0Y1hL9xm%2BQyA%3D%3D%0A&m=4Qwp5Zj8CpoMb6vOcz%2FNMQ%2Fsb0%2FamLUP1vqWgedxJL0%3D%0A&s=90b54e35a4a7fcc4bcab9e15e22c025c7c9e045541e4923500f2e3258fc1952b > > The test: > > $ ~/tests/fsmark-10-4-test-xfs.sh > meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=0 > data = bsize=4096 blocks=134217727500, imaxpct=1 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 ftype=1 > log =internal log bsize=4096 blocks=131072, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > FSUse% Count Size Files/sec App Overhead > 0 80000 4096 106938.0 543310 > 0 160000 4096 102922.7 476362 > 0 240000 4096 107182.9 538206 > 0 320000 4096 107871.7 619821 > 0 400000 4096 99255.6 622021 > 0 480000 4096 103217.8 609943 > 0 560000 4096 96544.2 640988 > 0 640000 4096 100347.3 676237 > 0 720000 4096 87534.8 483495 > 0 800000 4096 72577.5 2556920 > 0 880000 4096 97569.0 646996 > > <RAM fills here, sustained performance is now dependent on writeback> I think too many variables have changed here. 
My numbers: FSUse% Count Size Files/sec App Overhead 0 160000 4096 356407.1 1458461 0 320000 4096 368755.1 1030047 0 480000 4096 358736.8 992123 0 640000 4096 361912.5 1009566 0 800000 4096 342851.4 1004152 0 960000 4096 358357.2 996014 0 1120000 4096 338025.8 1004412 0 1280000 4096 354440.3 997380 0 1440000 4096 335225.9 1000222 0 1600000 4096 278786.1 1164962 0 1760000 4096 268161.4 1205255 0 1920000 4096 259158.0 1298054 0 2080000 4096 276939.1 1219411 0 2240000 4096 252385.1 1245496 0 2400000 4096 280674.1 1189161 0 2560000 4096 290155.4 1141941 0 2720000 4096 280842.2 1179964 0 2880000 4096 272446.4 1155527 0 3040000 4096 268827.4 1235095 0 3200000 4096 251767.1 1250006 0 3360000 4096 248339.8 1235471 0 3520000 4096 267129.9 1200834 0 3680000 4096 257320.7 1244854 0 3840000 4096 233540.8 1267764 0 4000000 4096 269237.0 1216324 0 4160000 4096 249787.6 1291767 0 4320000 4096 256185.7 1253776 0 4480000 4096 257849.7 1212953 0 4640000 4096 253933.9 1181216 0 4800000 4096 263567.2 1233937 0 4960000 4096 255666.4 1231802 0 5120000 4096 257083.2 1282893 0 5280000 4096 254285.0 1229031 0 5440000 4096 265561.6 1219472 0 5600000 4096 266374.1 1229886 0 5760000 4096 241003.7 1257064 0 5920000 4096 245047.4 1298330 0 6080000 4096 254771.7 1257241 0 6240000 4096 254355.2 1261006 0 6400000 4096 254800.4 1201074 0 6560000 4096 262794.5 1234816 0 6720000 4096 248103.0 1287921 0 6880000 4096 231397.3 1291224 0 7040000 4096 227898.0 1285359 0 7200000 4096 227279.6 1296340 0 7360000 4096 232561.5 1748248 0 7520000 4096 231055.3 1169373 0 7680000 4096 245738.5 1121856 0 7840000 4096 234961.7 1147035 0 8000000 4096 243973.0 1152202 0 8160000 4096 246292.6 1169527 0 8320000 4096 249433.2 1197921 0 8480000 4096 222576.0 1253650 0 8640000 4096 239407.5 1263257 0 8800000 4096 246037.1 1218109 0 8960000 4096 242306.5 1293567 0 9120000 4096 238525.9 3745133 0 9280000 4096 269869.5 1159541 0 9440000 4096 266447.1 4794719 0 9600000 4096 265748.9 1161584 0 9760000 4096 269067.8 1149918 0 9920000 4096 248896.2 1164112 0 10080000 4096 261342.9 1174536 0 10240000 4096 254778.3 1225425 0 10400000 4096 257702.2 1211634 0 10560000 4096 233972.5 1203665 0 10720000 4096 232647.1 1197486 0 10880000 4096 242320.6 1203984 I can push the dirty threshold lower to try and make sure we end up in the hard dirty limits but none of this is going to be related to the plugging patch. I do see lower numbers if I let the test run even longer, but there are a lot of things in the way that can slow it down as the filesystem gets that big. I'll try again with lower ratios. [ ... ] > The baseline of no plugging is a full 3 minutes faster than the > plugging behaviour of Linus' patch. The IO behaviour demonstrates > that, sustaining between 25-30,000 IOPS and throughput of > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > it does not result in a performance improvement like the original > plugging patch did. > How consistent is this across runs? > So I went back and had a look at my original patch, which I've been > using locally for a couple of years and was similar to the original > commit. It has this description from when I last updated the perf > numbers from testing done on 3.17: > > | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem, > | metadata CRCs enabled. 
> | > | Test: > | > | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d > | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d > | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d > | /mnt/scratch/6 -d /mnt/scratch/7 > | > | Result: > | wall sys create rate Physical write IO > | time CPU (avg files/s) IOPS Bandwidth > | ----- ----- ------------- ------ --------- > | unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s > | patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s > | improvement -43.8% -13.3% +62.7% -94.6% +86.6% > > IOWs, what we are seeing here is that the baseline writeback > performance has regressed quite significantly since I took these > numbers back on 3.17. I'm running on exactly the same test setup; > the only difference is the kernel and so the current kernel baseline > is ~20% slower than the baseline numbers I have in my patch. All of this in a VM, I'd much rather see this reproduced on bare metal. I've had really consistent results with VMs in the past, but there is a huge amount of code between 3.17 and now. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote: > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > > [cc Tejun] > > > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > > > FSUse% Count Size Files/sec App Overhead > > 0 80000 4096 106938.0 543310 > > 0 160000 4096 102922.7 476362 > > 0 240000 4096 107182.9 538206 > > 0 320000 4096 107871.7 619821 > > 0 400000 4096 99255.6 622021 > > 0 480000 4096 103217.8 609943 > > 0 560000 4096 96544.2 640988 > > 0 640000 4096 100347.3 676237 > > 0 720000 4096 87534.8 483495 > > 0 800000 4096 72577.5 2556920 > > 0 880000 4096 97569.0 646996 > > > > <RAM fills here, sustained performance is now dependent on writeback> > > I think too many variables have changed here. > > My numbers: > > FSUse% Count Size Files/sec App Overhead > 0 160000 4096 356407.1 1458461 > 0 320000 4096 368755.1 1030047 > 0 480000 4096 358736.8 992123 > 0 640000 4096 361912.5 1009566 > 0 800000 4096 342851.4 1004152 <snip> > I can push the dirty threshold lower to try and make sure we end up in > the hard dirty limits but none of this is going to be related to the > plugging patch. The point of this test is to drive writeback as hard as possible, not to measure how fast we can create files in memory. i.e. if the test isn't pushing the dirty limits on your machines, then it really isn't putting a meaningful load on writeback, and so the plugging won't make significant difference because writeback isn't IO bound.... > I do see lower numbers if I let the test run even > longer, but there are a lot of things in the way that can slow it down > as the filesystem gets that big. Sure, that's why I hit the dirty limits early in the test - so it measures steady state performance before the fs gets to any significant scalability limits.... > > The baseline of no plugging is a full 3 minutes faster than the > > plugging behaviour of Linus' patch. The IO behaviour demonstrates > > that, sustaining between 25-30,000 IOPS and throughput of > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > > it does not result in a performance improvement like the original > > plugging patch did. > > How consistent is this across runs? That's what I'm trying to work out. I didn't report it until I got consistently bad results - the numbers I reported were from the third time I ran the comparison, and they were representative and reproducable. I also ran my inode creation workload that is similar (but has not data writeback so doesn't go through writeback paths at all) and that shows no change in performance, so this problem (whatever it is) is only manifesting itself through data writeback.... 
The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm trying
to track down now, and once I've got to the bottom of that I should
have some idea of where the performance has gone....

As it is, there are a bunch of other things going wrong with 4.3-rc1+
right now that I'm working through - I haven't updated my kernel tree
for 10 days because I've been away on holidays, so I'm doing my usual
"-rc1 is broken again" dance that I do every release cycle. (e.g.
every second boot hangs because systemd appears to be waiting for
iscsi devices to appear without first starting the iscsi target
daemon. Never happened before today; every new kernel I've booted
today has hung on the first cold boot of the VM.)

> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17. I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
>
> All of this in a VM, I'd much rather see this reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.

I'm pretty sure it's not the VM - with the locking fix in place,
everything else I've looked at (since applying the locking fix Linus
mentioned) is within measurement error compared to 4.2. The only thing
that is out of whack from a performance POV is data writeback....

Cheers,

Dave.
On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > >
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > >
> > > FSUse%  Count  Size  Files/sec  App Overhead
> > > 0  80000   4096  106938.0  543310
> > > 0  160000  4096  102922.7  476362
> > > 0  240000  4096  107182.9  538206
> > > 0  320000  4096  107871.7  619821
> > > 0  400000  4096  99255.6   622021
> > > 0  480000  4096  103217.8  609943
> > > 0  560000  4096  96544.2   640988
> > > 0  640000  4096  100347.3  676237
> > > 0  720000  4096  87534.8   483495
> > > 0  800000  4096  72577.5   2556920
> > > 0  880000  4096  97569.0   646996
> > >
> > > <RAM fills here, sustained performance is now dependent on writeback>
> >
> > I think too many variables have changed here.
> >
> > My numbers:
> >
> > FSUse%  Count  Size  Files/sec  App Overhead
> > 0  160000  4096  356407.1  1458461
> > 0  320000  4096  368755.1  1030047
> > 0  480000  4096  358736.8  992123
> > 0  640000  4096  361912.5  1009566
> > 0  800000  4096  342851.4  1004152
>
> <snip>
>
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
>
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory. i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound....

It does end up IO bound on my rig, just because we do eventually hit the
dirty limits. Otherwise there would be zero benefit in fs_mark from any
patches vs plain v4.2.

But I set up a run last night with dirty_ratio_bytes at 3G and
dirty_background_ratio_bytes at 1.5G.
There is definitely variation, but nothing like what you saw:

FSUse%  Count  Size  Files/sec  App Overhead
0  160000    4096  317427.9  1524951
0  320000    4096  319723.9  1023874
0  480000    4096  336696.4  1053884
0  640000    4096  257113.1  1190851
0  800000    4096  257644.2  1198054
0  960000    4096  254896.6  1225610
0  1120000   4096  241052.6  1203227
0  1280000   4096  214961.2  1386236
0  1440000   4096  239985.7  1264659
0  1600000   4096  232174.3  1310018
0  1760000   4096  250477.9  1227289
0  1920000   4096  221500.9  1276223
0  2080000   4096  235212.1  1284989
0  2240000   4096  238580.2  1257260
0  2400000   4096  224182.6  1326821
0  2560000   4096  234628.7  1236402
0  2720000   4096  244675.3  1228400
0  2880000   4096  234364.0  1268408
0  3040000   4096  229712.6  1306148
0  3200000   4096  241170.5  1254490
0  3360000   4096  220487.8  1331456
0  3520000   4096  215831.7  1313682
0  3680000   4096  210934.7  1235750
0  3840000   4096  218435.4  1258077
0  4000000   4096  232127.7  1271555
0  4160000   4096  212017.6  1381525
0  4320000   4096  216309.3  1370558
0  4480000   4096  239072.4  1269086
0  4640000   4096  221959.1  1333164
0  4800000   4096  228396.8  1213160
0  4960000   4096  225747.5  1318503
0  5120000   4096  115727.0  1237327
0  5280000   4096  184171.4  1547357
0  5440000   4096  209917.8  1380510
0  5600000   4096  181074.7  1391764
0  5760000   4096  263516.7  1155172
0  5920000   4096  236405.8  1239719
0  6080000   4096  231587.2  1221408
0  6240000   4096  237118.8  1244272
0  6400000   4096  236773.2  1201428
0  6560000   4096  243987.5  1240527
0  6720000   4096  232428.0  1283265
0  6880000   4096  234839.9  1209152
0  7040000   4096  234947.3  1223456
0  7200000   4096  231463.1  1260628
0  7360000   4096  226750.3  1290098
0  7520000   4096  213632.0  1236409
0  7680000   4096  194710.2  1411595
0  7840000   4096  213963.1  4146893
0  8000000   4096  225109.8  1323573
0  8160000   4096  251322.1  1380271
0  8320000   4096  220167.2  1159390
0  8480000   4096  210991.2  1110593
0  8640000   4096  197922.8  1126072
0  8800000   4096  203539.3  1143501
0  8960000   4096  193041.7  1134329
0  9120000   4096  184667.9  1119222
0  9280000   4096  165968.7  1172738
0  9440000   4096  192767.3  1098361
0  9600000   4096  227115.7  1158097
0  9760000   4096  232139.8  1264245
0  9920000   4096  213320.5  1270505
0  10080000  4096  217013.4  1324569
0  10240000  4096  227171.6  1308668
0  10400000  4096  208591.4  1392098
0  10560000  4096  212006.0  1359188
0  10720000  4096  213449.3  1352084
0  10880000  4096  219890.1  1326240
0  11040000  4096  215907.7  1239180
0  11200000  4096  214207.2  1334846
0  11360000  4096  212875.2  1338429
0  11520000  4096  211690.0  1249519
0  11680000  4096  217013.0  1262050
0  11840000  4096  204730.1  1205087
0  12000000  4096  191146.9  1188635
0  12160000  4096  207844.6  1157033
0  12320000  4096  208857.7  1168111
0  12480000  4096  198256.4  1388368
0  12640000  4096  214996.1  1305412
0  12800000  4096  212332.9  1357814
0  12960000  4096  210325.8  1336127
0  13120000  4096  200292.1  1282419
0  13280000  4096  202030.2  1412105
0  13440000  4096  216553.7  1424076
0  13600000  4096  218721.7  1298149
0  13760000  4096  202037.4  1266877
0  13920000  4096  224032.3  1198159
0  14080000  4096  206105.6  1336489
0  14240000  4096  227540.3  1160841
0  14400000  4096  236921.7  1190394
0  14560000  4096  229343.3  1147451
0  14720000  4096  199435.1  1284374
0  14880000  4096  215177.3  1178542
0  15040000  4096  206194.1  1170832
0  15200000  4096  215762.3  1125633
0  15360000  4096  194511.0  1122947
0  15520000  4096  179008.5  1292603
0  15680000  4096  208636.9  1094960
0  15840000  4096  192173.1  1237891
0  16000000  4096  212888.9  1111551
0  16160000  4096  218403.0  1143400
0  16320000  4096  207260.5  1233526
0  16480000  4096  202123.2  1151509
0  16640000  4096  191033.0  1257706
0  16800000  4096  196865.4  1154520
0  16960000  4096  210361.2  1128930
0  17120000  4096  201755.2  1160469
0  17280000  4096  196946.6  1173529
0  17440000  4096  199677.8  1165750
0  17600000  4096  194248.4  1234944
0  17760000  4096  200027.9  1256599
0  17920000  4096  206507.0  1166820
0  18080000  4096  215082.7  1167599
0  18240000  4096  201475.5  1212202
0  18400000  4096  208247.6  1252255
0  18560000  4096  205482.7  1311436
0  18720000  4096  200111.9  1358784
0  18880000  4096  200028.3  1351332
0  19040000  4096  198873.4  1287400
0  19200000  4096  209609.3  1268400
0  19360000  4096  203538.6  1249787
0  19520000  4096  203427.9  1294105
0  19680000  4096  201905.3  1280714
0  19840000  4096  209642.9  1283281
0  20000000  4096  203438.9  1315427
0  20160000  4096  199690.7  1252267
0  20320000  4096  185965.2  1398905
0  20480000  4096  203221.6  1214029
0  20640000  4096  208654.8  1232679
0  20800000  4096  212488.6  1298458
0  20960000  4096  189701.1  1356640
0  21120000  4096  198522.1  1361240
0  21280000  4096  203857.3  1263402
0  21440000  4096  204616.8  1362853
0  21600000  4096  196310.6  1266710
0  21760000  4096  203275.4  1391150
0  21920000  4096  205998.5  1378741
0  22080000  4096  205434.2  1283787
0  22240000  4096  195918.0  1415912
0  22400000  4096  186193.0  1413623
0  22560000  4096  192911.3  1393471
0  22720000  4096  203726.3  1264281
0  22880000  4096  204853.4  1221048
0  23040000  4096  222803.2  1153031
0  23200000  4096  198558.6  1346256
0  23360000  4096  201001.4  1278817
0  23520000  4096  206225.2  1270440
0  23680000  4096  190894.2  1425299
0  23840000  4096  198555.6  1334122
0  24000000  4096  202386.4  1332157
0  24160000  4096  205103.1  1313607

> > I do see lower numbers if I let the test run even
> > longer, but there are a lot of things in the way that can slow it down
> > as the filesystem gets that big.
>
> Sure, that's why I hit the dirty limits early in the test - so it
> measures steady state performance before the fs gets to any
> significant scalability limits....
>
> > > The baseline of no plugging is a full 3 minutes faster than the
> > > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > > that, sustaining between 25-30,000 IOPS and throughput of
> > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> > > it does not result in a performance improvement like the original
> > > plugging patch did.
> >
> > How consistent is this across runs?
>
> That's what I'm trying to work out. I didn't report it until I got
> consistently bad results - the numbers I reported were from the
> third time I ran the comparison, and they were representative and
> reproducible. I also ran my inode creation workload that is similar
> (but has no data writeback so doesn't go through writeback paths at
> all) and that shows no change in performance, so this problem
> (whatever it is) is only manifesting itself through data
> writeback....

The big change between Linus' patch and your patch is that with Linus'
patch kblockd is probably doing most of the actual unplug work (except for
the last super block in the list). If a process is waiting for dirty
writeout progress, it has to wait for that context switch to kblockd. In
the VM, that's going to hurt more than it does on my big two-socket,
mostly idle machine.

> The only measurable change I've noticed in my monitoring graphs is
> that there is a lot more iowait time than I normally see, even when
> the plugging appears to be working as desired. That's what I'm
> trying to track down now, and once I've got to the bottom of that I
> should have some idea of where the performance has gone....
>
> As it is, there are a bunch of other things going wrong with
> 4.3-rc1+ right now that I'm working through - I haven't updated my
> kernel tree for 10 days because I've been away on holidays, so I'm
> doing my usual "-rc1 is broken again" dance that I do every release
> cycle. (e.g. every second boot hangs because systemd appears to be
> waiting for iscsi devices to appear without first starting the iscsi
> target daemon. It never happened before today; every new kernel I've
> booted today has hung on the first cold boot of the VM.)

I've been doing 4.2 plus patches because rc1 didn't boot on this strange
box. Let me nail that down and rerun.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
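Chris' point about kblockd above comes down to which context actually
submits the plugged IO. A simplified, kernel-style sketch of the two flush
paths in the block layer of this era (an illustration only, not code from
any patch in this thread):

/*
 * Simplified sketch (not verbatim kernel source) of how a held blk_plug
 * gets flushed.
 */
#include <linux/blkdev.h>
#include <linux/sched.h>
#include <linux/writeback.h>

static void plug_flush_paths(struct address_space *mapping,
                             struct writeback_control *wbc)
{
        struct blk_plug plug;

        blk_start_plug(&plug);

        /* Batch up a pile of writeback requests under the plug. */
        do_writepages(mapping, wbc);

        /*
         * Path 1: if this task blocks while the plug is held (schedule()
         * from a lock wait, inode_sleep_on_writeback(), ...), the
         * scheduler flushes the plug with from_schedule == true and the
         * actual request submission is punted to the kblockd workqueue.
         * Anyone throttling on this IO then also waits for that context
         * switch.  Note that cond_resched() does not take this path, so
         * it leaves a held plug untouched.
         */

        /*
         * Path 2: an explicit blk_finish_plug() (or blk_flush_plug())
         * submits the queued requests synchronously in this task's
         * context, with no kblockd round trip.
         */
        blk_finish_plug(&plug);
}

Keeping the explicit blk_finish_plug() close to where the IO is queued
keeps submission in the flusher's context; holding the plug across sleep
points hands that work, and its latency, to kblockd.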
On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Dave, if you're testing my current -git, the other performance issue
>> might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)

Ok. I'm happy to report that you should get a conflict now, and that the
spinlock code should work well for your virtualized case again.

No updates on the plugging thing yet. I'll wait a bit and follow this
thread to see if somebody comes up with any explanations or theories, in
the hope that we might not need to revert (or at least have a more
targeted change).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 17, 2015 at 12:14:53PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > TL;DR: Results look really bad - not only is the plugging
> > > problematic, baseline writeback performance has regressed
> > > significantly.
> >
> > Dave, if you're testing my current -git, the other performance issue
> > might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
>
> > The plugging IO pauses are interesting, though. Plugging really
> > *shouldn't* cause that kind of pauses, _regardless_ of what level it
> > happens on, so I wonder if the patch ends up just exposing some really
> > basic problem that just normally goes hidden.
>
> Right, that's what I suspect - it didn't happen on older kernels,
> but we've just completely reworked the writeback code for the
> control group awareness since I last looked really closely at
> this...
>
> > Can you match up the IO wait times with just *where* it is
> > waiting? Is it waiting for that inode I_SYNC thing in
> > inode_sleep_on_writeback()?
>
> I'll do some more investigation.

Ok, I'm happy to report there is actually nothing wrong with the plugging
code that is in your tree. I finally tracked the problem I was seeing down
to a misbehaving RAID controller.[*]

With that problem sorted:

kernel        files/s    wall time
3.17          32500      5m54s
4.3-noplug    34400      5m25s
3.17-plug     52900      3m19s
4.3-badplug   60540      3m24s
4.3-rc1       56600      3m23s

So the 3.17/4.3-noplug baselines show no regression - 4.3 is slightly
faster. All the plugging variants show roughly the same improvement and IO
behaviour. These numbers are reproducible and there are no weird
performance inconsistencies during any of the 4.3-rc1 kernel runs.

Hence my numbers and observed behaviour now align with Chris' results, and
so I think we can say the reworked high-level plugging is behaving as we
expected it to.

Cheers,
Dave.

[*] It seems to have a dodgy battery connector, and so has been "losing"
battery backup and changing the cache mode of the HBA from write-back to
write-through. This results in changing from NVRAM performance to SSD
native performance and back again. A small vibration would cause the
connection to the battery to reconnect and the controller would switch
back to write-back mode. The few log entries in the BIOS showed status
changes anywhere from a few seconds apart to minutes apart - enough for
the cache status to change several times during a 5-10 minute benchmark
run. I didn't notice the hardware was playing up because it wasn't
triggering the machine alert indicator through the BIOS like it's supposed
to, so the visible and audible alarms were not being triggered, nor was
the BMC logging the RAID controller cache status changes. In the end, I
noticed it by chance - during a low-level test the behaviour changed very
obviously as one of my dogs ran past the rack. I unplugged everything
inside the server, plugged it all back in, powered it back up and fiddled
with cables until I found what was causing the problem. And having done
this, the BMC is now sending warnings and the audible alarm is working
when the battery is disconnected... :/
On Thu, Sep 17, 2015 at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, I'm happy to report there is actually nothing wrong with the
> plugging code that is in your tree. I finally tracked the problem I
> was seeing down to a misbehaving RAID controller.[*]

Hey, that's great. Just out of interest, try the patch that Chris just sent
that turns the unplugging synchronous for the special "cond_resched()"
case. If that helps your case too, let's just do it, even while we wonder
why kblockd_workqueue messes up so noticeably.

Because regardless of the kblockd_workqueue questions, I think it's nicer
to avoid the cond_resched_lock(), and do the rescheduling while we're not
holding any locks.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8ea7ed411b2..587ac08eabb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1546,12 +1546,15 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
                 .range_cyclic   = 1,
                 .reason         = reason,
         };
+        struct blk_plug plug;
 
+        blk_start_plug(&plug);
         spin_lock(&wb->list_lock);
         if (list_empty(&wb->b_io))
                 queue_io(wb, &work);
         __writeback_inodes_wb(wb, &work);
         spin_unlock(&wb->list_lock);
+        blk_finish_plug(&plug);
 
         return nr_pages - work.nr_pages;
 }
@@ -1579,10 +1582,12 @@ static long wb_writeback(struct bdi_writeback *wb,
         unsigned long oldest_jif;
         struct inode *inode;
         long progress;
+        struct blk_plug plug;
 
         oldest_jif = jiffies;
         work->older_than_this = &oldest_jif;
 
+        blk_start_plug(&plug);
         spin_lock(&wb->list_lock);
         for (;;) {
                 /*
@@ -1662,6 +1667,7 @@ static long wb_writeback(struct bdi_writeback *wb,
                 }
         }
         spin_unlock(&wb->list_lock);
+        blk_finish_plug(&plug);
 
         return nr_pages - work->nr_pages;
 }
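The change Linus refers to above, making the unplug synchronous around the
writeback code's voluntary reschedule point, would look roughly like the
sketch below. This shows the idea only, not the patch Chris actually
posted, and it assumes the blk_flush_plug() helper from this era's
linux/blkdev.h:

/*
 * Sketch: give up the CPU without leaving batched writeback IO stranded
 * in the task's plug.  cond_resched() alone does not flush the plug, so
 * anyone throttling in balance_dirty_pages() would otherwise wait on IO
 * that has not even been issued yet.
 */
#include <linux/blkdev.h>
#include <linux/sched.h>

static void writeback_cond_resched(void)
{
        if (need_resched()) {
                /* Submit our plugged IO synchronously, then reschedule. */
                blk_flush_plug(current);
                cond_resched();
        }
}

Doing this after dropping wb->list_lock, as Linus suggests, keeps both the
flush and the reschedule outside the lock.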