Message ID | CA+55aFz2f8969nQAR6=TUTQOYA7T9vmcYJKGCyejYWYZ9a5S_w@mail.gmail.com (mailing list archive)
---|---
State | New, archived
On 09/11/2015 04:37 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> How about we instead:
>>
>>  (a) revert that commit d353d7587 as broken (because it clearly is)
>>
>>  (b) add a big honking comment about the fact that we hold 'list_lock'
>>      in writeback_sb_inodes()
>>
>>  (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>>      _outside_ the spinlock.
>
> Ok, I've done (a) and (b) in my tree. And attached is the totally
> untested patch for (c). It looks ObviouslyCorrect(tm), but since this
> is a performance issue, I'm not going to commit it without some more
> ACKs from people.
>
> I obviously think this is a *much* better approach than dropping and
> retaking the lock, but there might be something silly I'm missing.
>
> For example, maybe we want to unplug and replug around the
> "inode_sleep_on_writeback()" in wb_writeback()? So while the revert
> was a no-brainer, this one I really want people to think about.

So we talked about this when we were trying to figure out a solution.
The problem with this approach is that we now have a plug that covers
multiple super blocks (__writeback_inodes_wb() loops through the sb's
and starts writeback), which is likely to give us crappier performance
than no plug at all.

Thanks,

Josef
On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So we talked about this when we were trying to figure out a solution.
> The problem with this approach is that we now have a plug that covers
> multiple super blocks (__writeback_inodes_wb() loops through the sb's
> and starts writeback), which is likely to give us crappier performance
> than no plug at all.

Why would that be?

Either they are on separate disks, and the IO is all independent
anyway, and at most it got started by some (small) CPU-amount later.
Actual throughput should be the same. No?

Or the filesystems are on the same disk, in which case it should
presumably be a win to submit the IO together.

Of course, actual numbers would be the deciding factor if this really
is noticeable. But "cleaner code and saner locking" is definitely an
issue at least for me.

                 Linus
On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Anyway, I'll hold off pushing out the revert too for a while, in the
hope that we'll have actual numbers for or against whatever the
particular solution should be.

I do get the feeling that the whole wb->list_lock needs more loving.
For example, there's that locked_inode_to_wb_and_lock_list() thing
(which might be better off trying to just do "trylock" on it, but
that's a separate issue) showing lock inversion worries.

Maybe we should *not* get that wb->list_lock early at all, and nest it
inside the inode spinlocks, and just move the list_lock locking down a
lot (ie not even try to hold it over big functions that are then
forced to release it anyway).

For example, realistically, it looks like the longest we ever *really*
hold that lock is at the top of the loop of writeback_sb_inodes() -
maybe we could just explicitly have a function that does "find the
first inode that matches this sb and needs writeout activity", and
literally only hold the lock over that function. And *not* take the
lock in the caller at all?

The callers seem to want that lock mainly because they do that
"list_empty(&wb->b_io)" test. But the way our doubly linked lists
work, "list_empty()" is actually something that can be done racily
without the lock held...

That said, I doubt anybody really wants to touch this, so at least for
now we're stuck with either the "plug outside the lock" or the "drop
and retake lock" options. It really would be lovely to have numbers
either way.

<insert Princess Leia gif> "Dave, you're our only hope"

                 Linus
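[For context, the "plug outside the lock" shape being discussed would look roughly like the sketch below - simplified from the 4.2-era fs/fs-writeback.c, not the actual posted patch. The plug is taken before wb->list_lock and released only after the lock is dropped, so any unplug (and the scheduling it may do) happens outside the spinlock.]

    /* Sketch only: not the real patch, internals trimmed to the relevant calls. */
    static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
                                    enum wb_reason reason)
    {
            struct wb_writeback_work work = {
                    .nr_pages       = nr_pages,
                    .sync_mode      = WB_SYNC_NONE,
                    .range_cyclic   = 1,
                    .reason         = reason,
            };
            struct blk_plug plug;

            blk_start_plug(&plug);          /* plug before taking the lock */
            spin_lock(&wb->list_lock);
            if (list_empty(&wb->b_io))
                    queue_io(wb, &work);
            __writeback_inodes_wb(wb, &work);
            spin_unlock(&wb->list_lock);
            blk_finish_plug(&plug);         /* unplug only after dropping it */

            return nr_pages - work.nr_pages;
    }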
On Fri, Sep 11, 2015 at 02:04:13PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
> >
> > So we talked about this when we were trying to figure out a solution.
> > The problem with this approach is that we now have a plug that covers
> > multiple super blocks (__writeback_inodes_wb() loops through the sb's
> > and starts writeback), which is likely to give us crappier performance
> > than no plug at all.
>
> Why would that be? Either they are on separate disks, and the IO is
> all independent anyway, and at most it got started by some (small)
> CPU-amount later. Actual throughput should be the same. No?
>
> Or the filesystems are on the same disk, in which case it should
> presumably be a win to submit the IO together.
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Originally I was worried about the latency impact of holding the plugs
over more than one super with high end flash. I just didn't want to
hold onto the IO for longer than we had to.

But, since this isn't really latency sensitive anyway, if we find we're
not keeping the flash pipelines full the right answer is to short
circuit the plugging in general. I'd agree actual throughput should be
the same.

But benchmarking is the best choice, and I'll be able to reproduce
Dave's original results without too much trouble.

Our thinking for this duct-tape patch was that the lock wasn't very hot
and this was the best immediate compromise between the bug and the perf
improvement. Happy to sign up for pushing the lock higher if that's
what people would rather see.

-chris
On Fri, Sep 11, 2015 at 4:06 PM, Chris Mason <clm@fb.com> wrote:
>
> Originally I was worried about the latency impact of holding the plugs
> over more than one super with high end flash. I just didn't want to
> hold onto the IO for longer than we had to.
>
> But, since this isn't really latency sensitive anyway, if we find we're
> not keeping the flash pipelines full the right answer is to short
> circuit the plugging in general. I'd agree actual throughput should be
> the same.

Yeah, this only triggers for system-wide writeback, so I don't see
that it should be latency-sensitive at that level, afaik.

But hey, it's IO, and I've been surprised by magic pattern
sensitivities before...

                 Linus
On Fri, Sep 11, 2015 at 03:06:18PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Of course, actual numbers would be the deciding factor if this really
> > is noticeable. But "cleaner code and saner locking" is definitely an
> > issue at least for me.
>
> Anyway, I'll hold off pushing out the revert too for a while, in the
> hope that we'll have actual numbers for or against whatever the
> particular solution should be.
>
> I do get the feeling that the whole wb->list_lock needs more loving.
> For example, there's that locked_inode_to_wb_and_lock_list() thing
> (which might be better off trying to just do "trylock" on it, but
> that's a separate issue) showing lock inversion worries.

At the very least, it's kind of sad how many of us were surprised to
find the lock held when Dave's patch was unplugging. If this bug
slipped through, more are going to.

It's also true that N-1 of those people were really surprised about
unplug functions scheduling, so maybe we can't put all the blame on
wb->list_lock.

> Maybe we should *not* get that wb->list_lock early at all, and nest it
> inside the inode spinlocks, and just move the list_lock locking down a
> lot (ie not even try to hold it over big functions that are then
> forced to release it anyway).
>
> For example, realistically, it looks like the longest we ever *really*
> hold that lock is at the top of the loop of writeback_sb_inodes() -
> maybe we could just explicitly have a function that does "find the
> first inode that matches this sb and needs writeout activity", and
> literally only hold the lock over that function. And *not* take the
> lock in the caller at all?
>
> The callers seem to want that lock mainly because they do that
> "list_empty(&wb->b_io)" test. But the way our doubly linked lists
> work, "list_empty()" is actually something that can be done racily
> without the lock held...
>
> That said, I doubt anybody really wants to touch this, so at least for
> now we're stuck with either the "plug outside the lock" or the "drop
> and retake lock" options. It really would be lovely to have numbers
> either way.

For 4.3 timeframes, what runs do you want to see numbers for:

1) revert
2) my hack
3) plug over multiple sbs (on different devices)
4) ?

-chris
On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
>
> For 4.3 timeframes, what runs do you want to see numbers for:
>
> 1) revert
> 2) my hack
> 3) plug over multiple sbs (on different devices)
> 4) ?

Just 2 or 3.

I don't think the plain revert is all that interesting, and I think
the "anything else" is far too late for this merge window.

So we'll go with either (2) your patch (which I obviously don't
_like_, but apart from the ugliness I don't think there's anything
technically wrong with it), or with (3) the "plug across a bigger
area".

So the only issue with (3) is whether that's just "revert plus the
patch I sent out", or whether we should unplug/replug over the "wait
synchronously for an inode" case (iow, the "inode_sleep_on_writeback()").

The existing plug code (that has the spinlock issue) already has a
"wait on inode" case, and did *not* unplug over that call, but
broadening the plugging further now ends up having two of those "wait
synchronously on inode" cases.

Are we really ok with waiting synchronously for an inode while holding
the plug? No chance of deadlock (waiting for IO that we've plugged)?
That issue is true even of the current code, though, and I have _not_
really thought that through, it's just a worry.

                 Linus
On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are we really ok with waiting synchronously for an inode while holding
> the plug? No chance of deadlock (waiting for IO that we've plugged)?
> That issue is true even of the current code, though, and I have _not_
> really thought that through, it's just a worry.

Never mind. We still flush the plug on explicit scheduling events. I
wonder why I thought we got rid of that. Some kind of "senior moment",
I guess.

                 Linus
On Fri, Sep 11, 2015 at 05:52:27PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Are we really ok with waiting synchronously for an inode while holding
> > the plug? No chance of deadlock (waiting for IO that we've plugged)?
> > That issue is true even of the current code, though, and I have _not_
> > really thought that through, it's just a worry.
>
> Never mind. We still flush the plug on explicit scheduling events. I
> wonder why I thought we got rid of that. Some kind of "senior moment",
> I guess.

But flushing on schedule is a little different. It ends up calling
blk_schedule_flush_plug(), which will hand off the work to kblockd
through blk_run_queue_async().

Not a huge deal, but if we're scheduling to wait for that IO, we should
really run the plug ourselves so that we're not waiting for kblockd too.

-chris
On Fri, Sep 11, 2015 at 7:15 PM, Chris Mason <clm@fb.com> wrote:
>
> But flushing on schedule is a little different. It ends up calling
> blk_schedule_flush_plug(), which will hand off the work to kblockd
> through blk_run_queue_async().

I was more worried about some actual deadlock from the changes.

And blk_schedule_flush_plug() is fine in that it doesn't actually
remove the plug, it just schedules the currently plugged pending IO.
So the IO will start from waiting on the inode, but the plug will
still remain for the rest of the writeback, and it all looks like it
should be fine.

And the reason we use kblockd is simple: stack usage. The reschedule
can happen pretty deep on the stack, and we don't actually want to then
cause much more stack use through things like md/raid allocating new
requests etc.

So it all looks fine to me.

Btw, very tangentially related: grepping a bit shows that
"blk_flush_plug()" isn't actually used anywhere any more. Can we get
rid of that?

                 Linus
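[For reference, the two unplug paths being contrasted here look roughly like this in the 4.2/4.3-era block layer - a simplified sketch of block/blk-core.c and include/linux/blkdev.h, trimmed to the relevant calls:]

    /* Explicit unplug: the plugging task submits its pending IO itself. */
    void blk_finish_plug(struct blk_plug *plug)
    {
            if (plug != current->plug)
                    return;
            blk_flush_plug_list(plug, false);  /* from_schedule == false:
                                                * queues are run synchronously */
            current->plug = NULL;
    }

    /*
     * Implicit unplug when the plugging task blocks: the scheduler's
     * submit-work path calls this, and from_schedule == true means the
     * queues are kicked asynchronously (blk_run_queue_async()/kblockd)
     * rather than run directly from a possibly deep schedule() chain.
     */
    static inline void blk_schedule_flush_plug(struct task_struct *tsk)
    {
            struct blk_plug *plug = tsk->plug;

            if (plug)
                    blk_flush_plug_list(plug, true);
    }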
On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> >
> > For 4.3 timeframes, what runs do you want to see numbers for:
> >
> > 1) revert
> > 2) my hack
> > 3) plug over multiple sbs (on different devices)
> > 4) ?
>
> Just 2 or 3.
>
> I don't think the plain revert is all that interesting, and I think
> the "anything else" is far too late for this merge window.

I did the plain revert as well, just to have a baseline. This box is a
little different from Dave's: bare metal, two-socket (E5-2660 v2 @
2.20GHz) with 144GB of RAM. I have two PCIe flash devices, one NVMe and
one Fusion-io, and I put one FS on each device (two mounts total).

The test created 1.6M files, 4K each. I used Dave's fs_mark command
line, spread out over 16 directories from each mounted filesystem. In
btrfs they are spread over subvolumes to cut down lock contention.

I need to change around the dirty ratios more to smooth out the IO, and
I had trouble with both XFS and btrfs getting runs that were not CPU
bound. I included the time to run sync at the end of the run because
the results were not very consistent without it.

The XFS runs generally had one CPU pegged at 100%, and I think this is
throwing off the results. On Monday I'll redo them with two (four?)
filesystems per flash device; hopefully that'll break things up.

The btrfs runs generally had all the CPUs pegged at 100%. I switched to
mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
CPU utilization.

                    wall time   fs_mark files/sec   bytes written/sec

XFS:
  baseline v4.2:    5m6s        118,578             272MB/s
  Dave's patch:     4m46s       151,421             294MB/s
  my hack:          5m5s        150,714             275MB/s
  Linus plug:       5m15s       147,735             266MB/s

Btrfs (nodatasum):
  baseline v4.2:    4m39s       242,643             313MB/s
  Dave's patch:     3m46s       252,452             389MB/s
  my hack:          3m48s       257,924             379MB/s
  Linus plug:       3m58s       247,528             369MB/s

Bottom line, not as conclusive as I'd like. My hack doesn't seem to
hurt, but FS internals are consuming enough CPU that this lock just
isn't showing up.

Linus' plug patch is consistently slower, and I don't have a great
explanation. My guesses: not keeping the flash pipelines full, or the
imbalance between the different-speed flash averaging the overall
result down, or it's my kblockd vs explicit unplug handwaving from
yesterday.

So, next step is either more runs on flash or grab a box with a bunch
of spindles. I'd rather do the spindle runs; I agree with Dave that his
patch should help much more on actual drives.

-chris
On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
>
> I did the plain revert as well, just to have a baseline.

Ahh, I ended up not expecting you to get this done until after rc1 was
out, so in the meantime I just merged my fix instead rather than leave
the expected scheduling-while-atomic problem.

And just as well that you did a baseline, since apparently the numbers
are all over the map. I don't see how your hack and Dave's original
can _possibly_ differ that much, but they clearly did on your xfs
test. So there's probably huge variance that depends on random
details.

I'll leave things as they are until we have something that looks a bit
more believable ;)

                 Linus
On Sat, Sep 12, 2015 at 04:29:06PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
> >
> > I did the plain revert as well, just to have a baseline.
>
> Ahh, I ended up not expecting you to get this done until after rc1 was
> out, so in the meantime I just merged my fix instead rather than leave
> the expected scheduling-while-atomic problem.

Yeah, I wasn't sure I'd be able to do the runs, but it was a rainy
afternoon and this was more fun than cleaning. Really glad something
got in for rc1 either way.

> And just as well that you did a baseline, since apparently the numbers
> are all over the map. I don't see how your hack and Dave's original
> can _possibly_ differ that much, but they clearly did on your xfs
> test. So there's probably huge variance that depends on random
> details.

I don't think the XFS numbers can be trusted too much since it was
basically bottlenecked behind that single pegged CPU. It was bouncing
around and I couldn't quite track it down to a process name (or perf
profile).

The btrfs numbers were much more consistent, but your patch is still a
win over plain 4.2.

> I'll leave things as they are until we have something that looks a bit
> more believable ;)

We can build from here, thanks Linus.

-chris
On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> I don't think the XFS numbers can be trusted too much since it was
> basically bottlenecked behind that single pegged CPU. It was bouncing
> around and I couldn't quite track it down to a process name (or perf
> profile).

I'll do more runs Monday, but I was able to grab a perf profile of the
pegged XFS CPU. It was just the writeback worker thread, and it hit
btrfs differently because we defer more of this stuff to endio workers,
effectively spreading it out over more CPUs.

With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
Here's one of the profiles, but it bounced around a lot so I wouldn't
use this to actually tune anything:

  11.42%  kworker/u82:61  [kernel.kallsyms]  [k] _raw_spin_lock
          |
          ---_raw_spin_lock
             |
             |--83.43%-- xfs_extent_busy_trim
             |           xfs_alloc_compute_aligned
             |           |
             |           |--99.92%-- xfs_alloc_ag_vextent_near
             |           |           xfs_alloc_ag_vextent
             |           |           xfs_alloc_vextent
             |           |           xfs_bmap_btalloc
             |           |           xfs_bmap_alloc
             |           |           xfs_bmapi_write
             |           |           xfs_iomap_write_allocate
             |           |           xfs_map_blocks
             |           |           xfs_vm_writepage
             |           |           __writepage
             |           |           write_cache_pages
             |           |           generic_writepages
             |           |           xfs_vm_writepages
             |           |           do_writepages
             |           |           __writeback_single_inode
             |           |           writeback_sb_inodes
             |           |           __writeback_inodes_wb
             |           |           wb_writeback
             |           |           wb_do_writeback
             |           |           wb_workfn
             |           |           process_one_work
             |           |           worker_thread
             |           |           kthread
             |           |           ret_from_fork
             |            --0.08%-- [...]
             |

-chris
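[A profile of this shape can be captured with generic perf usage along these lines - not necessarily the exact invocation Chris used:]

    # system-wide call-graph samples for ~30 seconds of the fs_mark run
    perf record -a -g -- sleep 30

    # text report with the recorded call chains
    perf report --stdio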
On Sun, Sep 13, 2015 at 09:12:44AM -0400, Chris Mason wrote:
> On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> > I don't think the XFS numbers can be trusted too much since it was
> > basically bottlenecked behind that single pegged CPU. It was bouncing
> > around and I couldn't quite track it down to a process name (or perf
> > profile).
>
> I'll do more runs Monday, but I was able to grab a perf profile of the
> pegged XFS CPU. It was just the writeback worker thread, and it hit
> btrfs differently because we defer more of this stuff to endio workers,
> effectively spreading it out over more CPUs.
>
> With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
> Here's one of the profiles, but it bounced around a lot so I wouldn't
> use this to actually tune anything:

mkfs.xfs -d agcount=64 ....

Cheers,

Dave.
On Sat, Sep 12, 2015 at 07:00:27PM -0400, Chris Mason wrote:
> On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > For 4.3 timeframes, what runs do you want to see numbers for:
> > >
> > > 1) revert
> > > 2) my hack
> > > 3) plug over multiple sbs (on different devices)
> > > 4) ?
> >
> > Just 2 or 3.
> >
> > I don't think the plain revert is all that interesting, and I think
> > the "anything else" is far too late for this merge window.
>
> I did the plain revert as well, just to have a baseline. This box is a
> little different from Dave's: bare metal, two-socket (E5-2660 v2 @
> 2.20GHz) with 144GB of RAM. I have two PCIe flash devices, one NVMe and
> one Fusion-io, and I put one FS on each device (two mounts total).
>
> The test created 1.6M files, 4K each. I used Dave's fs_mark command
> line, spread out over 16 directories from each mounted filesystem. In
> btrfs they are spread over subvolumes to cut down lock contention.
>
> I need to change around the dirty ratios more to smooth out the IO, and
> I had trouble with both XFS and btrfs getting runs that were not CPU
> bound. I included the time to run sync at the end of the run because
> the results were not very consistent without it.
>
> The XFS runs generally had one CPU pegged at 100%, and I think this is
> throwing off the results. On Monday I'll redo them with two (four?)
> filesystems per flash device; hopefully that'll break things up.
>
> The btrfs runs generally had all the CPUs pegged at 100%. I switched to
> mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
> CPU utilization.
>
>                     wall time   fs_mark files/sec   bytes written/sec
>
> XFS:
>   baseline v4.2:    5m6s        118,578             272MB/s
>   Dave's patch:     4m46s       151,421             294MB/s
>   my hack:          5m5s        150,714             275MB/s
>   Linus plug:       5m15s       147,735             266MB/s
>
> Btrfs (nodatasum):
>   baseline v4.2:    4m39s       242,643             313MB/s
>   Dave's patch:     3m46s       252,452             389MB/s
>   my hack:          3m48s       257,924             379MB/s
>   Linus plug:       3m58s       247,528             369MB/s

Really need to run these numbers on slower disks where block layer
merging makes a difference to performance. The high level plugging
improves performance on spinning disks by a huge amount on XFS because
the merging reduces the number of IOs issued to disk by two orders of
magnitude. Plugging makes comparatively little difference on devices
that can sustain extremely high IOPS and hence sink the tens to
hundreds of thousands of individual 4k IOs that this workload generates
through writeback.

i.e. while throughput increases, those aren't the numbers that matter
here - it's the change in write IO behaviour that needs to be
categorised and measured for the different patches...

(I'm on holidays, so I'm not going to get to this any time soon)

Cheers,

Dave.
On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Really need to run these numbers on slower disks where block layer
> merging makes a difference to performance.

Yeah. We've seen plugging and io schedulers not make much difference
for high-performance flash (although I think the people who argued
that noop should generally be used for non-rotating media were wrong -
the elevator ends up still being critical to merging, and while
merging isn't a life-or-death situation, it tends to still help).

For rotating rust with nasty seek times, the plugging is likely to
make the biggest difference.

                 Linus
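[The elevator being discussed is a per-device, runtime-switchable setting on kernels of this era; for a device on the legacy request path it can be inspected and changed via sysfs - the device name here is a placeholder:]

    # the bracketed entry is the active scheduler
    cat /sys/block/sda/queue/scheduler

    # switch the device to noop (or back to deadline/cfq)
    echo noop > /sys/block/sda/queue/scheduler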
On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Really need to run these numbers on slower disks where block layer
> > merging makes a difference to performance.
>
> Yeah. We've seen plugging and io schedulers not make much difference
> for high-performance flash (although I think the people who argued
> that noop should generally be used for non-rotating media were wrong -
> the elevator ends up still being critical to merging, and while
> merging isn't a life-or-death situation, it tends to still help).

Yeah, my big concern was that holding the plug longer would result in
lower overall perf because we weren't keeping the flash busy. So I
started with the flash boxes to make sure we weren't regressing past
4.2 levels at least.

I'm still worried about that, but this probably isn't the right
benchmark to show it. And if it's really a problem, it'll happen
everywhere we plug and not just here.

> For rotating rust with nasty seek times, the plugging is likely to
> make the biggest difference.

For rotating storage, I grabbed a big box and did the fs_mark run
against 8 spindles. These are all behind a megaraid card as JBODs, so I
flipped the card's cache to write-through.

I changed around the run a bit, making enough files for fs_mark to run
for ~10 minutes, and I took out the sync. I ran only XFS to cut down on
the iterations, and after the fs_mark run I did a short 30 second run
with blktrace in the background to capture the IO sizes.

v4.2:     178K files/sec
Chinner:  192K files/sec
Mason:    192K files/sec
Linus:    193K files/sec

I added support to iowatcher to graph IO size, and attached the graph.

Short version, Linus' patch still gives bigger IOs and similar perf to
Dave's original. I should have done the blktrace runs for 60 seconds
instead of 30; I suspect that would even out the average sizes between
the three patches.

-chris
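[The IO-size data comes from blktrace plus iowatcher; a capture-and-graph workflow roughly like the following would reproduce that kind of graph - device and file names are placeholders, and the IO-size graphing assumes Chris' iowatcher change:]

    # trace one of the spindles for 60 seconds while fs_mark is running
    blktrace -d /dev/sdb -w 60 -o fsmark-trace

    # turn the trace into an SVG graph
    iowatcher -t fsmark-trace -o fsmark-trace.svg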
On Wed 16-09-15 11:16:21, Chris Mason wrote:
> On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> > On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Really need to run these numbers on slower disks where block layer
> > > merging makes a difference to performance.
> >
> > Yeah. We've seen plugging and io schedulers not make much difference
> > for high-performance flash (although I think the people who argued
> > that noop should generally be used for non-rotating media were wrong -
> > the elevator ends up still being critical to merging, and while
> > merging isn't a life-or-death situation, it tends to still help).
>
> Yeah, my big concern was that holding the plug longer would result in
> lower overall perf because we weren't keeping the flash busy. So I
> started with the flash boxes to make sure we weren't regressing past
> 4.2 levels at least.
>
> I'm still worried about that, but this probably isn't the right
> benchmark to show it. And if it's really a problem, it'll happen
> everywhere we plug and not just here.
>
> > For rotating rust with nasty seek times, the plugging is likely to
> > make the biggest difference.
>
> For rotating storage, I grabbed a big box and did the fs_mark run
> against 8 spindles. These are all behind a megaraid card as JBODs, so I
> flipped the card's cache to write-through.
>
> I changed around the run a bit, making enough files for fs_mark to run
> for ~10 minutes, and I took out the sync. I ran only XFS to cut down on
> the iterations, and after the fs_mark run I did a short 30 second run
> with blktrace in the background to capture the IO sizes.
>
> v4.2:     178K files/sec
> Chinner:  192K files/sec
> Mason:    192K files/sec
> Linus:    193K files/sec
>
> I added support to iowatcher to graph IO size, and attached the graph.
>
> Short version, Linus' patch still gives bigger IOs and similar perf to
> Dave's original. I should have done the blktrace runs for 60 seconds
> instead of 30; I suspect that would even out the average sizes between
> the three patches.

Thanks for the data Chris. So I guess we are fine with what's currently
in, right?

                                Honza
On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > Short version, Linus' patch still gives bigger IOs and similar perf to
> > Dave's original. I should have done the blktrace runs for 60 seconds
> > instead of 30; I suspect that would even out the average sizes between
> > the three patches.
>
> Thanks for the data Chris. So I guess we are fine with what's currently
> in, right?

Looks like it works well to me.

-chris
On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > instead of 30; I suspect that would even out the average sizes between
> > > the three patches.
> >
> > Thanks for the data Chris. So I guess we are fine with what's currently
> > in, right?
>
> Looks like it works well to me.

Graph looks good, though I'll confirm it on my test rig once I get out
from under the pile of email and other stuff that is queued up after
being away for a week...

Cheers,

Dave.
[cc Tejun]

On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > instead of 30; I suspect that would even out the average sizes between
> > > > the three patches.
> > >
> > > Thanks for the data Chris. So I guess we are fine with what's currently
> > > in, right?
> >
> > Looks like it works well to me.
>
> Graph looks good, though I'll confirm it on my test rig once I get
> out from under the pile of email and other stuff that is queued up
> after being away for a week...

I ran some tests in the background while reading other email.....

TL;DR: Results look really bad - not only is the plugging
problematic, baseline writeback performance has regressed
significantly. We need to revert the plugging changes until the
underlying writeback performance regressions are sorted out.

In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:

http://permalink.gmane.org/gmane.linux.kernel/1768786

The test:

$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

#  ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead 0 80000 4096 106938.0 543310 0 160000 4096 102922.7 476362 0 240000 4096 107182.9 538206 0 320000 4096 107871.7 619821 0 400000 4096 99255.6 622021 0 480000 4096 103217.8 609943 0 560000 4096 96544.2 640988 0 640000 4096 100347.3 676237 0 720000 4096 87534.8 483495 0 800000 4096 72577.5 2556920 0 880000 4096 97569.0 646996 <RAM fills here, sustained performance is now dependent on writeback> 0 960000 4096 80147.0 515679 0 1040000 4096 100394.2 816979 0 1120000 4096 91466.5 739009 0 1200000 4096 85868.1 977506 0 1280000 4096 89691.5 715207 0 1360000 4096 52547.5 712810 0 1440000 4096 47999.1 685282 0 1520000 4096 47894.3 697261 0 1600000 4096 47549.4 789977 0 1680000 4096 40029.2 677885 0 1760000 4096 16637.4 12804557 0 1840000 4096 16883.6 24295975 0 1920000 4096 18852.1 22264137 0 2000000 4096 27107.3 10310408 0 2080000 4096 23654.1 19823567 0 2160000 4096 24588.9 19769484 0 2240000 4096 13959.3 40168163 0 2320000 4096 26239.2 17806813 0 2400000 4096 17858.5 24596715 0 2480000 4096 12072.9 38177650 0 2560000 4096 9989.7 43413729 <This is Real Bad. It gets significantly worse as the test goes on:> 0 6880000 4096 11773.4 16470400 0 6960000 4096 5556.5 27325343 0 7040000 4096 13816.6 22440991 0 7120000 4096 7322.4 42086686 0 7200000 4096 8033.0 31182907 0 7280000 4096 10589.4 30876729 0 7360000 4096 15844.5 19903105 0 7440000 4096 9187.7 35041718 0 7520000 4096 11800.9 12254920 0 7600000 4096 20049.9 13052020 0 7680000 4096 17444.9 12055262 0 7760000 4096 25900.6 10633885 0 7840000 4096 6398.8 12014901 0 7920000 4096 12147.4 14655678 <...> 0 9360000 4096 19406.0 5874193 0 9440000 4096 7261.3 14542806 0 9520000 4096 14978.7 21921186 0 9600000 4096 12656.7 10539894 real 10m9.240s user 0m51.817s sys 13m23.874s A quick reading of the IO behaviour shows an average of about 1000 IOPS and 100MB/s, which matches roughly with the average files/s created. There's no obvious peaks and gaps in the IO throughput, but CPU usage has regular gaps where it falls to nothing and the only "usage" is IO wait time. Reverting just Linus' patch, so no plugging at all in the code (i.e. the baseline numbers): FSUse% Count Size Files/sec App Overhead 0 80000 4096 109017.5 586268 0 160000 4096 93548.1 553333 0 240000 4096 104141.7 817182 0 320000 4096 112846.2 597494 0 400000 4096 98628.6 641524 0 480000 4096 86586.1 1274195 0 560000 4096 95965.0 776961 0 640000 4096 84585.9 583330 0 720000 4096 93787.1 695658 0 800000 4096 94112.7 645925 0 880000 4096 92259.2 792950 <RAM fills here, sustained performance is now dependent on writeback> 0 960000 4096 62936.9 763642 0 1040000 4096 22484.8 737471 0 1120000 4096 22038.6 706791 0 1200000 4096 24096.5 680878 0 1280000 4096 23973.1 839826 0 1360000 4096 24761.9 779486 0 1440000 4096 25467.7 976022 0 1520000 4096 25338.7 935552 0 1600000 4096 27484.8 683351 0 1680000 4096 24383.2 3039781 0 1760000 4096 24857.0 2657008 <maintains 13-26k files/s all the way to the end .... > 0 9040000 4096 21438.6 4869317 0 9120000 4096 25387.1 642198 0 9200000 4096 26815.9 727763 0 9280000 4096 25175.5 4458811 0 9360000 4096 19832.8 982122 0 9440000 4096 17643.8 805599 0 9520000 4096 13923.4 7819152 0 9600000 4096 18101.3 1071163 real 7m7.926s user 0m57.976s sys 14m14.620s The baseline of no plugging is a full 3 minutes faster than the plugging behaviour of Linus' patch. The IO behaviour demonstrates that, sustaining between 25-30,000 IOPS and throughput of 130-150MB/s. 
Hence, while Linus' patch does change the IO patterns, it does not result in a performance improvement like the original plugging patch did. So I went back and had a look at my original patch, which I've been using locally for a couple of years and was similar to the original commit. It has this description from when I last updated the perf numbers from testing done on 3.17: | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem, | metadata CRCs enabled. | | Test: | | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d | /mnt/scratch/6 -d /mnt/scratch/7 | | Result: | wall sys create rate Physical write IO | time CPU (avg files/s) IOPS Bandwidth | ----- ----- ------------- ------ --------- | unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s | patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s | improvement -43.8% -13.3% +62.7% -94.6% +86.6% IOWs, what we are seeing here is that the baseline writeback performance has regressed quite significantly since I took these numbers back on 3.17. I'm running on exactly the same test setup; the only difference is the kernel and so the current kernel baseline is ~20% slower than the baseline numbers I have in my patch. So, let's put the original [broken] plugging back in: ..... 0 8960000 4096 28420.0 2129129 0 9040000 4096 26863.0 6893094 0 9120000 4096 46721.4 705927 0 9200000 4096 46384.0 1081627 0 9280000 4096 39692.3 2148938 0 9360000 4096 41976.0 3540489 0 9440000 4096 28025.1 10364440 0 9520000 4096 38580.2 2457019 0 9600000 4096 33526.1 7661316 real 5m12.114s user 0m50.294s sys 12m54.425s While we get back that regression and then some more, it's still much slower than the high level plugging on slightly older kernels. And the performance is, again, highly variable through the midpoint of the test run. e.g: .... 0 4720000 4096 39224.4 9401401 0 4800000 4096 25342.9 18889665 0 4880000 4096 14841.7 36668633 0 4960000 4096 27624.6 17731873 0 5040000 4096 37874.8 11390831 0 5120000 4096 29009.2 12331965 0 5200000 4096 49568.0 4346746 0 5280000 4096 83074.5 877142 0 5360000 4096 32498.8 5401470 0 5440000 4096 55468.1 3188117 0 5520000 4096 19156.0 4112120 0 5600000 4096 44894.0 3186559 .... We have a low of ~15k files/s and a high of 83k files/s whilst averaging about 38k files/s. IOWs, ignoring the plugging, the recent changes to the writeback code have really screwed up writeback performance and changed the way it behaves from a throughput consistency point of view. I'd suggest we should start by reverting all the plugging changes for now, because there are other problems these plugging changes have exposed in the current writeback code that need to be fixed first... Cheers Dave. PS: I was going to run this on an ext4 filesystem to check it wasn't just XFS seeing this, but that reminded me why I don't run this test on ext4: FSUse% Count Size Files/sec App Overhead 5 80000 4096 211668.3 534504 5 160000 4096 191731.8 532701 5 240000 4096 196251.2 489772 5 320000 4096 207597.1 464032 5 400000 4096 121616.0 601156 5 480000 4096 197729.9 573925 <Really, really fast in memory, but background writeback starts here> 5 560000 4096 128698.4 546362 5 640000 4096 137904.3 543718 5 720000 4096 63680.9 498110 5 800000 4096 25825.9 1538213 5 880000 4096 32708.7 632235 <dirty page threshold (20% RAM) is hit here. 
performance sucks from here on in (still 10GB of free RAM) because
the writeback IO pattern is a random 4k write pattern>

 5     960000         4096       6940.2        7250666
 5    1040000         4096       7474.8         711872
 5    1120000         4096       4127.5       14034167
 5    1200000         4096       4822.1         752398
....

Hence it takes around an hour to do a single run of this test on ext4,
because its delayed allocation algorithm does not linearise writes to
small files with similar inode locality like XFS does, and that's the
behaviour that high level plugging really helps.

IOWs, this specific workload does not demonstrate any significant
performance difference on ext4 with the different plugging algorithms
because it does random IO rather than sequential IO.

I also ran it on btrfs, but it's pushing peaks of 15-25,000 IOPS every
30s regardless of whether there is plugging at a high level or not, so
again btrfs behaviour is pretty much unaffected by the high level
plugging on this workload....

Cheers,

Dave.
On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly.

Dave, if you're testing my current -git, the other performance issue
might still be the spinlock thing. I haven't gotten the -tip pull for
the spinlock regression yet, and since you were the only one who
noticed, I didn't consider it critical enough to bypass the normal
schedule, so -rc1 still has the crazy-bad test-and-set loop for
spinlocks.

So it's waiting in the locking fix branch (and is cc'd for stable),
but hasn't reached me yet. If you want to test that theory out, you
can make virt_spin_lock() just unconditionally return false to disable
the bad cmpxchg loop for now.

The plugging IO pauses are interesting, though. Plugging really
*shouldn't* cause that kind of pauses, _regardless_ of what level it
happens on, so I wonder if the patch ends up just exposing some really
basic problem that just normally goes hidden.

Can you match up the IO wait times with just *where* it is waiting? Is
it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?

But it does sound like we should just revert the whole plugging for
now, if only because "it has odd effects".

                 Linus
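[The diagnostic hack Linus is suggesting - making virt_spin_lock() unconditionally return false so the normal qspinlock slowpath is always used instead of the hypervisor test-and-set fallback - would look roughly like this against the x86 qspinlock header of that vintage; a debugging aid only, not a fix:]

    /* arch/x86/include/asm/qspinlock.h -- diagnostic hack, not for merging */
    static inline bool virt_spin_lock(struct qspinlock *lock)
    {
            /* never fall back to the test-and-set loop under a hypervisor */
            return false;
    }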
On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > TL;DR: Results look really bad - not only is the plugging
> > problematic, baseline writeback performance has regressed
> > significantly.
>
> Dave, if you're testing my current -git, the other performance issue
> might still be the spinlock thing.

I have the fix as the first commit in my local tree - it'll remain
there until I get a conflict after an update. :)

> The plugging IO pauses are interesting, though. Plugging really
> *shouldn't* cause that kind of pauses, _regardless_ of what level it
> happens on, so I wonder if the patch ends up just exposing some really
> basic problem that just normally goes hidden.

Right, that's what I suspect - it didn't happen on older kernels, but
we've just completely reworked the writeback code for the control group
awareness since I last looked really closely at this...

> Can you match up the IO wait times with just *where* it is waiting? Is
> it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?

I'll do some more investigation.

> But it does sound like we should just revert the whole plugging for
> now, if only because "it has odd effects".

Yup - we can add it again next merge window once we get to the bottom
of whatever is going on and have had time to test the new code
properly.

Cheers,

Dave.
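[One low-effort way to answer the "where is it waiting" question is to watch the writeback tracepoints and dump the stacks of blocked tasks while a stall is happening - generic tooling, not a specific recipe from the thread:]

    # follow the writeback tracepoints for a stretch of the test
    trace-cmd record -e writeback sleep 30
    trace-cmd report | less

    # dump stacks of all tasks in uninterruptible sleep to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200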
On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > [cc Tejun] > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote: > > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote: > > > > On Wed 16-09-15 11:16:21, Chris Mason wrote: > > > > > Short version, Linus' patch still gives bigger IOs and similar perf to > > > > > Dave's original. I should have done the blktrace runs for 60 seconds > > > > > instead of 30, I suspect that would even out the average sizes between > > > > > the three patches. > > > > > > > > Thanks for the data Chris. So I guess we are fine with what's currently in, > > > > right? > > > > > > Looks like it works well to me. > > > > Graph looks good, though I'll confirm it on my test rig once I get > > out from under the pile of email and other stuff that is queued up > > after being away for a week... > > I ran some tests in the background while reading other email..... > > TL;DR: Results look really bad - not only is the plugging > problematic, baseline writeback performance has regressed > significantly. We need to revert the plugging changes until the > underlying writeback performance regressions are sorted out. > > In more detail, these tests were run on my usual 16p/16GB RAM > performance test VM with storage set up as described here: > > https://urldefense.proofpoint.com/v1/url?u=http://permalink.gmane.org/gmane.linux.kernel/1768786&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=6%2FL0lzzDhu0Y1hL9xm%2BQyA%3D%3D%0A&m=4Qwp5Zj8CpoMb6vOcz%2FNMQ%2Fsb0%2FamLUP1vqWgedxJL0%3D%0A&s=90b54e35a4a7fcc4bcab9e15e22c025c7c9e045541e4923500f2e3258fc1952b > > The test: > > $ ~/tests/fsmark-10-4-test-xfs.sh > meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=0 > data = bsize=4096 blocks=134217727500, imaxpct=1 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 ftype=1 > log =internal log bsize=4096 blocks=131072, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > FSUse% Count Size Files/sec App Overhead > 0 80000 4096 106938.0 543310 > 0 160000 4096 102922.7 476362 > 0 240000 4096 107182.9 538206 > 0 320000 4096 107871.7 619821 > 0 400000 4096 99255.6 622021 > 0 480000 4096 103217.8 609943 > 0 560000 4096 96544.2 640988 > 0 640000 4096 100347.3 676237 > 0 720000 4096 87534.8 483495 > 0 800000 4096 72577.5 2556920 > 0 880000 4096 97569.0 646996 > > <RAM fills here, sustained performance is now dependent on writeback> I think too many variables have changed here. 
My numbers: FSUse% Count Size Files/sec App Overhead 0 160000 4096 356407.1 1458461 0 320000 4096 368755.1 1030047 0 480000 4096 358736.8 992123 0 640000 4096 361912.5 1009566 0 800000 4096 342851.4 1004152 0 960000 4096 358357.2 996014 0 1120000 4096 338025.8 1004412 0 1280000 4096 354440.3 997380 0 1440000 4096 335225.9 1000222 0 1600000 4096 278786.1 1164962 0 1760000 4096 268161.4 1205255 0 1920000 4096 259158.0 1298054 0 2080000 4096 276939.1 1219411 0 2240000 4096 252385.1 1245496 0 2400000 4096 280674.1 1189161 0 2560000 4096 290155.4 1141941 0 2720000 4096 280842.2 1179964 0 2880000 4096 272446.4 1155527 0 3040000 4096 268827.4 1235095 0 3200000 4096 251767.1 1250006 0 3360000 4096 248339.8 1235471 0 3520000 4096 267129.9 1200834 0 3680000 4096 257320.7 1244854 0 3840000 4096 233540.8 1267764 0 4000000 4096 269237.0 1216324 0 4160000 4096 249787.6 1291767 0 4320000 4096 256185.7 1253776 0 4480000 4096 257849.7 1212953 0 4640000 4096 253933.9 1181216 0 4800000 4096 263567.2 1233937 0 4960000 4096 255666.4 1231802 0 5120000 4096 257083.2 1282893 0 5280000 4096 254285.0 1229031 0 5440000 4096 265561.6 1219472 0 5600000 4096 266374.1 1229886 0 5760000 4096 241003.7 1257064 0 5920000 4096 245047.4 1298330 0 6080000 4096 254771.7 1257241 0 6240000 4096 254355.2 1261006 0 6400000 4096 254800.4 1201074 0 6560000 4096 262794.5 1234816 0 6720000 4096 248103.0 1287921 0 6880000 4096 231397.3 1291224 0 7040000 4096 227898.0 1285359 0 7200000 4096 227279.6 1296340 0 7360000 4096 232561.5 1748248 0 7520000 4096 231055.3 1169373 0 7680000 4096 245738.5 1121856 0 7840000 4096 234961.7 1147035 0 8000000 4096 243973.0 1152202 0 8160000 4096 246292.6 1169527 0 8320000 4096 249433.2 1197921 0 8480000 4096 222576.0 1253650 0 8640000 4096 239407.5 1263257 0 8800000 4096 246037.1 1218109 0 8960000 4096 242306.5 1293567 0 9120000 4096 238525.9 3745133 0 9280000 4096 269869.5 1159541 0 9440000 4096 266447.1 4794719 0 9600000 4096 265748.9 1161584 0 9760000 4096 269067.8 1149918 0 9920000 4096 248896.2 1164112 0 10080000 4096 261342.9 1174536 0 10240000 4096 254778.3 1225425 0 10400000 4096 257702.2 1211634 0 10560000 4096 233972.5 1203665 0 10720000 4096 232647.1 1197486 0 10880000 4096 242320.6 1203984 I can push the dirty threshold lower to try and make sure we end up in the hard dirty limits but none of this is going to be related to the plugging patch. I do see lower numbers if I let the test run even longer, but there are a lot of things in the way that can slow it down as the filesystem gets that big. I'll try again with lower ratios. [ ... ] > The baseline of no plugging is a full 3 minutes faster than the > plugging behaviour of Linus' patch. The IO behaviour demonstrates > that, sustaining between 25-30,000 IOPS and throughput of > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > it does not result in a performance improvement like the original > plugging patch did. > How consistent is this across runs? > So I went back and had a look at my original patch, which I've been > using locally for a couple of years and was similar to the original > commit. It has this description from when I last updated the perf > numbers from testing done on 3.17: > > | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem, > | metadata CRCs enabled. 
> | > | Test: > | > | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d > | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d > | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d > | /mnt/scratch/6 -d /mnt/scratch/7 > | > | Result: > | wall sys create rate Physical write IO > | time CPU (avg files/s) IOPS Bandwidth > | ----- ----- ------------- ------ --------- > | unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s > | patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s > | improvement -43.8% -13.3% +62.7% -94.6% +86.6% > > IOWs, what we are seeing here is that the baseline writeback > performance has regressed quite significantly since I took these > numbers back on 3.17. I'm running on exactly the same test setup; > the only difference is the kernel and so the current kernel baseline > is ~20% slower than the baseline numbers I have in my patch. All of this in a VM, I'd much rather see this reproduced on bare metal. I've had really consistent results with VMs in the past, but there is a huge amount of code between 3.17 and now. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote: > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > > [cc Tejun] > > > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > > > FSUse% Count Size Files/sec App Overhead > > 0 80000 4096 106938.0 543310 > > 0 160000 4096 102922.7 476362 > > 0 240000 4096 107182.9 538206 > > 0 320000 4096 107871.7 619821 > > 0 400000 4096 99255.6 622021 > > 0 480000 4096 103217.8 609943 > > 0 560000 4096 96544.2 640988 > > 0 640000 4096 100347.3 676237 > > 0 720000 4096 87534.8 483495 > > 0 800000 4096 72577.5 2556920 > > 0 880000 4096 97569.0 646996 > > > > <RAM fills here, sustained performance is now dependent on writeback> > > I think too many variables have changed here. > > My numbers: > > FSUse% Count Size Files/sec App Overhead > 0 160000 4096 356407.1 1458461 > 0 320000 4096 368755.1 1030047 > 0 480000 4096 358736.8 992123 > 0 640000 4096 361912.5 1009566 > 0 800000 4096 342851.4 1004152 <snip> > I can push the dirty threshold lower to try and make sure we end up in > the hard dirty limits but none of this is going to be related to the > plugging patch. The point of this test is to drive writeback as hard as possible, not to measure how fast we can create files in memory. i.e. if the test isn't pushing the dirty limits on your machines, then it really isn't putting a meaningful load on writeback, and so the plugging won't make significant difference because writeback isn't IO bound.... > I do see lower numbers if I let the test run even > longer, but there are a lot of things in the way that can slow it down > as the filesystem gets that big. Sure, that's why I hit the dirty limits early in the test - so it measures steady state performance before the fs gets to any significant scalability limits.... > > The baseline of no plugging is a full 3 minutes faster than the > > plugging behaviour of Linus' patch. The IO behaviour demonstrates > > that, sustaining between 25-30,000 IOPS and throughput of > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > > it does not result in a performance improvement like the original > > plugging patch did. > > How consistent is this across runs? That's what I'm trying to work out. I didn't report it until I got consistently bad results - the numbers I reported were from the third time I ran the comparison, and they were representative and reproducable. I also ran my inode creation workload that is similar (but has not data writeback so doesn't go through writeback paths at all) and that shows no change in performance, so this problem (whatever it is) is only manifesting itself through data writeback.... 
The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm trying
to track down now, and once I've got to the bottom of that I should
have some idea of where the performance has gone....

As it is, there are a bunch of other things going wrong with 4.3-rc1+
right now that I'm working through - I haven't updated my kernel tree
for 10 days because I've been away on holidays, so I'm doing my usual
"-rc1 is broken again" dance that I do every release cycle. (e.g.
every second boot hangs because systemd appears to be waiting for
iscsi devices to appear without first starting the iscsi target
daemon. Never happened before today; every new kernel I've booted
today has hung on the first cold boot of the VM.)

> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17. I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
>
> All of this in a VM, I'd much rather see this reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.

I'm pretty sure it's not the VM - with the locking fix in place,
everything else I've looked at (since applying the locking fix Linus
mentioned) is within measurement error compared to 4.2. The only thing
that is out of whack from a performance POV is data writeback....

Cheers,

Dave.
On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > >
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > >
> > > FSUse%  Count  Size  Files/sec  App Overhead
> > > 0  80000   4096  106938.0  543310
> > > 0  160000  4096  102922.7  476362
> > > 0  240000  4096  107182.9  538206
> > > 0  320000  4096  107871.7  619821
> > > 0  400000  4096  99255.6   622021
> > > 0  480000  4096  103217.8  609943
> > > 0  560000  4096  96544.2   640988
> > > 0  640000  4096  100347.3  676237
> > > 0  720000  4096  87534.8   483495
> > > 0  800000  4096  72577.5   2556920
> > > 0  880000  4096  97569.0   646996
> > >
> > > <RAM fills here, sustained performance is now dependent on writeback>
> >
> > I think too many variables have changed here.
> >
> > My numbers:
> >
> > FSUse%  Count  Size  Files/sec  App Overhead
> > 0  160000  4096  356407.1  1458461
> > 0  320000  4096  368755.1  1030047
> > 0  480000  4096  358736.8  992123
> > 0  640000  4096  361912.5  1009566
> > 0  800000  4096  342851.4  1004152
>
> <snip>
>
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
>
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory. i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound....

It does end up IO bound on my rig, just because we do eventually hit the
dirty limits. Otherwise there would be zero benefit in fs_mark from any
patches vs plain v4.2.

But I set up a run last night with dirty_ratio_bytes at 3G and
dirty_background_ratio_bytes at 1.5G.
There is definitely variation, but nothing like what you saw:

FSUse%  Count  Size  Files/sec  App Overhead
0  160000    4096  317427.9  1524951
0  320000    4096  319723.9  1023874
0  480000    4096  336696.4  1053884
0  640000    4096  257113.1  1190851
0  800000    4096  257644.2  1198054
0  960000    4096  254896.6  1225610
0  1120000   4096  241052.6  1203227
0  1280000   4096  214961.2  1386236
0  1440000   4096  239985.7  1264659
0  1600000   4096  232174.3  1310018
0  1760000   4096  250477.9  1227289
0  1920000   4096  221500.9  1276223
0  2080000   4096  235212.1  1284989
0  2240000   4096  238580.2  1257260
0  2400000   4096  224182.6  1326821
0  2560000   4096  234628.7  1236402
0  2720000   4096  244675.3  1228400
0  2880000   4096  234364.0  1268408
0  3040000   4096  229712.6  1306148
0  3200000   4096  241170.5  1254490
0  3360000   4096  220487.8  1331456
0  3520000   4096  215831.7  1313682
0  3680000   4096  210934.7  1235750
0  3840000   4096  218435.4  1258077
0  4000000   4096  232127.7  1271555
0  4160000   4096  212017.6  1381525
0  4320000   4096  216309.3  1370558
0  4480000   4096  239072.4  1269086
0  4640000   4096  221959.1  1333164
0  4800000   4096  228396.8  1213160
0  4960000   4096  225747.5  1318503
0  5120000   4096  115727.0  1237327
0  5280000   4096  184171.4  1547357
0  5440000   4096  209917.8  1380510
0  5600000   4096  181074.7  1391764
0  5760000   4096  263516.7  1155172
0  5920000   4096  236405.8  1239719
0  6080000   4096  231587.2  1221408
0  6240000   4096  237118.8  1244272
0  6400000   4096  236773.2  1201428
0  6560000   4096  243987.5  1240527
0  6720000   4096  232428.0  1283265
0  6880000   4096  234839.9  1209152
0  7040000   4096  234947.3  1223456
0  7200000   4096  231463.1  1260628
0  7360000   4096  226750.3  1290098
0  7520000   4096  213632.0  1236409
0  7680000   4096  194710.2  1411595
0  7840000   4096  213963.1  4146893
0  8000000   4096  225109.8  1323573
0  8160000   4096  251322.1  1380271
0  8320000   4096  220167.2  1159390
0  8480000   4096  210991.2  1110593
0  8640000   4096  197922.8  1126072
0  8800000   4096  203539.3  1143501
0  8960000   4096  193041.7  1134329
0  9120000   4096  184667.9  1119222
0  9280000   4096  165968.7  1172738
0  9440000   4096  192767.3  1098361
0  9600000   4096  227115.7  1158097
0  9760000   4096  232139.8  1264245
0  9920000   4096  213320.5  1270505
0  10080000  4096  217013.4  1324569
0  10240000  4096  227171.6  1308668
0  10400000  4096  208591.4  1392098
0  10560000  4096  212006.0  1359188
0  10720000  4096  213449.3  1352084
0  10880000  4096  219890.1  1326240
0  11040000  4096  215907.7  1239180
0  11200000  4096  214207.2  1334846
0  11360000  4096  212875.2  1338429
0  11520000  4096  211690.0  1249519
0  11680000  4096  217013.0  1262050
0  11840000  4096  204730.1  1205087
0  12000000  4096  191146.9  1188635
0  12160000  4096  207844.6  1157033
0  12320000  4096  208857.7  1168111
0  12480000  4096  198256.4  1388368
0  12640000  4096  214996.1  1305412
0  12800000  4096  212332.9  1357814
0  12960000  4096  210325.8  1336127
0  13120000  4096  200292.1  1282419
0  13280000  4096  202030.2  1412105
0  13440000  4096  216553.7  1424076
0  13600000  4096  218721.7  1298149
0  13760000  4096  202037.4  1266877
0  13920000  4096  224032.3  1198159
0  14080000  4096  206105.6  1336489
0  14240000  4096  227540.3  1160841
0  14400000  4096  236921.7  1190394
0  14560000  4096  229343.3  1147451
0  14720000  4096  199435.1  1284374
0  14880000  4096  215177.3  1178542
0  15040000  4096  206194.1  1170832
0  15200000  4096  215762.3  1125633
0  15360000  4096  194511.0  1122947
0  15520000  4096  179008.5  1292603
0  15680000  4096  208636.9  1094960
0  15840000  4096  192173.1  1237891
0  16000000  4096  212888.9  1111551
0  16160000  4096  218403.0  1143400
0  16320000  4096  207260.5  1233526
0  16480000  4096  202123.2  1151509
0  16640000  4096  191033.0  1257706
0  16800000  4096  196865.4  1154520
0  16960000  4096  210361.2  1128930
0  17120000  4096  201755.2  1160469
0  17280000  4096  196946.6  1173529
0  17440000  4096  199677.8  1165750
0  17600000  4096  194248.4  1234944
0  17760000  4096  200027.9  1256599
0  17920000  4096  206507.0  1166820
0  18080000  4096  215082.7  1167599
0  18240000  4096  201475.5  1212202
0  18400000  4096  208247.6  1252255
0  18560000  4096  205482.7  1311436
0  18720000  4096  200111.9  1358784
0  18880000  4096  200028.3  1351332
0  19040000  4096  198873.4  1287400
0  19200000  4096  209609.3  1268400
0  19360000  4096  203538.6  1249787
0  19520000  4096  203427.9  1294105
0  19680000  4096  201905.3  1280714
0  19840000  4096  209642.9  1283281
0  20000000  4096  203438.9  1315427
0  20160000  4096  199690.7  1252267
0  20320000  4096  185965.2  1398905
0  20480000  4096  203221.6  1214029
0  20640000  4096  208654.8  1232679
0  20800000  4096  212488.6  1298458
0  20960000  4096  189701.1  1356640
0  21120000  4096  198522.1  1361240
0  21280000  4096  203857.3  1263402
0  21440000  4096  204616.8  1362853
0  21600000  4096  196310.6  1266710
0  21760000  4096  203275.4  1391150
0  21920000  4096  205998.5  1378741
0  22080000  4096  205434.2  1283787
0  22240000  4096  195918.0  1415912
0  22400000  4096  186193.0  1413623
0  22560000  4096  192911.3  1393471
0  22720000  4096  203726.3  1264281
0  22880000  4096  204853.4  1221048
0  23040000  4096  222803.2  1153031
0  23200000  4096  198558.6  1346256
0  23360000  4096  201001.4  1278817
0  23520000  4096  206225.2  1270440
0  23680000  4096  190894.2  1425299
0  23840000  4096  198555.6  1334122
0  24000000  4096  202386.4  1332157
0  24160000  4096  205103.1  1313607

> > I do see lower numbers if I let the test run even
> > longer, but there are a lot of things in the way that can slow it down
> > as the filesystem gets that big.
>
> Sure, that's why I hit the dirty limits early in the test - so it
> measures steady state performance before the fs gets to any
> significant scalability limits....
>
> > > The baseline of no plugging is a full 3 minutes faster than the
> > > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > > that, sustaining between 25-30,000 IOPS and throughput of
> > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> > > it does not result in a performance improvement like the original
> > > plugging patch did.
> >
> > How consistent is this across runs?
>
> That's what I'm trying to work out. I didn't report it until I got
> consistently bad results - the numbers I reported were from the
> third time I ran the comparison, and they were representative and
> reproducible. I also ran my inode creation workload that is similar
> (but has no data writeback so doesn't go through writeback paths at
> all) and that shows no change in performance, so this problem
> (whatever it is) is only manifesting itself through data
> writeback....

The big change between Linus' patch and your patch is that with Linus'
patch kblockd is probably doing most of the actual unplug work (except for
the last super block in the list). If a process is waiting for dirty
writeout progress, it has to wait for that context switch to kblockd. In
the VM, that's going to hurt more than it does on my big two-socket,
mostly idle machine.

> The only measurable change I've noticed in my monitoring graphs is
> that there is a lot more iowait time than I normally see, even when
> the plugging appears to be working as desired. That's what I'm
> trying to track down now, and once I've got to the bottom of that I
> should have some idea of where the performance has gone....
>
> As it is, there are a bunch of other things going wrong with
> 4.3-rc1+ right now that I'm working through - I haven't updated my
> kernel tree for 10 days because I've been away on holidays, so I'm
> doing my usual "-rc1 is broken again" dance that I do every release
> cycle. (e.g. every second boot hangs because systemd appears to be
> waiting for iscsi devices to appear without first starting the iscsi
> target daemon. It never happened before today; every new kernel I've
> booted today has hung on the first cold boot of the VM.)

I've been doing 4.2 plus patches because rc1 didn't boot on this strange
box. Let me nail that down and rerun.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
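Chris' point about kblockd above comes down to which context actually
submits the plugged IO. A simplified, kernel-style sketch of the two flush
paths in the block layer of this era (an illustration only, not code from
any patch in this thread):

/*
 * Simplified sketch (not verbatim kernel source) of how a held blk_plug
 * gets flushed.
 */
#include <linux/blkdev.h>
#include <linux/sched.h>
#include <linux/writeback.h>

static void plug_flush_paths(struct address_space *mapping,
                             struct writeback_control *wbc)
{
        struct blk_plug plug;

        blk_start_plug(&plug);

        /* Batch up a pile of writeback requests under the plug. */
        do_writepages(mapping, wbc);

        /*
         * Path 1: if this task blocks while the plug is held (schedule()
         * from a lock wait, inode_sleep_on_writeback(), ...), the
         * scheduler flushes the plug with from_schedule == true and the
         * actual request submission is punted to the kblockd workqueue.
         * Anyone throttling on this IO then also waits for that context
         * switch.  Note that cond_resched() does not take this path, so
         * it leaves a held plug untouched.
         */

        /*
         * Path 2: an explicit blk_finish_plug() (or blk_flush_plug())
         * submits the queued requests synchronously in this task's
         * context, with no kblockd round trip.
         */
        blk_finish_plug(&plug);
}

Keeping the explicit blk_finish_plug() close to where the IO is queued
keeps submission in the flusher's context; holding the plug across sleep
points hands that work, and its latency, to kblockd.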
On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Dave, if you're testing my current -git, the other performance issue
>> might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)

Ok. I'm happy to report that you should get a conflict now, and that the
spinlock code should work well for your virtualized case again.

No updates on the plugging thing yet. I'll wait a bit and follow this
thread to see if somebody comes up with any explanations or theories, in
the hope that we might not need to revert (or at least have a more
targeted change).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 17, 2015 at 12:14:53PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > TL;DR: Results look really bad - not only is the plugging
> > > problematic, baseline writeback performance has regressed
> > > significantly.
> >
> > Dave, if you're testing my current -git, the other performance issue
> > might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
>
> > The plugging IO pauses are interesting, though. Plugging really
> > *shouldn't* cause that kind of pauses, _regardless_ of what level it
> > happens on, so I wonder if the patch ends up just exposing some really
> > basic problem that just normally goes hidden.
>
> Right, that's what I suspect - it didn't happen on older kernels,
> but we've just completely reworked the writeback code for the
> control group awareness since I last looked really closely at
> this...
>
> > Can you match up the IO wait times with just *where* it is
> > waiting? Is it waiting for that inode I_SYNC thing in
> > inode_sleep_on_writeback()?
>
> I'll do some more investigation.

Ok, I'm happy to report there is actually nothing wrong with the plugging
code that is in your tree. I finally tracked the problem I was seeing down
to a misbehaving RAID controller.[*]

With that problem sorted:

kernel        files/s    wall time
3.17          32500      5m54s
4.3-noplug    34400      5m25s
3.17-plug     52900      3m19s
4.3-badplug   60540      3m24s
4.3-rc1       56600      3m23s

So the 3.17/4.3-noplug baselines show no regression - 4.3 is slightly
faster. All the plugging variants show roughly the same improvement and IO
behaviour. These numbers are reproducible and there are no weird
performance inconsistencies during any of the 4.3-rc1 kernel runs.

Hence my numbers and observed behaviour now align with Chris' results, and
so I think we can say the reworked high-level plugging is behaving as we
expected it to.

Cheers,
Dave.

[*] It seems to have a dodgy battery connector, and so has been "losing"
battery backup and changing the cache mode of the HBA from write-back to
write-through. This results in changing from NVRAM performance to SSD
native performance and back again. A small vibration would cause the
connection to the battery to reconnect and the controller would switch
back to write-back mode. The few log entries in the BIOS showed status
changes anywhere from a few seconds apart to minutes apart - enough for
the cache status to change several times during a 5-10 minute benchmark
run. I didn't notice the hardware was playing up because it wasn't
triggering the machine alert indicator through the BIOS like it's supposed
to, so the visible and audible alarms were not being triggered, nor was
the BMC logging the RAID controller cache status changes. In the end, I
noticed it by chance - during a low-level test the behaviour changed very
obviously as one of my dogs ran past the rack. I unplugged everything
inside the server, plugged it all back in, powered it back up and fiddled
with cables until I found what was causing the problem. And having done
this, the BMC is now sending warnings and the audible alarm is working
when the battery is disconnected... :/
On Thu, Sep 17, 2015 at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, I'm happy to report there is actually nothing wrong with the
> plugging code that is in your tree. I finally tracked the problem I
> was seeing down to a misbehaving RAID controller.[*]

Hey, that's great. Just out of interest, try the patch that Chris just sent
that turns the unplugging synchronous for the special "cond_resched()"
case. If that helps your case too, let's just do it, even while we wonder
why kblockd_workqueue messes up so noticeably.

Because regardless of the kblockd_workqueue questions, I think it's nicer
to avoid the cond_resched_lock(), and do the rescheduling while we're not
holding any locks.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8ea7ed411b2..587ac08eabb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1546,12 +1546,15 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
                 .range_cyclic   = 1,
                 .reason         = reason,
         };
+        struct blk_plug plug;
 
+        blk_start_plug(&plug);
         spin_lock(&wb->list_lock);
         if (list_empty(&wb->b_io))
                 queue_io(wb, &work);
         __writeback_inodes_wb(wb, &work);
         spin_unlock(&wb->list_lock);
+        blk_finish_plug(&plug);
 
         return nr_pages - work.nr_pages;
 }
@@ -1579,10 +1582,12 @@ static long wb_writeback(struct bdi_writeback *wb,
         unsigned long oldest_jif;
         struct inode *inode;
         long progress;
+        struct blk_plug plug;
 
         oldest_jif = jiffies;
         work->older_than_this = &oldest_jif;
 
+        blk_start_plug(&plug);
         spin_lock(&wb->list_lock);
         for (;;) {
                 /*
@@ -1662,6 +1667,7 @@ static long wb_writeback(struct bdi_writeback *wb,
                 }
         }
         spin_unlock(&wb->list_lock);
+        blk_finish_plug(&plug);
 
         return nr_pages - work->nr_pages;
 }
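The change Linus refers to above, making the unplug synchronous around the
writeback code's voluntary reschedule point, would look roughly like the
sketch below. This shows the idea only, not the patch Chris actually
posted, and it assumes the blk_flush_plug() helper from this era's
linux/blkdev.h:

/*
 * Sketch: give up the CPU without leaving batched writeback IO stranded
 * in the task's plug.  cond_resched() alone does not flush the plug, so
 * anyone throttling in balance_dirty_pages() would otherwise wait on IO
 * that has not even been issued yet.
 */
#include <linux/blkdev.h>
#include <linux/sched.h>

static void writeback_cond_resched(void)
{
        if (need_resched()) {
                /* Submit our plugged IO synchronously, then reschedule. */
                blk_flush_plug(current);
                cond_resched();
        }
}

Doing this after dropping wb->list_lock, as Linus suggests, keeps both the
flush and the reschedule outside the lock.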