diff mbox

[MMTests] IO metadata on XFS

Message ID 20120704004706.GD19223@dastard (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Chinner July 4, 2012, 12:47 a.m. UTC
On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > > 
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > you have lots of CPU power, there will be little difference in
> > > > > > performance...
> > > > > 
> > > > > When I checked it it could only be called twice, and we'd already
> > > > > optimize away the second call.  I'd defintively like to track down where
> > > > > the performance changes happend, at least to a major version but even
> > > > > better to a -rc or git commit.
> > > > > 
> > > > 
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > > 
> > > > It's rare but on this occasion the test machine is idle so I started an
> > > > automated git bisection. As you know the milage with an automated bisect
> > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > apparently unrelated patch that caused the problem.
> > > > 
> > > 
> > > It was obvious very quickly that there were two distinct regression so I
> > > ran two bisections. One led to a XFS and the other led to an i915 patch
> > > that enables RC6 to reduce power usage.
> > > 
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > 
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
> > 
> 
> The fact it has an impact at all is weird but lets see what the DRI
> folks think about it.
> 
> > > [c999a223: xfs: introduce an allocation workqueue]
> > 
> > Which indicates that there is workqueue scheduling issues, I think.
> > The same amount of work is being done, but half of it is being
> > pushed off into a workqueue to avoid stack overflow issues (*).  I
> > tested the above patch in anger on an 8p machine, similar to the
> > machine you saw no regressions on, but the workload didn't drive it
> > to being completely CPU bound (only about 90%) so the allocation
> > work was probably always scheduled quickly.
> 
> What test were you using?

fsmark, dbench, compilebench, and a few fio workloads. Also,
xfstests times each test and I keep track of overall runtime, and
none of those showed any performance differential, either...

Indeed, running on a current 3.5-rc5 tree, my usual fsmark
benchmarks are running at the same numbers I've been seeing since
about 3.0 - somewhere around 18k files/s for a single thread, and
110-115k files/s for 8 threads.

I just ran your variant, and I'm getting about 20kfile/s for a
single thread, which is about right because you're using smaller
directories than I am (22500 files per dir vs 100k in my tests).

> > How many worker threads have been spawned on these machines
> > that are showing the regression?
> 
> 20 or 21 generally. An example list as spotted by top looks like

Pretty normal.

> > What is the context switch rate on the machines whenteh test is running?
.....
> Vanilla average context switch rate	4278.53
> Revert average context switch rate	1095

That seems about right, too.

> > Can you run latencytop to see
> > if there is excessive starvation/wait times for allocation
> > completion?
> 
> I'm not sure what format you are looking for.

Where the context switches are coming from, and how long they are
abeing stalled for. Just to get the context switch locations, you
can use perf on the sched:sched_switch event, but that doesn't give
you stall times. Local testing tells me that about 40% of the
switches are from xfs_alloc_vextent, 55% are from the work threads,
and the rest are CPU idling events, which is exactly as I'd expect.

> > A pert top profile comparison might be informative,
> > too...
> > 
> 
> I'm not sure if this is what you really wanted. I thought an oprofile or
> perf report would have made more sense but I recorded perf top over time
> anyway and it's at the end of the mail.

perf report and oprofile give you CPU usage across the run, it's not
instantaneous and that's where all the interesting information is.
e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or
it might be 100% for 1s - that's the behaviour run profiles cannot
give you insight into....

As it is, the output you posted is nothing unusual.

> For just these XFS tests I've uploaded a tarball of the logs to
> http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Ok, so the main thing I missed when first looking at this is that
you are concerned about single thread regressions. Well, I can't
reproduce your results here. Single threaded with or without the
workqueue based allocation gives me roughly 20k +/-0.5k files/s one
a single disk, a 12 disk RAID0 array and a RAM disk on a 8p/4GB RAM
machine.  That's the same results I've been seeing since I wrote
this patch almost 12 months ago....

So, given that this is a metadata intensive workload, the only
extent allocation is going to be through inode and directory block
allocation. These paths do not consume a large amount of stack, so
we can tell the allocator not to switch to workqueue stack for these
allocations easily.

The patch below does this. It completely removes all the allocation
based context switches from the no-data fsmark workloads being used
for this testing. It makes no noticable difference to performance
here, so I'm interested if it solves the regression you are seeing
on your machines.

Cheers,

Dave.

Comments

Mel Gorman July 4, 2012, 9:51 a.m. UTC | #1
On Wed, Jul 04, 2012 at 10:47:06AM +1000, Dave Chinner wrote:
> > > > <SNIP>
> > > > [c999a223: xfs: introduce an allocation workqueue]
> > > 
> > > Which indicates that there is workqueue scheduling issues, I think.
> > > The same amount of work is being done, but half of it is being
> > > pushed off into a workqueue to avoid stack overflow issues (*).  I
> > > tested the above patch in anger on an 8p machine, similar to the
> > > machine you saw no regressions on, but the workload didn't drive it
> > > to being completely CPU bound (only about 90%) so the allocation
> > > work was probably always scheduled quickly.
> > 
> > What test were you using?
> 
> fsmark, dbench, compilebench, and a few fio workloads. Also,
> xfstests times each test and I keep track of overall runtime, and
> none of those showed any performance differential, either...
> 

Sound. I have some coverage on some of the same tests. When I get to
them I'll keep an eye on the 3.4 figures. It might be due to the disk
I'm using. It's a single disk and nothing to write home about in terms of
performance. It's not exactly XFS's usual target audience.

> Indeed, running on a current 3.5-rc5 tree, my usual fsmark
> benchmarks are running at the same numbers I've been seeing since
> about 3.0 - somewhere around 18k files/s for a single thread, and
> 110-115k files/s for 8 threads.
> 
> I just ran your variant, and I'm getting about 20kfile/s for a
> single thread, which is about right because you're using smaller
> directories than I am (22500 files per dir vs 100k in my tests).
> 

I had data for an fsmark-single test running with 30M files and FWIW the
3.4 performance figures were in line with 3.0 and later kernels.

> > > How many worker threads have been spawned on these machines
> > > that are showing the regression?
> > 
> > 20 or 21 generally. An example list as spotted by top looks like
> 
> Pretty normal.
> 
> > > What is the context switch rate on the machines whenteh test is running?
> .....
> > Vanilla average context switch rate	4278.53
> > Revert average context switch rate	1095
> 
> That seems about right, too.
> 
> > > Can you run latencytop to see
> > > if there is excessive starvation/wait times for allocation
> > > completion?
> > 
> > I'm not sure what format you are looking for.
> 
> Where the context switches are coming from, and how long they are
> abeing stalled for.

Noted. Capturing latency_stats over time is enough to do that. It won't
give a per-process breakdown but in the majority of cases that is not a
problem. Extracting the data is a bit annoying but not impossible and
better than parsing latencytop. Ideally, latencytop would be able to log
data in some sensible format. hmmm.

> Just to get the context switch locations, you
> can use perf on the sched:sched_switch event, but that doesn't give
> you stall times.

No, but both can be captured and roughly correlated with each other
given sufficient motivation.

> Local testing tells me that about 40% of the
> switches are from xfs_alloc_vextent, 55% are from the work threads,
> and the rest are CPU idling events, which is exactly as I'd expect.
> 
> > > A pert top profile comparison might be informative,
> > > too...
> > > 
> > 
> > I'm not sure if this is what you really wanted. I thought an oprofile or
> > perf report would have made more sense but I recorded perf top over time
> > anyway and it's at the end of the mail.
> 
> perf report and oprofile give you CPU usage across the run, it's not
> instantaneous and that's where all the interesting information is.
> e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or
> it might be 100% for 1s - that's the behaviour run profiles cannot
> give you insight into....
> 

Fair point. I'll fix up the timestamping and keep the monitor for future
reference.

> As it is, the output you posted is nothing unusual.
> 

Grand. I had taken a look but I saw nothing particularly unusual but I
also was not 100% sure what I should be looking for.,

> > For just these XFS tests I've uploaded a tarball of the logs to
> > http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz
> 
> Ok, so the main thing I missed when first looking at this is that
> you are concerned about single thread regressions.

In this specific test, yes. In the original data I posted I had threaded
benchmarks but they did not show the regression. This was a rerun with
just the single threaded case. Generally I run both because I see bug
reports involving both types of test.

> Well, I can't
> reproduce your results here. Single threaded with or without the
> workqueue based allocation gives me roughly 20k +/-0.5k files/s one
> a single disk, a 12 disk RAID0 array and a RAM disk on a 8p/4GB RAM
> machine.  That's the same results I've been seeing since I wrote
> this patch almost 12 months ago....
> 
> So, given that this is a metadata intensive workload, the only
> extent allocation is going to be through inode and directory block
> allocation. These paths do not consume a large amount of stack, so
> we can tell the allocator not to switch to workqueue stack for these
> allocations easily.
> 
> The patch below does this. It completely removes all the allocation
> based context switches from the no-data fsmark workloads being used
> for this testing. It makes no noticable difference to performance
> here, so I'm interested if it solves the regression you are seeing
> on your machines.
> 

It does. nodefer-metadata is just your patch applied on top of 3.4 and is
the right-most column. It's within the noise for the reverted patches and
approximately the same performance as 3.3. If you look at the timing
stats at the bottom you'll see that hte patch brings the System time way
down so consider this a

Tested-by: Mel Gorman <mgorman@suse.de>

FS-Mark Single Threaded
                 fsmark-single      single-3.4.0      single-3.4.0      single-3.4.0      single-3.4.0
                 3.4.0-vanilla   revert-aa464191   revert-c999a223       revert-both  nodefer-metadata
Files/s  min       14176.40 ( 0.00%)    17830.60 (25.78%)    24186.70 (70.61%)      25108.00 (77.11%)    25448.40 (79.51%)
Files/s  mean      16783.35 ( 0.00%)    25029.69 (49.13%)    37513.72 (123.52%)     38169.97 (127.43%)   36393.09 (116.84%)
Files/s  stddev     1007.26 ( 0.00%)     2644.87 (162.58%)    5344.99 (430.65%)      5599.65 (455.93%)    5961.48 (491.85%)
Files/s  max       18475.40 ( 0.00%)    27966.10 (51.37%)    45564.60 (146.62%)     47918.10 (159.36%)   47146.20 (155.18%)
Overhead min      593978.00 ( 0.00%)   386173.00 (34.99%)   253812.00 (57.27%)     247396.00 (58.35%)   248906.00 (58.10%)
Overhead mean     637782.80 ( 0.00%)   429229.33 (32.70%)   322868.20 (49.38%)     287141.73 (54.98%)   284274.93 (55.43%)
Overhead stddev    72440.72 ( 0.00%)   100056.96 (-38.12%)  175001.08 (-141.58%)   102018.14 (-40.83%)  114055.47 (-57.45%)
Overhead max      855637.00 ( 0.00%)   753541.00 (11.93%)   880531.00 (-2.91%)     637932.00 (25.44%)   710720.00 (16.94%)

MMTests Statistics: duration
Sys Time Running Test (seconds)              44.06     32.25     24.19     23.99     24.38
User+Sys Time Running Test (seconds)         50.19     36.35     27.24      26.7     27.12
Total Elapsed Time (seconds)                 59.21     44.76     34.95     34.14     36.11

Thanks.
diff mbox

Patch

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index f654f51..4f33c32 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2434,13 +2434,22 @@  xfs_alloc_vextent_worker(
 	current_restore_flags_nested(&pflags, PF_FSTRANS);
 }
 
-
-int				/* error */
+/*
+ * Data allocation requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. Metadata
+ * requests, OTOH, are generally from low stack usage paths, so avoid the
+ * context switch overhead here.
+ */
+int
 xfs_alloc_vextent(
-	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 
+	if (!args->userdata)
+		return __xfs_alloc_vextent(args);
+
+
 	args->done = &done;
 	INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
 	queue_work(xfs_alloc_wq, &args->work);