Message ID | 1535316795-21560-3-git-send-email-longman@redhat.com (mailing list archive)
---|---
State | Deferred, archived
Series | xfs: Reduce spinlock contention in log space slowpath code
On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
> The current log space reservation code allows multiple wakeups of the
> same sleeping waiter to happen. This is just a waste of cpu time as
> well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag
> is added to track if a task is being woken up and skip the
> wake_up_process() call if the flag is set.
>
> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
> Broadwell system with a small xfs filesystem on ramfs, the performance
> increased from 91,486 jobs/min to 192,666 jobs/min with this change.

Oh, I just noticed you are using a ramfs for this benchmark.

tl;dr: Once you pass a certain point, ramdisks can be *much* slower than SSDs on journal intensive workloads like AIM7. Hence it would be useful to see if you have the same problems on, say, high performance nvme SSDs.

-----

Ramdisks have substantially different log IO completion and wakeup behaviour compared to real storage on real production systems. Basically, ramdisks are synchronous and real storage is asynchronous. That is, on a ramdisk the IO completion is run synchronously in the same task as the IO submission, because the IO is just a memcpy(). Hence a single dispatch thread can only drive an IO queue depth of 1 - there is no concurrency possible. This serialises large parts of the XFS journal - the journal is really an asynchronous IO engine that gets its performance from driving deep IO queues and batching commits while IO is in flight.

Ramdisks also have very low IO latency, which means there's only a very small window for "IO in flight" batching optimisations to be made effectively; it essentially stops such algorithms from working completely. This means the XFS journal behaves very differently on ramdisks when compared to normal storage.
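[Editor's note] The synchronous-vs-asynchronous distinction above can be sketched with a minimal toy model (plain counters, not kernel code; `max_queue_depth` and friends are hypothetical names). It shows why a single submission thread can never drive a queue depth above 1 when completion runs inside the submission path, whereas deferred completion lets IOs accumulate in flight:

```c
#include <assert.h>

/*
 * Toy model, not XFS code: track the IO queue depth one submission
 * thread can drive.  On a ramdisk, completion runs synchronously
 * inside submission (the IO is just a memcpy()), so the observed
 * queue depth can never exceed 1.  With deferred (asynchronous)
 * completion, submitted IOs stay in flight and complete later in a
 * batch, so the queue depth grows with the number of submissions.
 */

static int inflight;
static int max_depth;

static void complete_io(void)
{
	inflight--;
}

static void submit_io(int synchronous)
{
	inflight++;
	if (inflight > max_depth)
		max_depth = inflight;
	if (synchronous)
		complete_io();	/* completes before submit returns */
}

/* Maximum queue depth observed when one thread submits nr IOs. */
int max_queue_depth(int nr, int synchronous)
{
	inflight = 0;
	max_depth = 0;
	for (int i = 0; i < nr; i++)
		submit_io(synchronous);
	while (inflight)
		complete_io();	/* drain async completions as a batch */
	return max_depth;
}
```

In this model the "ramdisk" mode reports a depth of 1 no matter how many IOs are submitted, while the asynchronous mode reports a depth equal to the submission count.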
The submission batching techniques reduce log IOs by a factor of 10-20 under heavy synchronous transaction loads when there is any noticeable journal IO delay - a few tens of microseconds is enough for them to function effectively, but a ramdisk doesn't even have this delay on journal IO. The submission batching also has the effect of reducing log space wakeups by the same factor, because there are fewer IO completions signalling that space has been made available.

Further, when we get async IO completions from real hardware, they get processed in batches by a completion workqueue - this leads to there typically being only a single reservation space update from all the batched IO completions. This tends to reduce log space wakeups due to log IO completion by a factor of 6-8, as the log can have up to 8 concurrent IOs in flight at a time.

And when we throw in the lack of batching, merging and IO completion aggregation of metadata writeback - because ramdisks are synchronous and don't queue or merge adjacent IOs - we end up with lots more contention on the AIL lock and much more frequent log space wakeups (i.e. from log tail movement updates). This further exacerbates the problems the log already has with synchronous IO.

IOWs, log space wakeups on real storage are likely to be 50-100x lower than on a ramdisk for the same metadata and journal intensive workload, and as such those workloads often run faster on real storage than they do on ramdisks.

This can be trivially seen with dbench, a simple IO benchmark that hammers the journal. On a ramdisk, I can only get 2-2.5GB/s throughput from the benchmark before the log bottlenecks at about 20,000 tiny log IOs per second. In comparison, on an old, badly abused Samsung 850EVO SSD, I see 5-6GB/s with 2,000 log IOs per second because of the pipelining and IO batching in the XFS journal async IO engine and the massive reduction in metadata IO due to merging of adjacent IOs in the block layer. i.e. the journal and metadata writeback design allows the filesystem to operate at a much higher synchronous transaction rate than would otherwise be possible, by taking advantage of the IO concurrency that storage provides us with.

So if you use proper storage hardware (e.g. nvme SSD) and/or an appropriately sized log, does the slowpath wakeup contention go away? Can you please test both of these things and report the results so we can properly evaluate the impact of these changes?

Cheers,

Dave.
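[Editor's note] The "IO in flight" batching effect described above can be illustrated with a toy timing model (all names and numbers are hypothetical, not XFS internals): commits that arrive while a journal IO is in flight get aggregated into the next IO, so even a small completion delay collapses many commits into few log IOs, while a zero-latency ramdisk issues one IO per commit:

```c
#include <assert.h>

/*
 * Toy model: transactions commit every `arrival_us` microseconds and a
 * journal IO takes `io_us` to complete.  Commits arriving while an IO
 * is in flight are batched into the next IO.  Returns how many log IOs
 * are issued for `nr` transactions.  With io_us == 0 (a ramdisk),
 * every commit issues its own IO; with even a small delay, the IO
 * count drops by roughly a factor of io_us / arrival_us.
 */
int log_ios_issued(int nr, int arrival_us, int io_us)
{
	int ios = 0;
	int io_done = 0;	/* time the in-flight IO completes */
	int pending = 0;	/* commits batched for the next IO */

	for (int i = 0; i < nr; i++) {
		int now = i * arrival_us;

		pending++;
		if (now >= io_done) {	/* nothing in flight: issue now */
			ios++;
			io_done = now + io_us;
			pending = 0;
		}
	}
	if (pending)
		ios++;		/* flush the final batch */
	return ios;
}
```

For example, 20 commits arriving 5us apart against a 50us journal IO latency produce only 3 log IOs in this model, while the same workload with zero IO latency produces 20.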
On Mon, Aug 27, 2018 at 10:21:34AM +1000, Dave Chinner wrote:
> tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> than SSDs on journal intensive workloads like AIM7. Hence it would be
> useful to see if you have the same problems on, say, high
> performance nvme SSDs.

Note that all these ramdisk issues you mentioned below will also apply to using the pmem driver on nvdimms, which might be a more realistic version. Even worse, at least for cases where the nvdimms aren't actually powerfail dram of some sort with write-through caching and ADR, the latency is going to be much higher than the ramdisk as well.
On 08/26/2018 08:21 PM, Dave Chinner wrote:
> On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
>> The current log space reservation code allows multiple wakeups of the
>> same sleeping waiter to happen. This is just a waste of cpu time as
>> well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag
>> is added to track if a task is being woken up and skip the
>> wake_up_process() call if the flag is set.
>>
>> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
>> Broadwell system with a small xfs filesystem on ramfs, the performance
>> increased from 91,486 jobs/min to 192,666 jobs/min with this change.
>
> Oh, I just noticed you are using a ramfs for this benchmark.
>
> tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> than SSDs on journal intensive workloads like AIM7. Hence it would be
> useful to see if you have the same problems on, say, high
> performance nvme SSDs.

Oh sorry, I made a mistake.

There were some problems with my test configuration. I was actually running the test on a regular enterprise-class disk device mounted on /.

Filesystem                               1K-blocks     Used Available Use% Mounted on
/dev/mapper/rhel_hp--xl420gen9--01-root   52403200 11284408  41118792  22% /

It was not an SSD, nor a ramdisk. I reran the test on a ramdisk; the performance of the patched kernel was 679,880 jobs/min, which was a bit more than double the 285,221 score that I got on a regular disk.

So the filesystem used wasn't tiny, though it is still not very large. The test was supposed to create 16 ramdisks and distribute the test tasks among them. Instead, they were all pounding on the same filesystem, worsening the spinlock contention problem.

Cheers,
Longman
On Mon, Aug 27, 2018 at 12:39:06AM -0700, Christoph Hellwig wrote:
> Note that all these ramdisk issues you mentioned below will also apply
> to using the pmem driver on nvdimms, which might be a more realistic
> version. Even worse, at least for cases where the nvdimms aren't
> actually powerfail dram of some sort with write-through caching and
> ADR, the latency is going to be much higher than the ramdisk as well.

Yes, I realise that. I am expecting that when it comes to optimising for pmem, we'll actually rewrite the journal to map pmem and memcpy() directly rather than go through the buffering and IO layers we currently do, so we can minimise write latency and control concurrency ourselves. Hence I'm not really concerned by performance issues with pmem at this point - most of our users still have traditional storage and will for a long time to come....

Cheers,

Dave.
On Mon, Aug 27, 2018 at 11:34:13AM -0400, Waiman Long wrote:
> Oh sorry, I made a mistake.
>
> There were some problems with my test configuration. I was actually
> running the test on a regular enterprise-class disk device mounted
> on /.
>
> Filesystem                               1K-blocks     Used Available Use% Mounted on
> /dev/mapper/rhel_hp--xl420gen9--01-root   52403200 11284408  41118792  22% /
>
> It was not an SSD, nor a ramdisk. I reran the test on a ramdisk; the
> performance of the patched kernel was 679,880 jobs/min, which was a
> bit more than double the 285,221 score that I got on a regular disk.

Can you please re-run and report the results for each patch on the ramdisk setup? And, please, include the mkfs.xfs or xfs_info output for the ramdisk filesystem so I can see /exactly/ how much concurrency the filesystems are providing to the benchmark you are running.

> So the filesystem used wasn't tiny, though it is still not very large.

50GB is tiny for XFS. Personally, I've been using ~1PB filesystems(*) for the performance testing I've been doing recently...

Cheers,

Dave.

(*) Yes, petabytes. Sparse image files on really fast SSDs are a wonderful thing.
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3b610b687d1..ac1dc8db7112 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -232,8 +232,16 @@ xlog_grant_head_wake(
 			return false;
 
 		*free_bytes -= need_bytes;
+
+		/*
+		 * Skip task that is being waken up already.
+		 */
+		if (tic->t_flags & XLOG_TIC_WAKING)
+			continue;
+
 		trace_xfs_log_grant_wake_up(log, tic);
 		wake_up_process(tic->t_task);
+		tic->t_flags |= XLOG_TIC_WAKING;
 	}
 
 	return true;
@@ -264,6 +272,7 @@ xlog_grant_head_wait(
 		trace_xfs_log_grant_wake(log, tic);
 
 		spin_lock(&head->lock);
+		tic->t_flags &= ~XLOG_TIC_WAKING;
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto shutdown;
 	} while (xlog_space_left(log, &head->grant) < need_bytes);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b5f82cb36202..738df09bf352 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -59,6 +59,7 @@ static inline uint xlog_get_client_id(__be32 i)
  */
 #define XLOG_TIC_INITED		0x1	/* has been initialized */
 #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
+#define XLOG_TIC_WAKING		0x4	/* task is being waken up */
 
 #define XLOG_TIC_FLAGS \
 	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
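[Editor's note] The wakeup-suppression idea in the patch above can be modelled in userspace as a rough sketch (`struct ticket` and `wakeups_issued` are hypothetical stand-ins for the real xlog_ticket and wake_up_process(), not the kernel API): each waiter carries a "waking" flag, the wakeup walk skips any waiter whose flag is already set, and the waiter would clear its own flag once it runs again:

```c
#include <assert.h>

/*
 * Sketch of the XLOG_TIC_WAKING technique: on repeated passes over the
 * wait queue, a waiter that has already been flagged as waking is
 * skipped, so the (relatively expensive) wakeup call is issued at most
 * once per waiter until that waiter runs and clears its flag.
 */
struct ticket {
	int waking;	/* models the XLOG_TIC_WAKING bit */
};

/* Walk the wait queue `passes` times; count wakeup calls issued. */
int wakeups_issued(struct ticket *tics, int n, int passes)
{
	int calls = 0;

	for (int p = 0; p < passes; p++) {
		for (int i = 0; i < n; i++) {
			if (tics[i].waking)
				continue;	/* already being woken */
			calls++;		/* would be wake_up_process() */
			tics[i].waking = 1;
		}
	}
	return calls;
}
```

With 4 waiters and 3 wakeup passes, the flagged version issues 4 wakeup calls instead of the 12 that unconditional waking would issue, which is exactly the redundant-wakeup waste the commit message describes.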
The current log space reservation code allows multiple wakeups of the same sleeping waiter to happen. This is just a waste of cpu time as well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag is added to track if a task is being woken up and skip the wake_up_process() call if the flag is set.

Running the AIM7 fserver workload on a 2-socket 24-core 48-thread Broadwell system with a small xfs filesystem on ramfs, the performance increased from 91,486 jobs/min to 192,666 jobs/min with this change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/xfs_log.c      | 9 +++++++++
 fs/xfs/xfs_log_priv.h | 1 +
 2 files changed, 10 insertions(+)