Message ID: 20210920085436.20939-1-mgorman@techsingularity.net (mailing list archive)
Series: Remove dependency on congestion_wait in mm/
On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.

The stereotypical "stalling on I/O" problem is to plug in one of the
crap USB drives you were given at a trade show and simply

	dd if=/dev/zero of=/dev/sdb
	sync

You can also set up qemu to have extremely slow I/O performance:
https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
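[Editor's sketch] For the qemu route, one hedged alternative to the qcow2 cache
misconfiguration described in the link is QEMU's built-in -drive throttling;
the image path and the 1 MB/s write cap below are placeholders, not anything
from this thread:

    # Hedged sketch: cap guest disk writes at ~1 MB/s so dirty pages in the
    # guest back up against a deliberately slow BDI. slow.qcow2 is a
    # placeholder image path.
    qemu-system-x86_64 -m 4096 -enable-kvm \
        -drive file=slow.qcow2,format=qcow2,if=virtio,throttling.bps-write=1048576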
On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
>
>	dd if=/dev/zero of=/dev/sdb
>	sync
>

The test machines are 1500KM away so plugging in a USB stick is not an
option but, worst comes to the worst, I could test it on a laptop.

I considered using the IO controller but I'm not sure that would throttle
background writeback. I dismissed doing this for a few reasons though --
the dirtying should be rate limited based on the speed of the BDI so it
will not necessarily trigger the condition. It also misses the other
interesting cases -- throttling due to excessive isolation and throttling
due to failing to make progress.

I've prototyped a synthetic case that uses 4..(NR_CPUS*4) workers. 1
worker measures mmap/munmap latency. 1 worker under fio is randomly
reading files. The remaining workers are split between fio doing random
write IO on separate files and anonymous memory hogs reading large
mappings every 5 seconds. The aggregate WSS is approximately totalmem*2
split between 60% anon and 40% file-backed (40% to be 2xdirty_ratio).
After a warmup period based on the writeback speed, it runs for 5 minutes
per number of workers.

The primary metric of "goodness" will be the mmap latency because it's
the smallest worker that should be able to make quick progress and I want
to see how much it is interfered with during reclaim. I'll be graphing
the throttling times to see what processes get throttled and for how long.

I was hoping though that there was a canonical realistic case that the FS
people use to stress the paths where the allocator fails to return
memory. While my synthetic workload *might* work to trigger the cases, I
would prefer to have something that can compare this basic approach with
anything that is more clever. Similarly, it would be nice to have a
reasonable test case that phase changes what memory is hot while there is
heavy IO in the background to detect whether the hot WSS is being
properly protected. I used to use memcached and a heavy writer to
simulate this but it's weak because there is no phase change so it's poor
at evaluating vmscan.

> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
>

Similar problem to the slow USB case, it's only catching one part of the
picture except now I have to worry about differences that are related to
the VM configuration (e.g. pinning virtual CPUs to physical CPUs and
replicating topology). Fine for a functional test, not so fine for
measuring if the patch is any good performance-wise.
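[Editor's sketch] A rough fio sketch of the reader/writer portion of the
workload described above; file sizes, runtime and the worker count are
placeholders, and the mmap-latency worker and anonymous memory hogs would be
separate programs that are not shown here:

    # Hedged sketch of the fio portion only: one random reader plus
    # NR_WRITERS random writers, each writing its own file. Sizes and the
    # 300s runtime are placeholders; the mmap latency worker and the anon
    # memory hogs are separate programs not sketched here.
    NR_WRITERS=$(( $(nproc) * 2 ))
    fio --name=reader --rw=randread --size=8G --ioengine=psync \
        --time_based --runtime=300 &
    fio --name=writers --rw=randwrite --size=4G --numjobs=$NR_WRITERS \
        --ioengine=psync --time_based --runtime=300 &
    wait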
On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > This has been lightly tested only and the testing was useless as the
> > > relevant code was not executed. The workload configurations I had that
> > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > to implement a new synthetic workload. If someone is aware of a realistic
> > > workload that forces reclaim activity to the point where reclaim stalls
> > > then kindly share the details.
> >
> > The stereotypical "stalling on I/O" problem is to plug in one of the
> > crap USB drives you were given at a trade show and simply
> >
> >	dd if=/dev/zero of=/dev/sdb
> >	sync
> >
>
> The test machines are 1500KM away so plugging in a USB stick is not an
> option but, worst comes to the worst, I could test it on a laptop.

There's a device mapper target dm-delay [1] that, as it says, delays the
reads and writes, so you could try to emulate the slow USB that way.

[1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html
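[Editor's sketch] A rough sketch of setting that up, following the example in
the dm-delay admin-guide document; the backing device /dev/sdX and the 500ms
delay are placeholders:

    # Hedged sketch from the dm-delay admin-guide: every read and write to
    # the placeholder device /dev/sdX is delayed by 500ms.
    echo "0 $(blockdev --getsz /dev/sdX) delay /dev/sdX 0 500" | dmsetup create delayed
    # The artificially slow device then appears at /dev/mapper/delayed and
    # can back a filesystem for the reclaim stress test.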
On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
>
>	dd if=/dev/zero of=/dev/sdb
>	sync
>
> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
>

Ok, I managed to get something working and nothing blew up. The workload
was similar to what I described except the dirty file data is related to
dirty_ratio, the memory hogs no longer sleep and I disabled the parallel
readers. There is still a configuration with the parallel readers but I
won't have the results till tomorrow.

Surprising no one, vanilla kernel throttling barely works.

      1 writeback_wait_iff_congested: usec_delayed=4000
      3 writeback_congestion_wait: usec_delayed=108000
    196 writeback_congestion_wait: usec_delayed=104000
  16697 writeback_wait_iff_congested: usec_delayed=0

too_many_isolated is not tracked at all so we don't know what that looks
like but kswapd "blocking" on dirty pages at the tail basically never
stalls. The few congestion_wait's that did happen stalled for the full
duration as the bdi is not tracking congestion at all.

With the series, the breakdown of reasons to stall was

    5703 reason=VMSCAN_THROTTLE_WRITEBACK
   29644 reason=VMSCAN_THROTTLE_NOPROGRESS
 1979999 reason=VMSCAN_THROTTLE_ISOLATED

kswapd stalls were rare but they did happen and surprise surprise, it was
dirty pages

     914 reason=VMSCAN_THROTTLE_WRITEBACK

All of them stalled for the full timeout so there might be a bug in
patch 1 because that sounds suspicious.

As "too many pages isolated" was the top reason, the frequency of each
stall time is as follows

      1 usect_delayed=164000
      1 usect_delayed=192000
      1 usect_delayed=200000
      1 usect_delayed=208000
      1 usect_delayed=220000
      1 usect_delayed=244000
      1 usect_delayed=308000
      1 usect_delayed=312000
      1 usect_delayed=316000
      1 usect_delayed=332000
      1 usect_delayed=588000
      1 usect_delayed=620000
      1 usect_delayed=836000
      3 usect_delayed=116000
      4 usect_delayed=124000
      4 usect_delayed=128000
      6 usect_delayed=120000
      9 usect_delayed=112000
     11 usect_delayed=100000
     13 usect_delayed=48000
     13 usect_delayed=96000
     14 usect_delayed=40000
     15 usect_delayed=88000
     15 usect_delayed=92000
     16 usect_delayed=80000
     18 usect_delayed=68000
     19 usect_delayed=76000
     22 usect_delayed=84000
     23 usect_delayed=108000
     23 usect_delayed=60000
     25 usect_delayed=44000
     25 usect_delayed=52000
     29 usect_delayed=36000
     30 usect_delayed=56000
     30 usect_delayed=64000
     33 usect_delayed=72000
     57 usect_delayed=32000
     91 usect_delayed=20000
    107 usect_delayed=24000
    125 usect_delayed=28000
    131 usect_delayed=16000
    180 usect_delayed=12000
    186 usect_delayed=8000
   1379 usect_delayed=104000
  16493 usect_delayed=4000
1960837 usect_delayed=0

In other words, the vast majority of stalls were for 0 time and the task
was immediately woken again. The next most common stall time was 1 tick
but a sizable number reached the full timeout.

Everything else is somewhere in between so the event trigger appears to
be ok. I don't know how the application itself performed as I still have
to write the analysis script and assuming I can look at this tomorrow,
I'll probably start with why VMSCAN_THROTTLE_WRITEBACK always stalled
for the full timeout.
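[Editor's sketch] A rough sketch of how stall-time counts like the above can be
pulled out of ftrace; the tracepoint name mm_vmscan_throttled is an assumption
about this series rather than something confirmed in the thread, and on newer
kernels the tracefs mount may be /sys/kernel/tracing instead:

    # Hedged sketch: enable the (assumed) vmscan throttle tracepoint, run the
    # workload, then bucket stall times the same way as the output above.
    cd /sys/kernel/debug/tracing
    echo 1 > events/vmscan/mm_vmscan_throttled/enable
    # ... run the reclaim-heavy workload here ...
    echo 0 > events/vmscan/mm_vmscan_throttled/enable
    grep -o 'usect_delayed=[0-9]*' trace | sort | uniq -c | sort -n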
On Mon, Sep 20, 2021 at 04:11:52PM +0200, David Sterba wrote:
> On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> > On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > > This has been lightly tested only and the testing was useless as the
> > > > relevant code was not executed. The workload configurations I had that
> > > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > > to implement a new synthetic workload. If someone is aware of a realistic
> > > > workload that forces reclaim activity to the point where reclaim stalls
> > > > then kindly share the details.
> > >
> > > The stereotypical "stalling on I/O" problem is to plug in one of the
> > > crap USB drives you were given at a trade show and simply
> > >
> > >	dd if=/dev/zero of=/dev/sdb
> > >	sync
> > >
> >
> > The test machines are 1500KM away so plugging in a USB stick is not an
> > option but, worst comes to the worst, I could test it on a laptop.
>
> There's a device mapper target dm-delay [1] that, as it says, delays the
> reads and writes, so you could try to emulate the slow USB that way.
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html

Ah, thanks for that tip. I wondered if something like this existed and
clearly did not search hard enough. I was able to reproduce the problem
without throttling but this could still be useful if examining cases
where there are 2 or more BDIs with variable speeds.
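[Editor's sketch] On the multiple-BDI point, a hedged sketch of emulating two
backing devices with very different speeds using the same dm-delay target;
/dev/sdX and /dev/sdY and the 50ms vs 1000ms delays are placeholders:

    # Hedged sketch: two dm-delay targets over placeholder devices give two
    # BDIs with very different per-I/O latencies (50ms vs 1000ms).
    echo "0 $(blockdev --getsz /dev/sdX) delay /dev/sdX 0 50"   | dmsetup create bdi-fast
    echo "0 $(blockdev --getsz /dev/sdY) delay /dev/sdY 0 1000" | dmsetup create bdi-slow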
On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> related.
>
> This is a prototype series that removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested. It's not a clever
> implementation but congestion_wait has been broken for a long time
> (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
> Even if it worked, it was never a great idea. While excessive
> dirty/writeback pages at the tail of the LRU is one possibility that
> reclaim may be slow, there is also the problem of too many pages being
> isolated and reclaim failing for other reasons (elevated references,
> too many pages isolated, excessive LRU contention etc).
>
> This series replaces the reclaim conditions with event driven ones
>
> o If there are too many dirty/writeback pages, sleep until a timeout
>   or enough pages get cleaned
> o If too many pages are isolated, sleep until enough isolated pages
>   are either reclaimed or put back on the LRU
> o If no progress is being made, let direct reclaim tasks sleep until
>   another task makes progress
>
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.

Got a git tree pointer so I can pull it into a test kernel so I can see
what impact it has on behaviour before I try to make sense of the code?

Cheers,

Dave.
On Wed, Sep 22, 2021 at 06:46:21AM +1000, Dave Chinner wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> > related.
> >
> > This is a prototype series that removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested. It's not a clever
> > implementation but congestion_wait has been broken for a long time
> > (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
> > Even if it worked, it was never a great idea. While excessive
> > dirty/writeback pages at the tail of the LRU is one possibility that
> > reclaim may be slow, there is also the problem of too many pages being
> > isolated and reclaim failing for other reasons (elevated references,
> > too many pages isolated, excessive LRU contention etc).
> >
> > This series replaces the reclaim conditions with event driven ones
> >
> > o If there are too many dirty/writeback pages, sleep until a timeout
> >   or enough pages get cleaned
> > o If too many pages are isolated, sleep until enough isolated pages
> >   are either reclaimed or put back on the LRU
> > o If no progress is being made, let direct reclaim tasks sleep until
> >   another task makes progress
> >
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> Got a git tree pointer so I can pull it into a test kernel so I can
> see what impact it has on behaviour before I try to make sense of
> the code?
>

The current version I'm testing is at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v2r5

Only one test has completed and I won't be able to analyse the results
in detail for a few days but it's doing *something* for the workload
that is hammering reclaim

                              5.15.0-rc1              5.15.0-rc1
                                 vanilla  mm-reclaimcongest-v2r5
Duration User                   10891.30                 9945.59
Duration System                  5673.78                 2649.43
Duration Elapsed                 2402.85                 2407.96

System CPU usage dropped by a lot. The workload runs for a fixed
duration so a difference in elapsed time is not interesting.

Ops Direct pages scanned     518791317.00            219956338.00
Ops Kswapd pages scanned     128555233.00            165439373.00
Ops Kswapd pages reclaimed    87830801.00             72216420.00
Ops Direct pages reclaimed    16114049.00             10408389.00
Ops Kswapd efficiency %             68.32                   43.65
Ops Kswapd velocity              53501.15                68705.20
Ops Direct efficiency %              3.11                    4.73
Ops Direct velocity             215906.66                 91345.5
Ops Percentage direct scans         80.14                   57.07
Ops Page writes by reclaim     4225921.00              2032865.00

Large reductions in direct pages scanned. The rate kswapd scans is
roughly the same (velocity) whereas direct velocity is down (presumably
because it's getting throttled). Pages written from reclaim context are
about halved. Kswapd scan rates are increased slightly but probably
because direct reclaimers throttled. Reclaim efficiency is low but
that's expected given the workload is basically trying to make it as
hard as possible for reclaim to make progress.
Kswapd is only getting throttled on writeback and is being woken before
the timeout of 100000

      1 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
     17 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
    129 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
    205 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK

The number of throttle events for direct reclaimers were

  16909 reason=VMSCAN_THROTTLE_ISOLATED
  77844 reason=VMSCAN_THROTTLE_NOPROGRESS
 113415 reason=VMSCAN_THROTTLE_WRITEBACK

For the throttle events, 33% of them were NOPROGRESS hitting the full
timeout and 33% were WRITEBACK hitting the full timeout. If anything,
that would suggest increasing the max timeout as presumably they woke
up uselessly like Neil had suggested.
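[Editor's sketch] Going back to the git tree pointer at the top of this reply,
a minimal sketch of pulling the branch into an existing kernel checkout for
testing; the remote name "mel" is arbitrary, only the URL and branch name come
from the mail above:

    # Minimal sketch: fetch the posted branch into a local kernel tree.
    git remote add mel git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
    git fetch mel mm-reclaimcongest-v2r5
    git checkout -b mm-reclaimcongest-v2r5 mel/mm-reclaimcongest-v2r5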