[RFC,0/2] Add predictive memory reclamation and compaction

Message ID 20190813014012.30232-1-khalid.aziz@oracle.com (mailing list archive)

Khalid Aziz Aug. 13, 2019, 1:40 a.m. UTC
Page reclamation and compaction are triggered in response to reaching the
low watermark. This makes reclamation/compaction reactive, based upon a
snapshot of the system at a point in time. When that point is reached, the
system is already suffering from a free memory shortage and must now try
to recover. Recovery can often land the system in the direct
reclamation/compaction path, and while recovery happens, workloads start
to experience unpredictable memory allocation latencies. In real life,
forced direct reclamation has been seen to cause a sudden spike in the
time it takes to populate a new database, or extraordinary, unpredictable
latency in launching a new server on a cloud platform. These events create
SLA violations which are expensive for businesses.

If the kernel could foresee a potential free page exhaustion or
fragmentation event well before it happens, it could start reclamation
proactively instead and avoid allocation stalls. A time-based trend line
for available free pages can reveal such potential future events by
charting the current memory consumption trend on the system.

These patches propose a way to capture enough memory usage information
to compute a trend line based upon the most recent data. The trend line is
graphed with the x-axis showing time and the y-axis showing the number of
free pages. The proposal is to capture the number of free pages at opportune
moments along with the current timestamp. Once the system has enough data
points (the lookback window for trend analysis), fit a line of the form
y=mx+c to these points using the least squares regression method. As time
advances, these points can be updated with new data points and a new
best fit line can be computed. Capturing these data points and computing a
trend line for pages of order 0-MAX_ORDER allows us to foresee not only
the free page exhaustion point but also severe fragmentation points in the
future.

If the line representing the trend for total free pages has a negative
slope (hence trending downward), solving y=mx+c for x with y=0 tells us at
what point the system would run out of free pages if the current trend
continues. If the average rate of page reclamation is computed by observing
page reclamation behavior, that information can be used to compute the
time at which to start reclamation so that the number of free pages does
not fall to 0 or below the low watermark if the current memory consumption
trend were to continue.
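As a rough illustration of this mechanism, here is a userspace C sketch of
a sliding lookback window with a least squares fit and the y=0 solve. All
names and the window size are assumptions for illustration only; the actual
implementation lives in mm/lsq.c of patch 1 and, being kernel code, cannot
use floating point the way this sketch does.

```c
/*
 * Illustrative sketch (not mm/lsq.c): least squares fit of y = m*x + c
 * over a fixed lookback window of (time, free pages) samples, then
 * solve y = 0 for the projected free page exhaustion time.
 */
#define LOOKBACK_NPOINTS 8	/* assumed window size */

struct lsq_state {
	double x[LOOKBACK_NPOINTS];	/* timestamps */
	double y[LOOKBACK_NPOINTS];	/* free page counts */
	int next;			/* next slot to overwrite (circular) */
	int nr;				/* samples collected so far */
};

/* Add one sample, overwriting the oldest once the window is full. */
static void lsq_add_point(struct lsq_state *s, double t, double free_pages)
{
	s->x[s->next] = t;
	s->y[s->next] = free_pages;
	s->next = (s->next + 1) % LOOKBACK_NPOINTS;
	if (s->nr < LOOKBACK_NPOINTS)
		s->nr++;
}

/*
 * Least squares regression: m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx*Sx),
 * c = (Sy - m*Sx) / n.  Returns 0 on success, -1 if the window is not
 * yet full or the points are degenerate.
 */
static int lsq_fit(const struct lsq_state *s, double *m, double *c)
{
	double sx = 0, sy = 0, sxx = 0, sxy = 0;
	int i, n = s->nr;

	if (n < LOOKBACK_NPOINTS)
		return -1;
	for (i = 0; i < n; i++) {
		sx += s->x[i];
		sy += s->y[i];
		sxx += s->x[i] * s->x[i];
		sxy += s->x[i] * s->y[i];
	}
	if (n * sxx - sx * sx == 0)
		return -1;
	*m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
	*c = (sy - *m * sx) / n;
	return 0;
}

/* Solve y = m*x + c for y == 0: the projected exhaustion time. */
static double lsq_zero_crossing(double m, double c)
{
	return -c / m;
}
```

Feeding this window samples that fall on y = 1000 - 10t, for example, yields
m = -10, c = 1000 and a projected exhaustion time of t = 100.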

Similarly, if the kernel tracks the level of fragmentation for each page
order (which can be done by computing the number of free pages below that
order), a trend line for each order can be used to compute the point in
time when no more pages of that order will be available for allocation.
If the trend line represents the number of unusable pages for that order,
the intersection of this line with the line representing the number of free
pages is the point of 100% fragmentation. This holds true because at
this intersection point all free pages are of lower order. The intersection
point of two lines y0=m0x+c0 and y1=m1x+c1 can be computed
mathematically, which yields the x and y coordinates on the time and free
pages graph. If the average rate of compaction is computed by timing
previous compaction runs, the kernel can compute how soon it needs to start
compaction to avoid this 100% fragmentation point.
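The intersection itself reduces to one division. A hedged sketch (the
function name is an assumption; the patch series does this in mm/lsq.c
using kernel-safe arithmetic rather than floating point):

```c
/*
 * Intersection of the free pages trend line y = m0*x + c0 with the
 * unusable pages trend line y = m1*x + c1 for a given order: solving
 * m0*x + c0 = m1*x + c1 gives x = (c1 - c0) / (m0 - m1).  That x is
 * the projected time of 100% fragmentation for the order.  Returns 0
 * on success, -1 if the lines are parallel and never intersect.
 */
static int frag_crossing_time(double m0, double c0,
			      double m1, double c1, double *x)
{
	if (m0 == m1)
		return -1;
	*x = (c1 - c0) / (m0 - m1);
	return 0;
}
```

For example, free pages trending as y = 100 - 2t and unusable pages as
y = 10 + t intersect at t = 30, the projected 100% fragmentation point.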

Patch 1 adds code to maintain a sliding lookback window of (time, number
of free pages) points which can be updated continuously and adds code to
compute best fit line across these points. It also adds code to use the
best fit lines to determine if kernel must start reclamation or
compaction.

Patch 2 adds code to collect data points on free pages of various orders
at different points in time, uses code in patch 1 to update sliding
lookback window with these points and kicks off reclamation or
compaction based upon the results it gets.

Patch 1 maintains a fixed size lookback window. A fixed size lookback
window limits the amount of data that has to be maintained to compute a
best fit line. The routine mem_predict() in patch 1 uses the best fit line
to determine the immediate need for reclamation or compaction. To simplify
the initial concept implementation, it uses a fixed time threshold for when
compaction should start in anticipation of impending fragmentation.
Similarly, it uses a fixed minimum percentage of free pages as the
criterion to determine if it is time to start reclamation when the current
trend line shows a continued drop in the number of free pages. Both of
these criteria can be improved upon in the final implementation by taking
the rate of compaction and the rate of reclamation into account.
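For illustration, the two fixed criteria could look roughly like this
(a userspace C sketch; the names and constants are assumptions, and the
actual checks in mem_predict() in patch 1 may differ):

```c
#include <stdbool.h>

/*
 * Illustrative constants, not taken from the patch: how long before the
 * projected 100% fragmentation point compaction should begin, and the
 * minimum acceptable percentage of free pages.
 */
#define COMPACT_LEAD_TIME	300.0	/* same units as the trend x-axis */
#define MIN_FREE_PCT		10.0

/* Start compaction if the projected 100% fragmentation point is near. */
static bool should_compact(double now, double frag_crossing)
{
	return frag_crossing - now <= COMPACT_LEAD_TIME;
}

/*
 * Start reclamation if the best fit line slopes down (slope m < 0) and
 * free pages have dropped below the fixed minimum percentage.
 */
static bool should_reclaim(double m, double free_pages, double total_pages)
{
	return m < 0 && (free_pages * 100.0 / total_pages) < MIN_FREE_PCT;
}
```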

Patch 2 collects data points for the best fit line in kswapd before we
decide if kswapd should go to sleep or continue reclamation. It then
uses that data to delay kswapd from sleeping and continue reclamation.
Potential fragmentation information obtained from the best fit line is used
to decide if the zone watermark should be boosted to avert impending
fragmentation. This data is also used in balance_pgdat() to determine if
kcompactd should be woken up to start compaction.
get_page_from_freelist() might be a better place to gather data points
and make the decision on starting reclamation or compaction, but it can
also impact page allocation latency. Another possibility is to create a
separate kernel thread that gathers page usage data periodically and
wakes up kswapd or kcompactd as needed based upon trend analysis. This
is something that can be finalized before the final implementation of this
proposal.

The impact of this implementation was measured using two sets of tests.
The first test consists of three concurrent dd processes writing large
amounts of data (66 GB, 131 GB and 262 GB) to three different SSDs,
causing a large number of free pages to be used up for buffer/page cache.
The number of cumulative allocation stalls as reported by /proc/vmstat was
recorded for 5 runs of this test.

5.3-rc2
-------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 15
allocstall_movable 1629
compact_stall 0

Total = 1644


5.3-rc2 + this patch series
---------------------------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 182
allocstall_movable 1266
compact_stall 0

Total = 1544

There was no significant change in system time between these runs. This
is a ~6% reduction (1644 → 1544) in the number of allocation stalls.

A second test used was the parallel dd test from mmtests. The average
number of stalls over 4 runs with the unpatched 5.3-rc2 kernel was 6057.
The average number of stalls over 4 runs after applying these patches was
5584. This is an ~8% improvement in the number of allocation stalls.

This work is complementary to other allocation/compaction stall
improvements. It attempts to address potential stalls proactively before
they happen and will make use of any improvements made to the
reclamation/compaction code.

Any feedback on this proposal and associated implementation will be
greatly appreciated. This is work in progress.

Khalid Aziz (2):
  mm: Add trend based prediction algorithm for memory usage
  mm/vmscan: Add fragmentation prediction to kswapd

 include/linux/mmzone.h |  72 +++++++++++
 mm/Makefile            |   2 +-
 mm/lsq.c               | 273 +++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |  27 ----
 mm/vmscan.c            | 116 ++++++++++++++++-
 5 files changed, 456 insertions(+), 34 deletions(-)
 create mode 100644 mm/lsq.c

Comments

Michal Hocko Aug. 13, 2019, 2:05 p.m. UTC | #1
On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
[...]
> Patch 1 adds code to maintain a sliding lookback window of (time, number
> of free pages) points which can be updated continuously and adds code to
> compute best fit line across these points. It also adds code to use the
> best fit lines to determine if kernel must start reclamation or
> compaction.
> 
> Patch 2 adds code to collect data points on free pages of various orders
> at different points in time, uses code in patch 1 to update sliding
> lookback window with these points and kicks off reclamation or
> compaction based upon the results it gets.

An important piece of information missing in your description is why
do we need to keep that logic in the kernel. In other words, we have
the background reclaim that acts on a wmark range and those are tunable
from the userspace. The primary point of this background reclaim is to
keep balance and prevent from direct reclaim. Why cannot you implement
this or any other dynamic trend watching watchdog and tune watermarks
accordingly? Something similar applies to kcompactd although we might be
lacking a good interface.
Khalid Aziz Aug. 13, 2019, 3:20 p.m. UTC | #2
On 8/13/19 8:05 AM, Michal Hocko wrote:
> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
> [...]
>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>> of free pages) points which can be updated continuously and adds code to
>> compute best fit line across these points. It also adds code to use the
>> best fit lines to determine if kernel must start reclamation or
>> compaction.
>>
>> Patch 2 adds code to collect data points on free pages of various orders
>> at different points in time, uses code in patch 1 to update sliding
>> lookback window with these points and kicks off reclamation or
>> compaction based upon the results it gets.
> 
> An important piece of information missing in your description is why
> do we need to keep that logic in the kernel. In other words, we have
> the background reclaim that acts on a wmark range and those are tunable
> from the userspace. The primary point of this background reclaim is to
> keep balance and prevent from direct reclaim. Why cannot you implement
> this or any other dynamic trend watching watchdog and tune watermarks
> accordingly? Something similar applies to kcompactd although we might be
> lacking a good interface.
> 

Hi Michal,

That is a very good question. As a matter of fact the initial prototype
to assess the feasibility of this approach was written in userspace for
a very limited application. We wrote the initial prototype to monitor
fragmentation and used /sys/devices/system/node/node*/compact to trigger
compaction. The prototype demonstrated this approach has merits.

The primary reason to implement this logic in the kernel is to make the
kernel self-tuning. The more knobs we have externally, the more complex
it becomes to tune the kernel externally. If we can make the kernel
self-tuning, we can actually eliminate external knobs and simplify
kernel admin. In spite of the availability of tuning knobs and a large
number of tuning guides for databases and cloud platforms, allocation
stalls are a routinely occurring problem on customer deployments. A best
fit line algorithm has a negligible impact on system performance yet
provides measurable improvement and room for further refinement. Makes
sense?

Thanks,
Khalid
Michal Hocko Aug. 14, 2019, 8:58 a.m. UTC | #3
On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
> On 8/13/19 8:05 AM, Michal Hocko wrote:
> > On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
> > [...]
> >> Patch 1 adds code to maintain a sliding lookback window of (time, number
> >> of free pages) points which can be updated continuously and adds code to
> >> compute best fit line across these points. It also adds code to use the
> >> best fit lines to determine if kernel must start reclamation or
> >> compaction.
> >>
> >> Patch 2 adds code to collect data points on free pages of various orders
> >> at different points in time, uses code in patch 1 to update sliding
> >> lookback window with these points and kicks off reclamation or
> >> compaction based upon the results it gets.
> > 
> > An important piece of information missing in your description is why
> > do we need to keep that logic in the kernel. In other words, we have
> > the background reclaim that acts on a wmark range and those are tunable
> > from the userspace. The primary point of this background reclaim is to
> > keep balance and prevent from direct reclaim. Why cannot you implement
> > this or any other dynamic trend watching watchdog and tune watermarks
> > accordingly? Something similar applies to kcompactd although we might be
> > lacking a good interface.
> > 
> 
> Hi Michal,
> 
> That is a very good question. As a matter of fact the initial prototype
> to assess the feasibility of this approach was written in userspace for
> a very limited application. We wrote the initial prototype to monitor
> fragmentation and used /sys/devices/system/node/node*/compact to trigger
> compaction. The prototype demonstrated this approach has merits.
> 
> The primary reason to implement this logic in the kernel is to make the
> kernel self-tuning.

What makes this particular self-tuning a universal win? In other words,
there are many ways to analyze the memory pressure and feed it back
that I can think of. It is quite likely that very specific workloads
would have very specific demands there. I have seen cases where a
trivial increase of min_free_kbytes to a normally insane value worked
really great for a DB workload, for example, because the wasted memory
didn't matter.

> The more knobs we have externally, the more complex
> it becomes to tune the kernel externally.

I agree on this point. Is the current set of tuning sufficient? What
would be missing if not?
Khalid Aziz Aug. 15, 2019, 4:27 p.m. UTC | #4
On 8/14/19 2:58 AM, Michal Hocko wrote:
> On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
>> On 8/13/19 8:05 AM, Michal Hocko wrote:
>>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
>>> [...]
>>>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>>>> of free pages) points which can be updated continuously and adds code to
>>>> compute best fit line across these points. It also adds code to use the
>>>> best fit lines to determine if kernel must start reclamation or
>>>> compaction.
>>>>
>>>> Patch 2 adds code to collect data points on free pages of various orders
>>>> at different points in time, uses code in patch 1 to update sliding
>>>> lookback window with these points and kicks off reclamation or
>>>> compaction based upon the results it gets.
>>>
>>> An important piece of information missing in your description is why
>>> do we need to keep that logic in the kernel. In other words, we have
>>> the background reclaim that acts on a wmark range and those are tunable
>>> from the userspace. The primary point of this background reclaim is to
>>> keep balance and prevent from direct reclaim. Why cannot you implement
>>> this or any other dynamic trend watching watchdog and tune watermarks
>>> accordingly? Something similar applies to kcompactd although we might be
>>> lacking a good interface.
>>>
>>
>> Hi Michal,
>>
>> That is a very good question. As a matter of fact the initial prototype
>> to assess the feasibility of this approach was written in userspace for
>> a very limited application. We wrote the initial prototype to monitor
>> fragmentation and used /sys/devices/system/node/node*/compact to trigger
>> compaction. The prototype demonstrated this approach has merits.
>>
>> The primary reason to implement this logic in the kernel is to make the
>> kernel self-tuning.
> 
> What makes this particular self-tuning an universal win? In other words
> there are many ways to analyze the memory pressure and feedback it back
> that I can think of. It is quite likely that very specific workloads
> would have very specific demands there. I have seen cases where are
> trivial increase of min_free_kbytes to normally insane value worked
> really great for a DB workload because the wasted memory didn't matter
> for example.

Hi Michal,

The problem is not so much whether we have enough knobs available, but
rather how we tweak them dynamically to avoid allocation stalls. Knobs
like watermarks and min_free_kbytes are typically set once and left alone.
Allocation stalls show up even on a much smaller scale than large DB or
cloud platforms. I have seen them on a desktop class machine running a few
services in the background. The desktop is running gnome3; I would lock
the screen and come back to unlock it a day or two later. In that time
most of memory has been consumed by buffer/page cache. Just unlocking the
screen can take 30+ seconds while the system reclaims pages to be able to
swap back in all the processes that were inactive so far.

It is true that different workloads will have different requirements, and
that is what I am attempting to address here. Instead of tweaking the
knobs statically based upon one workload's requirements, I am looking at
the trend of memory consumption instead. A best fit line showing the
recent trend can be quite indicative of what the workload is doing in
terms of memory. For instance, a cloud server might be running a certain
number of instances for a few days and it can end up using any memory not
used up by tasks for buffer/page cache. Now the sys admin gets a request
to launch another instance and when they try to do that, the system starts
to allocate pages and soon runs out of free pages. We are now in the
direct reclaim path and it can take a significant amount of time to find
all the free pages the new task needs. If the kernel were watching the
memory consumption trend instead, it could see that the trend line shows a
complete exhaustion of free pages or 100% fragmentation in the near
future, irrespective of what the workload is. This allows the kernel to
start reclamation/compaction before we actually hit the point of complete
free page exhaustion or fragmentation. This could avoid direct
reclamation/compaction or at least cut down its severity enough. That is
what makes it a win in a large number of cases. The least squares
algorithm is lightweight enough not to add to system load or complexity.
If you have come across a better algorithm, I would certainly look into
using it.

> 
>> The more knobs we have externally, the more complex
>> it becomes to tune the kernel externally.
> 
> I agree on this point. Is the current set of tunning sufficient? What
> would be missing if not?
> 

We have a knob available to force compaction immediately. That is
helpful, and in some cases sys admins have resorted to forcing compaction
on all zones before launching a new cloud instance or loading a new
database. Some admins have resorted to using /proc/sys/vm/drop_caches to
force buffer/page cache pages to be freed up. Either of these solutions
causes system load to go up immediately while kswapd/kcompactd run to free
up and compact pages. This is far from ideal. Other knobs available seem
to be hard to set correctly, especially on servers that run mixed
workloads, which results in a regular stream of customer complaints coming
in about the system stalling at the most inopportune times.

I appreciate this discussion. This is how we can get to a solution that
actually works.

Thanks,
Khalid
Michal Hocko Aug. 15, 2019, 5:02 p.m. UTC | #5
On Thu 15-08-19 10:27:26, Khalid Aziz wrote:
> On 8/14/19 2:58 AM, Michal Hocko wrote:
> > On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
> >> On 8/13/19 8:05 AM, Michal Hocko wrote:
> >>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
> >>> [...]
> >>>> Patch 1 adds code to maintain a sliding lookback window of (time, number
> >>>> of free pages) points which can be updated continuously and adds code to
> >>>> compute best fit line across these points. It also adds code to use the
> >>>> best fit lines to determine if kernel must start reclamation or
> >>>> compaction.
> >>>>
> >>>> Patch 2 adds code to collect data points on free pages of various orders
> >>>> at different points in time, uses code in patch 1 to update sliding
> >>>> lookback window with these points and kicks off reclamation or
> >>>> compaction based upon the results it gets.
> >>>
> >>> An important piece of information missing in your description is why
> >>> do we need to keep that logic in the kernel. In other words, we have
> >>> the background reclaim that acts on a wmark range and those are tunable
> >>> from the userspace. The primary point of this background reclaim is to
> >>> keep balance and prevent from direct reclaim. Why cannot you implement
> >>> this or any other dynamic trend watching watchdog and tune watermarks
> >>> accordingly? Something similar applies to kcompactd although we might be
> >>> lacking a good interface.
> >>>
> >>
> >> Hi Michal,
> >>
> >> That is a very good question. As a matter of fact the initial prototype
> >> to assess the feasibility of this approach was written in userspace for
> >> a very limited application. We wrote the initial prototype to monitor
> >> fragmentation and used /sys/devices/system/node/node*/compact to trigger
> >> compaction. The prototype demonstrated this approach has merits.
> >>
> >> The primary reason to implement this logic in the kernel is to make the
> >> kernel self-tuning.
> > 
> > What makes this particular self-tuning an universal win? In other words
> > there are many ways to analyze the memory pressure and feedback it back
> > that I can think of. It is quite likely that very specific workloads
> > would have very specific demands there. I have seen cases where are
> > trivial increase of min_free_kbytes to normally insane value worked
> > really great for a DB workload because the wasted memory didn't matter
> > for example.
> 
> Hi Michal,
> 
> The problem is not so much as do we have enough knobs available, rather
> how do we tweak them dynamically to avoid allocation stalls. Knobs like
> watermarks and min_free_kbytes are set once typically and left alone.

Does anything prevent from tuning these knobs more dynamically based on
already exported metrics?

> Allocation stalls show up even on much smaller scale than large DB or
> cloud platforms. I have seen it on a desktop class machine running a few
> services in the background. Desktop is running gnome3, I would lock the
> screen and come back to unlock it a day or two later. In that time most
> of memory has been consumed by buffer/page cache. Just unlocking the
> screen can take 30+ seconds while system reclaims pages to be able swap
> back in all the processes that were inactive so far.

This sounds like a bug to me.

> It is true different workloads will have different requirements and that
> is what I am attempting to address here. Instead of tweaking the knobs
> statically based upon one workload requirements, I am looking at the
> trend of memory consumption instead. A best fit line showing recent
> trend can be quite indicative of what the workload is doing in terms of
> memory.

Is there anything preventing from following that trend from the
userspace and trigger background reclaim earlier to not even get to the
direct reclaim though?

> For instance, a cloud server might be running a certain number
> of instances for a few days and it can end up using any memory not used
> up by tasks, for buffer/page cache. Now the sys admin gets a request to
> launch another instance and when they try to to do that, system starts
> to allocate pages and soon runs out of free pages. We are now in direct
> reclaim path and it can take significant amount of time to find all free
> pages the new task needs. If the kernel were watching the memory
> consumption trend instead, it could see that the trend line shows a
> complete exhaustion of free pages or 100% fragmentation in near future,
> irrespective of what the workload is.

I am confused now. How can an unpredictable action (like sys admin
starting a new workload) be handled by watching a memory consumption
history trend? From the above description I would expect that the system
would be in a balanced state for few days when a new instance is
launched. The only reasonable thing to do then is to trigger the reclaim
before the workload is spawned but then what is the actual difference
between direct reclaim and an early reclaim?

[...]
> > I agree on this point. Is the current set of tunning sufficient? What
> > would be missing if not?
> > 
> 
> We have knob available to force compaction immediately. That is helpful
> and in some case, sys admins have resorted to forcing compaction on all
> zones before launching a new cloud instance or loading a new database.
> Some admins have resorted to using /proc/sys/vm/drop_caches to force
> buffer/page cache pages to be freed up. Either of these solutions causes
> system load to go up immediately while kswapd/kcompactd run to free up
> and compact pages. This is far from ideal. Other knobs available seem to
> be hard to set correctly especially on servers that run mixed workloads
> which results in a regular stream of customer complaints coming in about
> system stalling at most inopportune times.

Then let's talk about what is missing in the existing tuning we already
provide. I do agree that compaction needs some love but I am under the
impression that min_free_kbytes and watermark_*_factor should give a
decent abstraction to control the background reclaim. If that is not the
case then I am really interested in examples because I might be easily
missing something there.

Thanks!
Khalid Aziz Aug. 15, 2019, 8:51 p.m. UTC | #6
On 8/15/19 11:02 AM, Michal Hocko wrote:
> On Thu 15-08-19 10:27:26, Khalid Aziz wrote:
>> On 8/14/19 2:58 AM, Michal Hocko wrote:
>>> On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
>>>> On 8/13/19 8:05 AM, Michal Hocko wrote:
>>>>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
>>>>> [...]
>>>>>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>>>>>> of free pages) points which can be updated continuously and adds code to
>>>>>> compute best fit line across these points. It also adds code to use the
>>>>>> best fit lines to determine if kernel must start reclamation or
>>>>>> compaction.
>>>>>>
>>>>>> Patch 2 adds code to collect data points on free pages of various orders
>>>>>> at different points in time, uses code in patch 1 to update sliding
>>>>>> lookback window with these points and kicks off reclamation or
>>>>>> compaction based upon the results it gets.
>>>>>
>>>>> An important piece of information missing in your description is why
>>>>> do we need to keep that logic in the kernel. In other words, we have
>>>>> the background reclaim that acts on a wmark range and those are tunable
>>>>> from the userspace. The primary point of this background reclaim is to
>>>>> keep balance and prevent from direct reclaim. Why cannot you implement
>>>>> this or any other dynamic trend watching watchdog and tune watermarks
>>>>> accordingly? Something similar applies to kcompactd although we might be
>>>>> lacking a good interface.
>>>>>
>>>>
>>>> Hi Michal,
>>>>
>>>> That is a very good question. As a matter of fact the initial prototype
>>>> to assess the feasibility of this approach was written in userspace for
>>>> a very limited application. We wrote the initial prototype to monitor
>>>> fragmentation and used /sys/devices/system/node/node*/compact to trigger
>>>> compaction. The prototype demonstrated this approach has merits.
>>>>
>>>> The primary reason to implement this logic in the kernel is to make the
>>>> kernel self-tuning.
>>>
>>> What makes this particular self-tuning an universal win? In other words
>>> there are many ways to analyze the memory pressure and feedback it back
>>> that I can think of. It is quite likely that very specific workloads
>>> would have very specific demands there. I have seen cases where are
>>> trivial increase of min_free_kbytes to normally insane value worked
>>> really great for a DB workload because the wasted memory didn't matter
>>> for example.
>>
>> Hi Michal,
>>
>> The problem is not so much as do we have enough knobs available, rather
>> how do we tweak them dynamically to avoid allocation stalls. Knobs like
>> watermarks and min_free_kbytes are set once typically and left alone.
> 
> Does anything prevent from tuning these knobs more dynamically based on
> already exported metrics?

Hi Michal,

The smarts for tuning these knobs can be implemented in userspace, and
more knobs added to allow for what is missing today, but we get back to
the same issue as before. That does nothing to make the kernel
self-tuning, and it adds possibly even more knobs to userspace. Something
so fundamental to kernel memory management as making free pages available
when they are needed really should be taken care of in the kernel itself.
Moving it to userspace just means the kernel is hobbled unless one
installs and tunes a userspace package correctly.

> 
>> Allocation stalls show up even on much smaller scale than large DB or
>> cloud platforms. I have seen it on a desktop class machine running a few
>> services in the background. Desktop is running gnome3, I would lock the
>> screen and come back to unlock it a day or two later. In that time most
>> of memory has been consumed by buffer/page cache. Just unlocking the
>> screen can take 30+ seconds while system reclaims pages to be able swap
>> back in all the processes that were inactive so far.
> 
> This sounds like a bug to me.

Quite possibly. I had seen that behavior with 4.17, 4.18 and 4.19
kernels. I then just moved enough tasks off of my machine to other
machines to make the problem go away. So I can't say if the problem has
persisted past 4.19.

> 
>> It is true different workloads will have different requirements and that
>> is what I am attempting to address here. Instead of tweaking the knobs
>> statically based upon one workload requirements, I am looking at the
>> trend of memory consumption instead. A best fit line showing recent
>> trend can be quite indicative of what the workload is doing in terms of
>> memory.
> 
> Is there anything preventing from following that trend from the
> userspace and trigger background reclaim earlier to not even get to the
> direct reclaim though?

It is possible to do that in userspace for compaction. We will need a
smaller hammer than drop_caches to do the same for reclamation. This
still makes the kernel dependent upon a properly configured userspace
program for something as fundamental as free page management.
That does not sound like a good situation. Allocation stalls have been a
problem for many years (I could find a patch from as far back as 2002
attempting to address allocation stalls). More tuning knobs have been a
temporary solution at best, since workloads and storage technology keep
changing and processors keep getting faster overall.

> 
>> For instance, a cloud server might be running a certain number
>> of instances for a few days and it can end up using any memory not used
>> up by tasks, for buffer/page cache. Now the sys admin gets a request to
>> launch another instance and when they try to to do that, system starts
>> to allocate pages and soon runs out of free pages. We are now in direct
>> reclaim path and it can take significant amount of time to find all free
>> pages the new task needs. If the kernel were watching the memory
>> consumption trend instead, it could see that the trend line shows a
>> complete exhaustion of free pages or 100% fragmentation in near future,
>> irrespective of what the workload is.
> 
> I am confused now. How can an unpredictable action (like sys admin
> starting a new workload) be handled by watching a memory consumption
> history trend? From the above description I would expect that the system
> would be in a balanced state for few days when a new instance is
> launched. The only reasonable thing to do then is to trigger the reclaim
> before the workload is spawned but then what is the actual difference
> between direct reclaim and an early reclaim?

If the kernel watches the trend far enough ahead, it can start
reclaiming/compacting well in advance and keep direct reclamation at bay
even if there is a sudden surge in memory demand. A pathological case of
userspace suddenly demanding hundreds of GB of memory in one request is
always difficult to tackle. For such cases, triggering
reclamation/compaction and waiting to launch the new process until enough
free pages are available might be the only solution. A more typical case
is a continuous stream of page allocations until a database is
fully populated or a new server instance is launched. It is like a
bucket with a hole. We can wait to start filling it until the water gets
very low, or notice that the hole at the bottom has been unplugged
and water is draining fast, and start filling it before the water gets too
low. If we have been observing how fast the bucket fills up with no leak
and how fast the current drain is, we can start filling far enough in
advance that the water never gets too low. That is what I referred to as
improvements to the current patch, i.e. track the current
reclamation/compaction rates in kswapd and kcompactd and use those rates
to determine how far in advance we start reclaiming/compacting.
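To make the bucket analogy concrete, here is a minimal userspace sketch of the least-square trend line described in the cover letter (y = mx + c over a lookback window of (timestamp, free pages) samples). The function names and window handling are illustrative, not code from the patch:

```python
def fit_trend(samples):
    """Least-squares fit of y = m*x + c over (timestamp, free_pages) samples."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(f for _, f in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * f for t, f in samples)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None, None  # degenerate window, cannot fit a line
    m = (n * sxy - sx * sy) / denom
    c = (sy - m * sx) / n
    return m, c

def seconds_until_low(samples, low_watermark):
    """Predict when free pages cross the low watermark; None if not falling."""
    m, c = fit_trend(samples)
    if m is None or m >= 0:
        return None  # free memory flat or rising; no exhaustion foreseen
    t_cross = (low_watermark - c) / m
    now = samples[-1][0]
    return max(0.0, t_cross - now)
```

Fed with samples showing a steady drain, `seconds_until_low()` gives the lead time available for waking kswapd/kcompactd before the low watermark is actually hit.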

> 
> [...]
>>> I agree on this point. Is the current set of tunning sufficient? What
>>> would be missing if not?
>>>
>>
>> We have knob available to force compaction immediately. That is helpful
>> and in some case, sys admins have resorted to forcing compaction on all
>> zones before launching a new cloud instance or loading a new database.
>> Some admins have resorted to using /proc/sys/vm/drop_caches to force
>> buffer/page cache pages to be freed up. Either of these solutions causes
>> system load to go up immediately while kswapd/kcompactd run to free up
>> and compact pages. This is far from ideal. Other knobs available seem to
>> be hard to set correctly especially on servers that run mixed workloads
>> which results in a regular stream of customer complaints coming in about
>> system stalling at most inopportune times.
> 
> Then let's talk about what is missing in the existing tuning we already
> provide. I do agree that compaction needs some love but I am under
> impression that min_free_kbytes and watermark_*_factor should give a
> decent abstraction to control the background reclaim. If that is not the
> case then I am really interested on examples because I might be easily
> missing something there.

Just last week an email crossed my mailbox where an order-4 allocation
failed on a server with 768 GB of memory that had 355,000 free pages of
order 2 and lower available at the time. That allocation failure brought
down an important service and was a significant disruption.

These knobs do give some control to userspace, but their values depend
upon the workload and it is easy to set them wrong. Finding the right
values is not easy for servers that run mixed workloads. So it is not
that there are not enough knobs, or that we cannot add more. The
question is whether that is the right direction to go, or whether we make
the kernel self-tuning and give it the capability to deal with these
issues without requiring sys admins to determine correct values for
these knobs for every new workload.

Thanks,
Khalid
Michal Hocko Aug. 21, 2019, 2:06 p.m. UTC | #7
On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
> Hi Michal,
> 
> The smarts for tuning these knobs can be implemented in userspace and
> more knobs added to allow for what is missing today, but we get back to
> the same issue as before. That does nothing to make kernel self-tuning
> and adds possibly even more knobs to userspace. Something so fundamental
> to kernel memory management as making free pages available when they are
> needed really should be taken care of in the kernel itself. Moving it to
> userspace just means the kernel is hobbled unless one installs and tunes
> a userspace package correctly.

From my past experience the existing autotuning works mostly ok for a
vast variety of workloads. A more clever tuning is possible and people
are doing that already, especially for cases when the machine is heavily
overcommitted. There are different ways to achieve that. Your new
in-kernel auto tuning would have to be tested on a large variety of
workloads to be proven and riskless. So I am quite skeptical, to be
honest.

Therefore I would really focus on discussing whether we have sufficient
APIs to tune the kernel to do the right thing when needed. That requires
identifying the gaps in that area.
Bharath Vedartham Aug. 26, 2019, 8:44 p.m. UTC | #8
Hi Michal,

Here are some of my thoughts,
On Wed, Aug 21, 2019 at 04:06:32PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
> > Hi Michal,
> > 
> > The smarts for tuning these knobs can be implemented in userspace and
> > more knobs added to allow for what is missing today, but we get back to
> > the same issue as before. That does nothing to make kernel self-tuning
> > and adds possibly even more knobs to userspace. Something so fundamental
> > to kernel memory management as making free pages available when they are
> > needed really should be taken care of in the kernel itself. Moving it to
> > userspace just means the kernel is hobbled unless one installs and tunes
> > a userspace package correctly.
> 
> From my past experience the existing autotunig works mostly ok for a
> vast variety of workloads. A more clever tuning is possible and people
> are doing that already. Especially for cases when the machine is heavily
> overcommited. There are different ways to achieve that. Your new
> in-kernel auto tuning would have to be tested on a large variety of
> workloads to be proven and riskless. So I am quite skeptical to be
> honest.
Could you give some references to such works regarding tuning the kernel? 

Essentially, our idea here is to foresee potential memory exhaustion.
This foreseeing is done by observing the workload and its memory usage.
Based on these observations, we make a prediction whether or not memory
exhaustion could occur. If memory exhaustion is foreseen, we reclaim
some more memory. kswapd stops reclaim when the high watermark (hwmark)
is reached. hwmark is usually set to a fairly low percentage of total
memory; on my system, hwmark for zone Normal is 13% of total pages.
So there is scope for reclaiming more pages to make sure the system does
not suffer from a lack of pages.
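For reference, the per-zone high watermark mentioned above can be read straight out of /proc/zoneinfo. A rough parser sketch (assuming the usual zoneinfo layout; exact field spacing varies between kernel versions, and the per-cpu "high:" pageset lines are deliberately not matched):

```python
import re

def high_watermarks(zoneinfo_text):
    """Parse per-zone 'high' watermarks (in pages) from /proc/zoneinfo text."""
    marks = {}
    zone = None
    for line in zoneinfo_text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if m:
            zone = (int(m.group(1)), m.group(2))
            continue
        # watermark line is '        high     N'; pageset lines use 'high:' so
        # they do not match this pattern
        m = re.match(r"\s+high\s+(\d+)$", line)
        if m and zone is not None:
            marks.setdefault(zone, int(m.group(1)))
    return marks
```

In practice this would be fed `open("/proc/zoneinfo").read()` and compared against the zone's managed page count to get the percentage.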

Since we are "predicting", there could be mistakes in our prediction.
The question is how bad are the mistakes? How much does a wrong
prediction cost?

A right prediction would be a win. We rightfully predict that there could be
exhaustion, which leads to us reclaiming more memory (beyond hwmark) and
compacting memory beforehand (unlike kcompactd, which does it on demand).

A wrong prediction, on the other hand, can be categorized into 2
situations:
(i) We foresee memory exhaustion but there is no memory exhaustion in
the future. In this case, we would be reclaiming more memory for not a lot
of use. This situation is not entirely bad, but we definitely waste a few
clock cycles.
(ii) We don't foresee memory exhaustion but there is memory exhaustion
in the future. This is a bad case where we may end up going into direct
compaction/reclaim. But it could be the case that the memory exhaustion
is far in the future and, even though we didn't see it, kswapd could have
reclaimed that memory or a drop_caches occurred.

How often we hit wrong predictions of type (ii) would really determine our
efficiency.

Coming to your situation of provisioning VMs: a situation where our work
will pay off is a cloud burst. When the demand for VMs is very high, our
algorithm could adapt to the increase in demand and reclaim/compact more
memory ahead of time to reduce allocation stalls and improve
performance.
> Therefore I would really focus on discussing whether we have sufficient
> APIs to tune the kernel to do the right thing when needed. That requires
> to identify gaps in that area. 
One thing that comes to my mind is based on the issue Khalid mentioned
earlier on how his desktop took more than 30 seconds to boot up because of
the caches using up a lot of memory.
Rather than allowing any unused memory to become page cache, would it be
a good idea to fix a size for the caches and elastically change that size
based on the workload?

Thank you
Bharath

> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko Aug. 27, 2019, 6:16 a.m. UTC | #9
On Tue 27-08-19 02:14:20, Bharath Vedartham wrote:
> Hi Michal,
> 
> Here are some of my thoughts,
> On Wed, Aug 21, 2019 at 04:06:32PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
> > > Hi Michal,
> > > 
> > > The smarts for tuning these knobs can be implemented in userspace and
> > > more knobs added to allow for what is missing today, but we get back to
> > > the same issue as before. That does nothing to make kernel self-tuning
> > > and adds possibly even more knobs to userspace. Something so fundamental
> > > to kernel memory management as making free pages available when they are
> > > needed really should be taken care of in the kernel itself. Moving it to
> > > userspace just means the kernel is hobbled unless one installs and tunes
> > > a userspace package correctly.
> > 
> > From my past experience the existing autotunig works mostly ok for a
> > vast variety of workloads. A more clever tuning is possible and people
> > are doing that already. Especially for cases when the machine is heavily
> > overcommited. There are different ways to achieve that. Your new
> > in-kernel auto tuning would have to be tested on a large variety of
> > workloads to be proven and riskless. So I am quite skeptical to be
> > honest.
> Could you give some references to such works regarding tuning the kernel? 

Talk to Facebook guys and their usage of PSI to control the memory
distribution and OOM situations.

> Essentially, Our idea here is to foresee potential memory exhaustion.
> This foreseeing is done by observing the workload, observing the memory
> usage of the workload. Based on this observations, we make a prediction
> whether or not memory exhaustion could occur.

I understand that and I am not disputing this can be useful. All I do
argue here is that there is unlikely to be a good "crystal ball" for
most/all workloads that would justify its inclusion into the kernel, and
that this is something better done in userspace where you can experiment
and tune the behavior for a particular workload of your interest.

Therefore I would like to shift the discussion towards existing APIs and
whether they are suitable for such advanced auto-tuning. I haven't
heard any arguments about missing pieces.

> If memory exhaustion
> occurs, we reclaim some more memory. kswapd stops reclaim when
> hwmark is reached. hwmark is usually set to a fairly low percentage of
> total memory, in my system for zone Normal hwmark is 13% of total pages.
> So there is scope for reclaiming more pages to make sure system does not
> suffer from a lack of pages. 

Yes and we have ways to control those watermarks that your monitoring
tool can use to alter the reclaim behavior.
 
[...]
> > Therefore I would really focus on discussing whether we have sufficient
> > APIs to tune the kernel to do the right thing when needed. That requires
> > to identify gaps in that area. 
> One thing that comes to my mind is based on the issue Khalid mentioned
> earlier on how his desktop took more than 30secs to boot up because of
> the caches using up a lot of memory.
> Rather than allowing any unused memory to be the page cache, would it be
> a good idea to fix a size for the caches and elastically change the size
> based on the workload?

I do not think so. Limiting the pagecache is unlikely to help as it is
really cheap to reclaim most of the time. In those cases when this is
not the case (e.g. the underlying FS needs to flush data and/or metadata),
the same would be possible in a restricted page cache situation,
and you could easily end up stalled waiting for pagecache (e.g. any
executable/library) while there is a lot of memory.

I cannot comment on Khalid's example because there were no details
there, but I would be really surprised if the primary source of the stall
was the pagecache.
Bharath Vedartham Aug. 28, 2019, 1:09 p.m. UTC | #10
Hi Michal, Thank you for spending your time on this.
On Tue, Aug 27, 2019 at 08:16:06AM +0200, Michal Hocko wrote:
> On Tue 27-08-19 02:14:20, Bharath Vedartham wrote:
> > Hi Michal,
> > 
> > Here are some of my thoughts,
> > On Wed, Aug 21, 2019 at 04:06:32PM +0200, Michal Hocko wrote:
> > > On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
> > > > Hi Michal,
> > > > 
> > > > The smarts for tuning these knobs can be implemented in userspace and
> > > > more knobs added to allow for what is missing today, but we get back to
> > > > the same issue as before. That does nothing to make kernel self-tuning
> > > > and adds possibly even more knobs to userspace. Something so fundamental
> > > > to kernel memory management as making free pages available when they are
> > > > needed really should be taken care of in the kernel itself. Moving it to
> > > > userspace just means the kernel is hobbled unless one installs and tunes
> > > > a userspace package correctly.
> > > 
> > > From my past experience the existing autotunig works mostly ok for a
> > > vast variety of workloads. A more clever tuning is possible and people
> > > are doing that already. Especially for cases when the machine is heavily
> > > overcommited. There are different ways to achieve that. Your new
> > > in-kernel auto tuning would have to be tested on a large variety of
> > > workloads to be proven and riskless. So I am quite skeptical to be
> > > honest.
> > Could you give some references to such works regarding tuning the kernel? 
> 
> Talk to Facebook guys and their usage of PSI to control the memory
> distribution and OOM situations.
Yup. Thanks for the pointer.
> > Essentially, Our idea here is to foresee potential memory exhaustion.
> > This foreseeing is done by observing the workload, observing the memory
> > usage of the workload. Based on this observations, we make a prediction
> > whether or not memory exhaustion could occur.
> 
> I understand that and I am not disputing this can be useful. All I do
> argue here is that there is unlikely a good "crystall ball" for most/all
> workloads that would justify its inclusion into the kernel and that this
> is something better done in the userspace where you can experiment and
> tune the behavior for a particular workload of your interest.
> 
> Therefore I would like to shift the discussion towards existing APIs and
> whether they are suitable for such an advance auto-tuning. I haven't
> heard any arguments about missing pieces.
I understand your concern here. Just confirming, by APIs you are
referring to sysctls, sysfs files and stuff like that right?
> > If memory exhaustion
> > occurs, we reclaim some more memory. kswapd stops reclaim when
> > hwmark is reached. hwmark is usually set to a fairly low percentage of
> > total memory, in my system for zone Normal hwmark is 13% of total pages.
> > So there is scope for reclaiming more pages to make sure system does not
> > suffer from a lack of pages. 
> 
> Yes and we have ways to control those watermarks that your monitoring
> tool can use to alter the reclaim behavior.
Just to confirm here, I am aware of one way, which is to alter the
min_free_kbytes value. What other ways are there to alter watermarks
from user space? 
> [...]
> > > Therefore I would really focus on discussing whether we have sufficient
> > > APIs to tune the kernel to do the right thing when needed. That requires
> > > to identify gaps in that area. 
> > One thing that comes to my mind is based on the issue Khalid mentioned
> > earlier on how his desktop took more than 30secs to boot up because of
> > the caches using up a lot of memory.
> > Rather than allowing any unused memory to be the page cache, would it be
> > a good idea to fix a size for the caches and elastically change the size
> > based on the workload?
> 
> I do not think so. Limiting the pagecache is unlikely to help as it is
> really cheap to reclaim most of the time. In those cases when this is
> not the case (e.g. the underlying FS needs to flush and/or metadata)
> then the same would be possible in a restricted page cache situation
> and you could easily end up stalled waiting for pagecache (e.g. any
> executable/library) while there is a lot of memory.
That makes sense to me.
> I cannot comment on the Khalid's example because there were no details
> there but I would be really surprised if the primary source of stall was
> the pagecache.
Should have done more research before talking :) Sorry about that.
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 28, 2019, 1:15 p.m. UTC | #11
On Wed 28-08-19 18:39:22, Bharath Vedartham wrote:
[...]
> > Therefore I would like to shift the discussion towards existing APIs and
> > whether they are suitable for such an advance auto-tuning. I haven't
> > heard any arguments about missing pieces.
> I understand your concern here. Just confirming, by APIs you are
> referring to sysctls, sysfs files and stuff like that right?

Yup

> > > If memory exhaustion
> > > occurs, we reclaim some more memory. kswapd stops reclaim when
> > > hwmark is reached. hwmark is usually set to a fairly low percentage of
> > > total memory, in my system for zone Normal hwmark is 13% of total pages.
> > > So there is scope for reclaiming more pages to make sure system does not
> > > suffer from a lack of pages. 
> > 
> > Yes and we have ways to control those watermarks that your monitoring
> > tool can use to alter the reclaim behavior.
> Just to confirm here, I am aware of one way, which is to alter the
> min_free_kbytes value. What other ways are there to alter watermarks
> from user space? 

/proc/sys/vm/watermark_*factor
Khalid Aziz Aug. 30, 2019, 9:35 p.m. UTC | #12
On 8/27/19 12:16 AM, Michal Hocko wrote:
> On Tue 27-08-19 02:14:20, Bharath Vedartham wrote:
>> Hi Michal,
>>
>> Here are some of my thoughts,
>> On Wed, Aug 21, 2019 at 04:06:32PM +0200, Michal Hocko wrote:
>>> On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
>>>> Hi Michal,
>>>>
>>>> The smarts for tuning these knobs can be implemented in userspace and
>>>> more knobs added to allow for what is missing today, but we get back to
>>>> the same issue as before. That does nothing to make kernel self-tuning
>>>> and adds possibly even more knobs to userspace. Something so fundamental
>>>> to kernel memory management as making free pages available when they are
>>>> needed really should be taken care of in the kernel itself. Moving it to
>>>> userspace just means the kernel is hobbled unless one installs and tunes
>>>> a userspace package correctly.
>>>
>>> From my past experience the existing autotunig works mostly ok for a
>>> vast variety of workloads. A more clever tuning is possible and people
>>> are doing that already. Especially for cases when the machine is heavily
>>> overcommited. There are different ways to achieve that. Your new
>>> in-kernel auto tuning would have to be tested on a large variety of
>>> workloads to be proven and riskless. So I am quite skeptical to be
>>> honest.
>> Could you give some references to such works regarding tuning the kernel? 
> 
> Talk to Facebook guys and their usage of PSI to control the memory
> distribution and OOM situations.
> 
>> Essentially, Our idea here is to foresee potential memory exhaustion.
>> This foreseeing is done by observing the workload, observing the memory
>> usage of the workload. Based on this observations, we make a prediction
>> whether or not memory exhaustion could occur.
> 
> I understand that and I am not disputing this can be useful. All I do
> argue here is that there is unlikely a good "crystall ball" for most/all
> workloads that would justify its inclusion into the kernel and that this
> is something better done in the userspace where you can experiment and
> tune the behavior for a particular workload of your interest.
> 
> Therefore I would like to shift the discussion towards existing APIs and
> whether they are suitable for such an advance auto-tuning. I haven't
> heard any arguments about missing pieces.
> 

We seem to be in agreement that dynamic tuning is a useful tool. The
question is whether that tuning belongs in the kernel or in userspace. I
see your point that putting it in userspace allows for faster evolution
of such a predictive algorithm than would be possible for an in-kernel
algorithm. I see the following pros and cons with that approach:

+ Keeps the complexity of predictive algorithms out of the kernel and
allows for faster evolution of these algorithms in userspace.

+ The tuning algorithm can be fine-tuned to specific workloads as
appropriate.

- The kernel is not self-tuning and is dependent upon a userspace tool to
perform well in a fundamental area of memory management.

- More knobs get added to an already crowded field of knobs to allow
userspace to tweak the mm subsystem for better performance.

As for adding a predictive algorithm to the kernel, I see the following
pros and cons:

+ The kernel becomes self-tuning and can respond better to varying
workloads.

+ Allows the number of user-visible tuning knobs to be reduced.

- Getting the predictive algorithm right is important to ensure none of
the users see worse performance than today.

- Adds a certain level of complexity to the mm subsystem.

Pushing the burden of tuning the kernel to userspace is no different from
where we are today, and we still have allocation stall issues after years
of tuning from userspace. Adding more knobs to aid tuning from userspace
just makes the kernel look even more complex to users. In my
opinion, a self-tuning kernel should be the basis of the long-term
solution. We can still export knobs to userspace to allow users with
specific needs to further fine-tune, but the base kernel should work well
enough for the majority of users. We are not there at this point. We can
discuss what the missing pieces are to support further tuning from
userspace, but is continuing to tweak from userspace the right long-term
strategy?

Assuming we want to continue to support tuning from userspace instead, I
can't say more knobs are needed right now. We may have enough knobs and
monitors available between /proc/buddyinfo, /sys/devices/system/node and
/proc/sys/vm. The right values for these knobs and their interactions are
not always clear. Maybe we need to simplify these knobs into something
more understandable for the average user, as opposed to adding more knobs.
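As an illustration of using the monitors listed above, the fragmentation failure described earlier (hundreds of thousands of low-order pages, nothing at order 4) is directly visible in /proc/buddyinfo. A hypothetical helper, not part of any existing tool:

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free-block counts indexed by order."""
    zones = {}
    for line in text.strip().splitlines():
        # line format: 'Node 0, zone   Normal   N0 N1 ... N10'
        head, rest = line.split("zone", 1)
        node = int(head.split(",")[0].split()[1])
        fields = rest.split()
        zones[(node, fields[0])] = [int(c) for c in fields[1:]]
    return zones

def can_satisfy(zones, order):
    """True if any zone holds at least one free block of `order` or larger."""
    return any(any(counts[order:]) for counts in zones.values())
```

A monitoring tool could log `can_satisfy(zones, 4)` going False well before an order-4 allocation actually fails, which is exactly the signal a trend line on per-order free counts would give earlier still.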

--
Khalid
Michal Hocko Sept. 2, 2019, 8:02 a.m. UTC | #13
On Fri 30-08-19 15:35:06, Khalid Aziz wrote:
[...]
> - Kernel is not self-tuning and is dependent upon a userspace tool to
> perform well in a fundamental area of memory management.

You keep bringing this up without an actual analysis of a wider range of
workloads that would prove that the default behavior is really
suboptimal. You are making some assumptions based on a very specific DB
workload which might benefit from more aggressive background reclaim.
If you really want to sell any changes to auto tuning then you really
need to come up with more workloads and an actual theory of why an early
and more aggressive reclaim pays off.
Khalid Aziz Sept. 3, 2019, 7:45 p.m. UTC | #14
On 9/2/19 2:02 AM, Michal Hocko wrote:
> On Fri 30-08-19 15:35:06, Khalid Aziz wrote:
> [...]
>> - Kernel is not self-tuning and is dependent upon a userspace tool to
>> perform well in a fundamental area of memory management.
> 
> You keep bringing this up without an actual analysis of a wider range of
> workloads that would prove that the default behavior is really
> suboptimal. You are making some assumptions based on a very specific DB
> workload which might benefit from a more aggressive background workload.
> If you really want to sell any changes to auto tuning then you really
> need to come up with more workloads and an actual theory why an early
> and more aggressive reclaim pays off.
> 

Hi Michal,

Fair enough. I have seen DB and cloud server workloads suffer under the
default reclaim/compaction behavior. It manifests itself as prolonged
delays in populating a new database and in launching new cloud
applications. It is fair to ask for the predictive algorithm to be
proven before pulling something like this into the kernel. I will
implement this same algorithm in userspace and use existing knobs to tune
the kernel dynamically. Running that with a large number of workloads will
provide data on how often this helps. If I find any useful tunables
missing, I will be sure to bring it up.
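As a sketch of what one policy step of such a userspace tool could look like, the predicted time to exhaustion could be mapped onto /proc/sys/vm/watermark_scale_factor (an existing tunable; the threshold and the two scale values below are purely illustrative assumptions, not measured settings):

```python
WATERMARK_KNOB = "/proc/sys/vm/watermark_scale_factor"  # existing upstream tunable

def decide_scale(seconds_left, threshold_s=60, aggressive=200, default=10):
    """Choose a watermark_scale_factor value: raise the watermarks (waking
    kswapd earlier) when predicted exhaustion is near, else use the default."""
    if seconds_left is not None and seconds_left < threshold_s:
        return aggressive
    return default

def apply_scale(value, knob=WATERMARK_KNOB):
    """Write the chosen value to the sysctl file (requires root)."""
    with open(knob, "w") as f:
        f.write(str(value))
```

A monitoring loop would feed `decide_scale()` with the trend prediction and call `apply_scale()` only when the value changes, to avoid needless kswapd wakeups.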

Thanks,
Khalid