
scrub randomization and load threshold

Message ID CABZ+qqma+-bOPcPr3K_9mZaDHE54pvNtWw0OmiAFCYH1zX3Ysw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Dan van der Ster Nov. 16, 2015, 2:25 p.m. UTC
On Thu, Nov 12, 2015 at 4:34 PM, Dan van der Ster <dan@vanderster.com> wrote:
> On Thu, Nov 12, 2015 at 4:10 PM, Sage Weil <sage@newdream.net> wrote:
>> On Thu, 12 Nov 2015, Dan van der Ster wrote:
>>> On Thu, Nov 12, 2015 at 2:29 PM, Sage Weil <sage@newdream.net> wrote:
>>> > On Thu, 12 Nov 2015, Dan van der Ster wrote:
>>> >> Hi,
>>> >>
>>> >> Firstly, we just had a look at the new
>>> >> osd_scrub_interval_randomize_ratio option and found that it doesn't
>>> >> really solve the deep scrubbing problem. Given the default options,
>>> >>
>>> >> osd_scrub_min_interval = 60*60*24
>>> >> osd_scrub_max_interval = 7*60*60*24
>>> >> osd_scrub_interval_randomize_ratio = 0.5
>>> >> osd_deep_scrub_interval = 60*60*24*7
>>> >>
>>> >> we understand that the new option changes the min interval to the
>>> >> range 1-1.5 days. However, this doesn't do anything for the thundering
>>> >> herd of deep scrubs which will happen every 7 days. We've found a
>>> >> configuration that should randomize deep scrubbing across two weeks,
>>> >> e.g.:
>>> >>
>>> >> osd_scrub_min_interval = 60*60*24*7
>>> >> osd_scrub_max_interval = 100*60*60*24 // effectively disabling this option
>>> >> osd_scrub_load_threshold = 10 // effectively disabling this option
>>> >> osd_scrub_interval_randomize_ratio = 2.0
>>> >> osd_deep_scrub_interval = 60*60*24*7
>>> >>
>>> >> but that (a) doesn't allow shallow scrubs to run daily and (b) is so
>>> >> far off the defaults that it's basically an abuse of the intended
>>> >> behaviour.
>>> >>
>>> >> So we'd like to simplify how deep scrubbing can be randomized. Our PR
>>> >> (http://github.com/ceph/ceph/pull/6550) adds a new option
>>> >> osd_deep_scrub_randomize_ratio, which controls a coin flip to randomly
>>> >> turn scrubs into deep scrubs. The default is tuned so roughly 1 in 7
>>> >> scrubs will be run deeply.
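
For illustration, the coin flip described above boils down to something like
this standalone sketch -- the helper name, RNG wiring and ratio value here
are made up for illustration and are not the PR's actual code:

// Promote a regular scrub to a deep scrub with probability
// deep_scrub_randomize_ratio (~0.15, i.e. roughly 1 in 7).
#include <cstdio>
#include <random>

static bool promote_to_deep_scrub(double deep_scrub_randomize_ratio,
                                  std::mt19937 &rng)
{
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  return coin(rng) < deep_scrub_randomize_ratio;
}

int main()
{
  std::mt19937 rng(std::random_device{}());
  int deep = 0, total = 10000;
  for (int i = 0; i < total; i++)
    if (promote_to_deep_scrub(0.15, rng))
      deep++;
  printf("%d of %d scrubs promoted to deep (~%.1f%%)\n",
         deep, total, 100.0 * deep / total);
  return 0;
}
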
>>> >
>>> > The coin flip seems reasonable to me.  But wouldn't it also/instead make
>>> > sense to apply the randomize ratio to the deep_scrub_interval?  Maybe just
>>> > adding in the random factor here:
>>> >
>>> > https://github.com/ceph/ceph/pull/6550/files#diff-dfb9ddca0a3ee32b266623e8fa489626R3247
>>> >
>>> > That is what I would have expected to happen, and if the coin flip is also
>>> > there then you have two knobs controlling the same thing, which'll cause
>>> > confusion...
>>> >
>>>
>>> That was our first idea. But that has a couple downsides:
>>>
>>>   1.  If we use the random range for the deep scrub intervals, e.g.
>>> deep every 1-1.5 weeks, we still get quite bursty scrubbing until it
>>> randomizes over a period of many weeks/months. And I fear it might
>>> even lead to lower frequency harmonics of many concurrent deep scrubs.
>>> Using a coin flip guarantees uniformity starting immediately from time
>>> zero.
>>>
>>>   2. In our PR osd_deep_scrub_interval is still used as an upper limit
>>> on how long a PG can go without being deeply scrubbed. This way
>>> there's no confusion such as PGs going undeep-scrubbed longer than
>>> expected. (In general, I think this random range is unintuitive and
>>> difficult to tune, e.g. see my 2-week deep scrubbing config above.)
>>
>> Fair enough..
>>
>>> For me, the most intuitive configuration (maintaining randomness) would be:
>>>
>>>   a. drop the osd_scrub_interval_randomize_ratio because there is no
>>> shallow scrub thundering herd problem (AFAIK), and it just complicates
>>> the configuration. (But this is in a stable release now so I don't
>>> know if you want to back it out).
>>
>> I'm inclined to leave it, even if it complicates config: just because we
>> haven't noticed the shallow scrub thundering herd doesn't mean it doesn't
>> exist, and I fully expect that it is there.  Also, if the shallow scrubs
>> are lumpy and we're promoting some of them to deep scrubs, then the deep
>> scrubs will be lumpy too.
>>
>
> Sounds good.
>
>>>   b. perform a (usually shallow) scrub every
>>> osd_scrub_interval_(min/max) depending on a self-tuning load
>>> threshold.
>>
>> Yep, although as you note we have some work to do to get there.  :)
>>
>>>   c. do a coin flip each (b) to occasionally turn it into deep scrub.
>>
>> Works for me.
>>
>>>   optionally: d. remove osd_deep_scrub_randomize_ratio and replace it
>>> with  osd_scrub_interval_min/osd_deep_scrub_interval.
>>
>> There is no osd_deep_scrub_randomize_ratio.  Do you mean replace
>> osd_deep_scrub_interval with osd_deep_scrub_{min,max}_interval?
>
> osd_deep_scrub_randomize_ratio is the new option we proposed in the
> PR. We chose 0.15 because it's roughly 1/7 (i.e.
> osd_scrub_interval_min/osd_deep_scrub_interval = 1/7 in the default
> config). But the coin flip could use
> osd_scrub_interval_min/osd_deep_scrub_interval instead of adding this
> extra configurable.
>
> My preference would be to keep it separately configurable.
>
>>> >> Secondly, we'd also like to discuss the osd_scrub_load_threshold
>>> >> option, where we see two problems:
>>> >>    - the default is so low that it disables all the shallow scrub
>>> >> randomization on all but completely idle clusters.
>>> >>    - finding the correct osd_scrub_load_threshold for a cluster is
>>> >> surely unclear/difficult and probably a moving target for most prod
>>> >> clusters.
>>> >>
>>> >> Given those observations, IMHO the smart Ceph admin should set
>>> >> osd_scrub_load_threshold = 10 or higher, to effectively disable that
>>> >> functionality. In the spirit of having good defaults, I therefore
>>> >> propose that we increase the default osd_scrub_load_threshold (to at
>>> >> least 5.0) and consider removing the load threshold logic completely.
>>> >
>>> > This sounds reasonable to me.  It would be great if we could use a 24-hour
>>> > average as the baseline or something so that it was self-tuning (e.g., set
>>> > threshold to .8 of daily average), but that's a bit trickier.  Generally
>>> > all for self-tuning, though... too many knobs...
>>>
>>> Yes, but we probably would need to make your 0.8 a function of the
>>> stddev of the loadavg over a day, to handle clusters with flat
>>> loadavgs as well as varying ones.
>>>
>>> In order to randomly spread the deep scrubs across the week, it's
>>> essential to give each PG many opportunities to scrub throughout the
>>> week. If PGs are only shallow scrubbed once a week (at interval_max),
>>> then every scrub would become a deep scrub and we again have the
>>> thundering herd problem.
>>>
>>> I'll push 5.0 for now.
>>
>> Sounds good.
>>
>> I would still love to see someone tackle the auto-tuning approach,
>> though! :)
>
> I should have some time next week to have a look, if nobody beat me to it.

Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
the loadavg is decreasing (or below the threshold)? As long as the
1min loadavg is less than the 15min loadavg, we should be ok to allow
new scrubs. If you agree I'll add the patch below to my PR.
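
For reference, here is the heuristic in isolation, separate from the OSD.cc
change below (standalone sketch only; the threshold value is hard-coded for
illustration):

// Allow a new scrub if the 1-minute loadavg is below the configured
// threshold, or if it is below the 15-minute loadavg (load trending down).
#include <cstdio>
#include <cstdlib>

static bool scrub_load_ok(double threshold)
{
  double loadavgs[3];
  if (getloadavg(loadavgs, 3) != 3)
    return false;                    // can't read load, be conservative
  if (loadavgs[0] < threshold)
    return true;                     // load is low in absolute terms
  return loadavgs[0] < loadavgs[2];  // load is high but decreasing
}

int main()
{
  const double threshold = 0.5;      // illustrative, not a Ceph default
  printf("scrub allowed: %s\n", scrub_load_ok(threshold) ? "yes" : "no");
  return 0;
}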

-- dan



Comments

Sage Weil Nov. 16, 2015, 3:20 p.m. UTC | #1
On Mon, 16 Nov 2015, Dan van der Ster wrote:
> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
> the loadavg is decreasing (or below the threshold)? As long as the
> 1min loadavg is less than the 15min loadavg, we should be ok to allow
> new scrubs. If you agree I'll add the patch below to my PR.

I like the simplicity of that, but I'm afraid it's going to just trigger a 
feedback loop and oscillations on the host.  I.e., as soon as we see *any* 
decrease, all OSDs on the host will start to scrub, which will push the 
load up.  Once that round of PGs finishes, the load will start to drop 
again, triggering another round.  This'll happen regardless of whether 
we're in peak hours or not, and the high-level goal (IMO at least) is 
to do scrubbing in non-peak hours.

sage

> -- dan
> 
> 
> diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
> index 0562eed..464162d 100644
> --- a/src/osd/OSD.cc
> +++ b/src/osd/OSD.cc
> @@ -6065,20 +6065,24 @@ bool OSD::scrub_time_permit(utime_t now)
> 
>  bool OSD::scrub_load_below_threshold()
>  {
> -  double loadavgs[1];
> -  if (getloadavg(loadavgs, 1) != 1) {
> +  double loadavgs[3];
> +  if (getloadavg(loadavgs, 3) != 3) {
>      dout(10) << __func__ << " couldn't read loadavgs\n" << dendl;
>      return false;
>    }
> 
>    if (loadavgs[0] >= cct->_conf->osd_scrub_load_threshold) {
> -    dout(20) << __func__ << " loadavg " << loadavgs[0]
> -            << " >= max " << cct->_conf->osd_scrub_load_threshold
> -            << " = no, load too high" << dendl;
> -    return false;
> +    if (loadavgs[0] >= loadavgs[2]) {
> +      dout(20) << __func__ << " loadavg " << loadavgs[0]
> +              << " >= max " << cct->_conf->osd_scrub_load_threshold
> +               << " and >= 15m avg " << loadavgs[2]
> +              << " = no, load too high" << dendl;
> +      return false;
> +    }
>    } else {
>      dout(20) << __func__ << " loadavg " << loadavgs[0]
>              << " < max " << cct->_conf->osd_scrub_load_threshold
> +            << " or < 15 min avg " << loadavgs[2]
>              << " = yes" << dendl;
>      return true;
>    }
> 
> 
Dan van der Ster Nov. 16, 2015, 3:32 p.m. UTC | #2
On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>> the loadavg is decreasing (or below the threshold)? As long as the
>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>> new scrubs. If you agree I'll add the patch below to my PR.
>
> I like the simplicity of that, I'm afraid its going to just trigger a
> feedback loop and oscillations on the host.  I.e., as soo as we see *any*
> decrease, all osds on the host will start to scrub, which will push the
> load up.  Once that round of PGs finish, the load will start to drop
> again, triggering another round.  This'll happen regardless of whether
> we're in the peak hours or not, and the high-level goal (IMO at least) is
> to do scrubbing in non-peak hours.

We checked our OSDs' 24hr loadavg plots today and found that the
original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.

BTW, I realized there was a silly error in that earlier patch, and we
need an upper bound anyway, say the number of CPUs. So until your
response came in, I was working with this idea:
https://stikked.web.cern.ch/stikked/view/raw/5586a912

-- dan
Dan van der Ster Nov. 16, 2015, 3:58 p.m. UTC | #3
On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster <dan@vanderster.com> wrote:
> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@newdream.net> wrote:
>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>>> the loadavg is decreasing (or below the threshold)? As long as the
>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>>> new scrubs. If you agree I'll add the patch below to my PR.
>>
>> I like the simplicity of that, I'm afraid its going to just trigger a
>> feedback loop and oscillations on the host.  I.e., as soo as we see *any*
>> decrease, all osds on the host will start to scrub, which will push the
>> load up.  Once that round of PGs finish, the load will start to drop
>> again, triggering another round.  This'll happen regardless of whether
>> we're in the peak hours or not, and the high-level goal (IMO at least) is
>> to do scrubbing in non-peak hours.
>
> We checked our OSDs' 24hr loadavg plots today and found that the
> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
>
> BTW, I realized there was a silly error in that earlier patch, and we
> anyway need an upper bound, say # cpus. So until your response came I
> was working with this idea:
> https://stikked.web.cern.ch/stikked/view/raw/5586a912

Sorry, that link is behind SSO. Here:

https://gist.github.com/dvanders/f3b08373af0f5957f589
Dan van der Ster Nov. 16, 2015, 5:06 p.m. UTC | #4
On Mon, Nov 16, 2015 at 4:58 PM, Dan van der Ster <dan@vanderster.com> wrote:
> On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>>>> the loadavg is decreasing (or below the threshold)? As long as the
>>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>>>> new scrubs. If you agree I'll add the patch below to my PR.
>>>
>>> I like the simplicity of that, I'm afraid its going to just trigger a
>>> feedback loop and oscillations on the host.  I.e., as soo as we see *any*
>>> decrease, all osds on the host will start to scrub, which will push the
>>> load up.  Once that round of PGs finish, the load will start to drop
>>> again, triggering another round.  This'll happen regardless of whether
>>> we're in the peak hours or not, and the high-level goal (IMO at least) is
>>> to do scrubbing in non-peak hours.
>>
>> We checked our OSDs' 24hr loadavg plots today and found that the
>> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
>> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
>>
>> BTW, I realized there was a silly error in that earlier patch, and we
>> anyway need an upper bound, say # cpus. So until your response came I
>> was working with this idea:
>> https://stikked.web.cern.ch/stikked/view/raw/5586a912
>
> Sorry for SSO. Here:
>
> https://gist.github.com/dvanders/f3b08373af0f5957f589

Hi again. Here's a first shot at a daily loadavg heuristic:
https://github.com/ceph/ceph/commit/15474124a183c7e92f457f836f7008a2813aa672
I had to guess where it would be best to store the daily_loadavg
member and where to initialize it... please advise.
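
Purely for illustration, one way such a running daily average could be
maintained is an exponentially decaying mean of the 1-minute loadavg,
updated about once a minute (this is only a sketch, not necessarily what
the commit does):

// Fold the 1-minute loadavg into a decaying mean with a ~24h window.
#include <cstdlib>

static double daily_loadavg = 0.0;

void update_daily_loadavg()              // call roughly once a minute
{
  double loadavgs[1];
  if (getloadavg(loadavgs, 1) != 1)
    return;
  const double samples_per_day = 60.0 * 24.0;
  daily_loadavg += (loadavgs[0] - daily_loadavg) / samples_per_day;
}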

I took the conservative approach of triggering scrubs when either:
   1m loadavg < osd_scrub_load_threshold, or
   1m loadavg < 24hr loadavg && 1m loadavg < 15m loadavg
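
As a rough sketch, with the wiring simplified, the combined check looks
something like this (daily_loadavg is assumed to be kept up to date
elsewhere, as in the commit above):

// Allow a scrub when the 1-minute loadavg is below the configured
// threshold, or below both the running daily average and the 15-minute
// loadavg (i.e. the node is quieter than usual and load is decreasing).
#include <cstdlib>

bool scrub_load_below_threshold(double daily_loadavg, double load_threshold)
{
  double loadavgs[3];
  if (getloadavg(loadavgs, 3) != 3)
    return false;                                 // can't read load

  if (loadavgs[0] < load_threshold)
    return true;                                  // absolute threshold
  if (loadavgs[0] < daily_loadavg && loadavgs[0] < loadavgs[2])
    return true;                                  // below daily and 15m avg
  return false;                                   // load too high
}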

The whole PR would become this:
https://github.com/ceph/ceph/compare/master...cernceph:wip-deepscrub-daily

-- Dan
Sage Weil Nov. 16, 2015, 5:13 p.m. UTC | #5
On Mon, 16 Nov 2015, Dan van der Ster wrote:
> On Mon, Nov 16, 2015 at 4:58 PM, Dan van der Ster <dan@vanderster.com> wrote:
> > On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster <dan@vanderster.com> wrote:
> >> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@newdream.net> wrote:
> >>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
> >>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
> >>>> the loadavg is decreasing (or below the threshold)? As long as the
> >>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
> >>>> new scrubs. If you agree I'll add the patch below to my PR.
> >>>
> >>> I like the simplicity of that, I'm afraid its going to just trigger a
> >>> feedback loop and oscillations on the host.  I.e., as soo as we see *any*
> >>> decrease, all osds on the host will start to scrub, which will push the
> >>> load up.  Once that round of PGs finish, the load will start to drop
> >>> again, triggering another round.  This'll happen regardless of whether
> >>> we're in the peak hours or not, and the high-level goal (IMO at least) is
> >>> to do scrubbing in non-peak hours.
> >>
> >> We checked our OSDs' 24hr loadavg plots today and found that the
> >> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
> >> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
> >>
> >> BTW, I realized there was a silly error in that earlier patch, and we
> >> anyway need an upper bound, say # cpus. So until your response came I
> >> was working with this idea:
> >> https://stikked.web.cern.ch/stikked/view/raw/5586a912
> >
> > Sorry for SSO. Here:
> >
> > https://gist.github.com/dvanders/f3b08373af0f5957f589
> 
> Hi again. Here's a first shot at a daily loadavg heuristic:
> https://github.com/ceph/ceph/commit/15474124a183c7e92f457f836f7008a2813aa672
> I had to guess where it would be best to store the daily_loadavg
> member and where to initialize it... please advise.
> 
> I took the conservative approach of triggering scrubs when either:
>    1m loadavg < osd_scrub_load_threshold, or
>    1m loadavg < 24hr loadavg && 1m loadavg < 15m loadavg
> 
> The whole PR would become this:
> https://github.com/ceph/ceph/compare/master...cernceph:wip-deepscrub-daily

Looks reasonable to me!

I'm still a bit worried that the 1m < 15m thing will mean that on the 
completion of every scrub we have to wait ~1m before the next scrub 
starts.  Maybe that's okay, though... I'd say let's try this and adjust 
that later if it seems problematic (conservative == better).

sage
Dan van der Ster Nov. 16, 2015, 5:30 p.m. UTC | #6
On Mon, Nov 16, 2015 at 6:13 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> On Mon, Nov 16, 2015 at 4:58 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> > On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> >> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@newdream.net> wrote:
>> >>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> >>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>> >>>> the loadavg is decreasing (or below the threshold)? As long as the
>> >>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>> >>>> new scrubs. If you agree I'll add the patch below to my PR.
>> >>>
>> >>> I like the simplicity of that, I'm afraid its going to just trigger a
>> >>> feedback loop and oscillations on the host.  I.e., as soo as we see *any*
>> >>> decrease, all osds on the host will start to scrub, which will push the
>> >>> load up.  Once that round of PGs finish, the load will start to drop
>> >>> again, triggering another round.  This'll happen regardless of whether
>> >>> we're in the peak hours or not, and the high-level goal (IMO at least) is
>> >>> to do scrubbing in non-peak hours.
>> >>
>> >> We checked our OSDs' 24hr loadavg plots today and found that the
>> >> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
>> >> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
>> >>
>> >> BTW, I realized there was a silly error in that earlier patch, and we
>> >> anyway need an upper bound, say # cpus. So until your response came I
>> >> was working with this idea:
>> >> https://stikked.web.cern.ch/stikked/view/raw/5586a912
>> >
>> > Sorry for SSO. Here:
>> >
>> > https://gist.github.com/dvanders/f3b08373af0f5957f589
>>
>> Hi again. Here's a first shot at a daily loadavg heuristic:
>> https://github.com/ceph/ceph/commit/15474124a183c7e92f457f836f7008a2813aa672
>> I had to guess where it would be best to store the daily_loadavg
>> member and where to initialize it... please advise.
>>
>> I took the conservative approach of triggering scrubs when either:
>>    1m loadavg < osd_scrub_load_threshold, or
>>    1m loadavg < 24hr loadavg && 1m loadavg < 15m loadavg
>>
>> The whole PR would become this:
>> https://github.com/ceph/ceph/compare/master...cernceph:wip-deepscrub-daily
>
> Looks reasonable to me!
>
> I'm still a bit worried that the 1m < 15m thing will mean that on the
> completion of every scrub we have to wait ~1m before the next scrub
> starts.  Maybe that's okay, though... I'd say let's try this and adjust
> that later if it seems problematic (conservative == better).
>
> sage

Great. I've updated the PR:  https://github.com/ceph/ceph/pull/6550

Cheers, Dan

Patch

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 0562eed..464162d 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -6065,20 +6065,24 @@  bool OSD::scrub_time_permit(utime_t now)

 bool OSD::scrub_load_below_threshold()
 {
-  double loadavgs[1];
-  if (getloadavg(loadavgs, 1) != 1) {
+  double loadavgs[3];
+  if (getloadavg(loadavgs, 3) != 3) {
     dout(10) << __func__ << " couldn't read loadavgs\n" << dendl;
     return false;
   }

   if (loadavgs[0] >= cct->_conf->osd_scrub_load_threshold) {
-    dout(20) << __func__ << " loadavg " << loadavgs[0]
-            << " >= max " << cct->_conf->osd_scrub_load_threshold
-            << " = no, load too high" << dendl;
-    return false;
+    if (loadavgs[0] >= loadavgs[2]) {
+      dout(20) << __func__ << " loadavg " << loadavgs[0]
+              << " >= max " << cct->_conf->osd_scrub_load_threshold
+               << " and >= 15m avg " << loadavgs[2]
+              << " = no, load too high" << dendl;
+      return false;
+    }
   } else {
     dout(20) << __func__ << " loadavg " << loadavgs[0]
             << " < max " << cct->_conf->osd_scrub_load_threshold
+            << " or < 15 min avg " << loadavgs[2]
             << " = yes" << dendl;
     return true;
   }