[4/8] bfq: Limit number of requests consumed by each cgroup

Message ID 20211006173157.6906-4-jack@suse.cz (mailing list archive)
State New, archived
Series bfq: Limit number of allocated scheduler tags per cgroup

Commit Message

Jan Kara Oct. 6, 2021, 5:31 p.m. UTC
When cgroup IO scheduling is used with BFQ, it does not really provide
service differentiation if the cgroup drives a big IO depth. That happens,
for example, with writeback, which asynchronously submits lots of IO, but
it can happen with AIO as well. The problem is that if we have two cgroups
that submit IO with different weights, the cgroup with the higher weight
properly gets more IO time and is able to dispatch more IO. However, this
causes the lower weight cgroup to accumulate more requests inside BFQ and
eventually the lower weight cgroup consumes most of the IO scheduler tags.
At that point the higher weight cgroup stops getting better service as it
is mostly blocked waiting for a scheduler tag while its queues inside BFQ
are empty, and thus the lower weight cgroup gets served.

Check in bfq_limit_depth() how many requests the submitting cgroup has
allocated and, if it consumes more requests than would correspond to its
weight, limit the available depth to 1 so that the cgroup cannot allocate
many more requests. With this limitation the higher weight cgroup gets
proper service even with writeback.
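
For example (with made-up numbers): with 64 scheduler tags and two cgroups
of weights 100 and 200, the per-cgroup limits work out to roughly
64 * 100 / 300 = 21 and 64 * 200 / 300 = 43 tags. Once the weight-100
cgroup has 21 or more requests allocated, its depth is cut to 1, so it can
no longer crowd out the weight-200 cgroup.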

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/bfq-iosched.c | 137 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 118 insertions(+), 19 deletions(-)

Comments

Michal Koutný Nov. 2, 2021, 6:16 p.m. UTC | #1
Hello.

On Wed, Oct 06, 2021 at 07:31:43PM +0200, Jan Kara <jack@suse.cz> wrote:
> +	for (level--; level >= 0; level--) {
> +		entity = entities[level];
> +		if (level > 0) {
> +			wsum = bfq_entity_service_tree(entity)->wsum;
> +		} else {
> +			int i;
> +			/*
> +			 * For bfqq itself we take into account service trees
> +			 * of all higher priority classes and multiply their
> +			 * weights so that low prio queue from higher class
> +			 * gets more requests than high prio queue from lower
> +			 * class.
> +			 */
> +			wsum = 0;
> +			for (i = 0; i <= class_idx; i++) {
> +				wsum = wsum * IOPRIO_BE_NR +
> +					sched_data->service_tree[i].wsum;
> +			}
> +		}
> +		limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);

This scheme caught my eye. You multiply (tree) weights by a factor
depending on the class when counting the wsum, but then you don't apply
the same scaling for the evaluated entity in the numerator.

IOW, I think there should be something like
	scale = (level > 0) ? 1 : int_pow(IOPRIO_BE_NR, BFQ_IOPRIO_CLASSES - bfq_class_idx(entity));
	limit = DIV_ROUND_CLOSEST(limit * entity->weight * scale, wsum);

For instance, if there are two cgroups (level=1) with weights 100 and
200, and each cgroup has a single IOPRIO_CLASS_BE entity (level=0) in
it, the `limit` distribution would honor the ratio of weights from
level=1 (100:200) but it would artificially lower the absolute number of
allowed tags. If I am not mistaken, that would be reduced by a factor of
1/BFQ_IOPRIO_CLASSES.

Also, if I consider it more broadly, is this supposed to match/extend
the bfq_ioprio_to_weight() calculation?


Michal
Jan Kara Nov. 3, 2021, 1:03 p.m. UTC | #2
Hello,

On Tue 02-11-21 19:16:58, Michal Koutný wrote:
> On Wed, Oct 06, 2021 at 07:31:43PM +0200, Jan Kara <jack@suse.cz> wrote:
> > +	for (level--; level >= 0; level--) {
> > +		entity = entities[level];
> > +		if (level > 0) {
> > +			wsum = bfq_entity_service_tree(entity)->wsum;
> > +		} else {
> > +			int i;
> > +			/*
> > +			 * For bfqq itself we take into account service trees
> > +			 * of all higher priority classes and multiply their
> > +			 * weights so that low prio queue from higher class
> > +			 * gets more requests than high prio queue from lower
> > +			 * class.
> > +			 */
> > +			wsum = 0;
> > +			for (i = 0; i <= class_idx; i++) {
> > +				wsum = wsum * IOPRIO_BE_NR +
> > +					sched_data->service_tree[i].wsum;
> > +			}
> > +		}
> > +		limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
> 
> This scheme caught my eye. You multiply (tree) weights by a factor
> depending on the class when counting the wsum, but then you don't apply
> the same scaling for the evaluated entity in the numerator.

Since we stop the loop at bfq_class_idx(entity), I don't think scaling of
the numerator makes sense - effectively, when all the processes having IO
to submit (these are accounted in wsum) are in the same IO priority class,
the above code collapses to just:

  limit = limit * entity->weight / sched_data->service_tree[bfq_class_idx(entity)].wsum

I.e., we scale available tags proportionally to bfq_queue weight (which
scales linearly with IO priority).
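
For example (made-up numbers): with three active BE queues of weights 4, 4
and 8 (wsum = 16) and 32 tags available at that level, the weight-8 queue
may hold up to 32 * 8 / 16 = 16 tags before it is reported as over its
limit.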

When there are processes in, say, both the RT and BE IO priority classes,
a process in the RT class still uses the above formula - i.e., as if all
tags available for the cgroup were split proportionally only among active
tasks in the RT IO priority class. So in principle it can happen that there
would be no tag left for a process in a lower IO priority class - and that
is fine, we don't care, because we don't want to submit IO from a lower IO
priority class while there is still IO in a higher IO priority class.

Now consider the situation for a process in the BE IO priority class in
this setting. All processes in the BE class can together occupy at most a
BE_wsum / (RT_wsum * IOPRIO_BE_NR + BE_wsum) fraction of the tags. This is
admittedly a somewhat arbitrary fraction, but it makes sure that for each
process in the RT class there are at least as many tags left as for the
highest priority process in the BE class.
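
To make the class scaling concrete, here is a small userspace sketch of
the wsum computation (the weights and the tag limit are made-up numbers;
only IOPRIO_BE_NR and DIV_ROUND_CLOSEST mirror the kernel definitions):

  #include <stdio.h>

  #define IOPRIO_BE_NR 8
  #define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

  int main(void)
  {
          /* wsum of active entities per class (made-up): RT, BE, IDLE */
          unsigned long tree_wsum[3] = { 100, 300, 0 };
          unsigned long limit = 64;  /* tags available at the bfqq level */
          int class_idx, i;

          for (class_idx = 0; class_idx < 2; class_idx++) {
                  unsigned long wsum = 0;

                  /*
                   * Inflate the weights of higher classes so that even
                   * the lowest priority RT queue outranks any BE queue.
                   */
                  for (i = 0; i <= class_idx; i++)
                          wsum = wsum * IOPRIO_BE_NR + tree_wsum[i];

                  /* Tag cap for an entity of weight 100 in this class */
                  printf("class %d: wsum=%lu cap=%lu\n", class_idx, wsum,
                         DIV_ROUND_CLOSEST(100 * limit, wsum));
          }
          return 0;
  }

This prints cap=64 for the RT entity but only cap=6 for an equal-weight
BE entity, matching the BE_wsum / (RT_wsum * IOPRIO_BE_NR + BE_wsum)
fraction above.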

> IOW, I think there should be something like
> 	scale = (level > 0) ? 1 : int_pow(IOPRIO_BE_NR, BFQ_IOPRIO_CLASSES - bfq_class_idx(entity));
> 	limit = DIV_ROUND_CLOSEST(limit * entity->weight * scale, wsum);
> 
> For instance, if there are two cgroups (level=1) with weights 100 and
> 200, and each cgroup has a single IOPRIO_CLASS_BE entity (level=0) in
> it, the `limit` distribution would honor the ratio of weights from
> level=1 (100:200) but it would artificially lower the absolute number of
> allowed tags. If I am not mistaken, that would be reduced by a factor of
> 1/BFQ_IOPRIO_CLASSES.

I don't see where my code would lead to the available number of tags being
artificially lower as you write - in your example the wsum for the RT class
would be 0, so effectively all terms of the formula for that class cancel out.
As I wrote above, the highest active IO priority class effectively allows
processes in this class to consume all tags available for a cgroup. If
there are lower IO priority classes active as well, we allow them to
consume some tags but never allow them to consume all of them...

> Also, if I consider it more broadly, is this supposed to match/extend
> the bfq_ioprio_to_weight() calculation?

Yes, this is kind of an extension of bfq_ioprio_to_weight() that allows
some comparison of queues from different IO priority classes.

								Honza
Michal Koutný Nov. 3, 2021, 6:12 p.m. UTC | #3
On Wed, Nov 03, 2021 at 02:03:14PM +0100, Jan Kara <jack@suse.cz> wrote:
> Since we stop the loop at bfq_class_idx(entity)

Aha! I overlooked that the for loop ends with the entity's class here and
not after the full range of classes.

> I.e., we scale available tags proportionally to bfq_queue weight (which
> scales linearly with IO priority).

Yes, you're working within the "order" of the entity's class, and it's
always the last, i.e. the lowest too, so the scale is 1.

> So in principle it can happen that there would be no tag left for a
> process in a lower IO priority class - and that is fine, we don't care,
> because we don't want to submit IO from a lower IO priority class while
> there is still IO in a higher IO priority class.

Actually, can it ever happen that the higher class leaves some tags for
the lower? (IOW, does the CLS_wsum ever exceed the sum of all active
entities of the CLS at a given point in time?) (1)

> Now consider the situation for a process in the BE IO priority class in
> this setting. All processes in the BE class can together occupy at most a
> BE_wsum / (RT_wsum * IOPRIO_BE_NR + BE_wsum) fraction of the tags. This is
> admittedly a somewhat arbitrary fraction, but it makes sure that for each
> process in the RT class there are at least as many tags left as for the
> highest priority process in the BE class.

Can it happen that bfqq_request_over_limit() is called for a BE entity
before it is called for an RT entity (more precisely, not the
bfqq_request_over_limit() calls but the actual allocation of tags)? (2)

> As I wrote above, the highest active IO priority class effectively allows
> processes in this class to consume all tags available for a cgroup. If
> there are lower IO priority classes active as well, we allow them to
> consume some tags but never allow them to consume all of them...

I assume this implies the answer to my previous question (2) is "yes"
and to question (1) is: "numerically no, but a lower class entity can
take some tags if it gets to draw them earlier".

> Yes, this is kind of an extension of bfq_ioprio_to_weight() that allows
> some comparison of queues from different IO priority classes.

I see there's no point in using the same weight values in the
bfqq_request_over_limit() calculations as bfq_ioprio_to_weight()
computes, given the strict ordering of classes above each other. Your
scoring makes sense to me now.

Reviewed-by: Michal Koutný <mkoutny@suse.com>

(this patch only)
Jan Kara Nov. 4, 2021, 11:20 a.m. UTC | #4
On Wed 03-11-21 19:12:12, Michal Koutný wrote:
> On Wed, Nov 03, 2021 at 02:03:14PM +0100, Jan Kara <jack@suse.cz> wrote:
> > Since we stop the loop at bfq_class_idx(entity)
> 
> Aha! I overlooked that the for loop ends with the entity's class here and
> not after the full range of classes.
> 
> > I.e., we scale available tags proportionally to bfq_queue weight (which
> > scales linearly with IO priority).
> 
> Yes, you're working within the "order" of the entity's class, and it's
> always the last, i.e. the lowest too, so the scale is 1.
> 
> > So in principle it can happen that there would be no tag left for a
> > process in a lower IO priority class - and that is fine, we don't care,
> > because we don't want to submit IO from a lower IO priority class while
> > there is still IO in a higher IO priority class.
> 
> Actually, can it ever happen that the higher class leaves some tags for
> the lower? (IOW, does the CLS_wsum ever exceed the sum of all active
> entities of the CLS at a given point in time?) (1)

Yes, that can happen. Here we compute just an upper bound on the number of
tags. Each entity can use fewer tags than this upper limit, and thus there
will be tags left for entities in lower IO priority classes.

> > Now consider the situation for a process in the BE IO priority class in
> > this setting. All processes in the BE class can together occupy at most a
> > BE_wsum / (RT_wsum * IOPRIO_BE_NR + BE_wsum) fraction of the tags. This is
> > admittedly a somewhat arbitrary fraction, but it makes sure that for each
> > process in the RT class there are at least as many tags left as for the
> > highest priority process in the BE class.
> 
> Can it happen that bfqq_request_over_limit() is called for a BE entity
> before it is called for an RT entity (more precisely, not the
> bfqq_request_over_limit() calls but the actual allocation of tags)? (2)

Sure, that can happen. bfqq_request_over_limit() gets called (and scheduler
tag allocation happens) at the moment a process calls submit_bio(). The time
when each process calls submit_bio() is completely out of our control. It
can even happen that a BE process submits lots of IO, we let it allocate
lots of tags (because there isn't any other process in the service trees),
and then an RT process submits its first IO - only at this moment is the tag
limit for the BE process reduced, so the BE process will block when trying
to allocate any further tag until it frees enough tags by completing IO.
This is actually the reason why we always allow a process to allocate at
least some tags: so that it can enter the service tree, and then gradually
allocate more and more tags (because its tag allocation is not limited,
unlike the tag allocation of the BE process) until it uses its appropriate
share of tags.

> > As I wrote above, the highest active IO priority class effectively allows
> > processes in this class to consume all tags available for a cgroup. If
> > there are lower IO priority classes active as well, we allow them to
> > consume some tags but never allow them to consume all of them...
> 
> I assume this implies the answer to my previous question (2) is "yes"
> and to question (1) is: "numerically no, but a lower class entity can
> take some tags if it gets to draw them earlier".

Exactly.

> > Yes, this is kind of an extension of bfq_ioprio_to_weight() that allows
> > some comparison of queues from different IO priority classes.
> 
> I see there's no point in using the same weight values in the
> bfqq_request_over_limit() calculations as bfq_ioprio_to_weight()
> computes, given the strict ordering of classes above each other. Your
> scoring makes sense to me now.
> 
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks!

								Honza

Patch

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c93a74b8e9c8..3806409610ca 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -565,26 +565,134 @@  static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+#define BFQ_LIMIT_INLINE_DEPTH 16
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_entity *inline_entities[BFQ_LIMIT_INLINE_DEPTH];
+	struct bfq_entity **entities = inline_entities;
+	int depth, level;
+	int class_idx = bfqq->ioprio_class - 1;
+	struct bfq_sched_data *sched_data;
+	unsigned long wsum;
+	bool ret = false;
+
+	if (!entity->on_st_or_in_serv)
+		return false;
+
+	/* +1 for bfqq entity, root cgroup not included */
+	depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
+	if (depth > BFQ_LIMIT_INLINE_DEPTH) {
+		entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
+		if (!entities)
+			return false;
+	}
+
+	spin_lock_irq(&bfqd->lock);
+	sched_data = entity->sched_data;
+	/* Gather our ancestors as we need to traverse them in reverse order */
+	level = 0;
+	for_each_entity(entity) {
+		/*
+		 * If at some level entity is not even active, allow request
+		 * queueing so that BFQ knows there's work to do and activate
+		 * entities.
+		 */
+		if (!entity->on_st_or_in_serv)
+			goto out;
+		/* Uh, more parents than cgroup subsystem thinks? */
+		if (WARN_ON_ONCE(level >= depth))
+			break;
+		entities[level++] = entity;
+	}
+	WARN_ON_ONCE(level != depth);
+	for (level--; level >= 0; level--) {
+		entity = entities[level];
+		if (level > 0) {
+			wsum = bfq_entity_service_tree(entity)->wsum;
+		} else {
+			int i;
+			/*
+			 * For bfqq itself we take into account service trees
+			 * of all higher priority classes and multiply their
+			 * weights so that low prio queue from higher class
+			 * gets more requests than high prio queue from lower
+			 * class.
+			 */
+			wsum = 0;
+			for (i = 0; i <= class_idx; i++) {
+				wsum = wsum * IOPRIO_BE_NR +
+					sched_data->service_tree[i].wsum;
+			}
+		}
+		limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
+		if (entity->allocated >= limit) {
+			bfq_log_bfqq(bfqq->bfqd, bfqq,
+				"too many requests: allocated %d limit %d level %d",
+				entity->allocated, limit, level);
+			ret = true;
+			break;
+		}
+	}
+out:
+	spin_unlock_irq(&bfqd->lock);
+	if (entities != inline_entities)
+		kfree(entities);
+	return ret;
+}
+#else
+static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
+{
+	return false;
+}
+#endif
+
 /*
  * Async I/O can easily starve sync I/O (both sync reads and sync
  * writes), by consuming all tags. Similarly, storms of sync writes,
  * such as those that sync(2) may trigger, can starve sync reads.
  * Limit depths of async I/O and sync writes so as to counter both
  * problems.
+ *
+ * Also if a bfq queue or its parent cgroup consume more tags than would be
+ * appropriate for their weight, we trim the available tag depth to 1. This
+ * avoids a situation where one cgroup can starve another cgroup from tags and
+ * thus block service differentiation among cgroups. Note that because the
+ * queue / cgroup already has many requests allocated and queued, this does not
+ * significantly affect service guarantees coming from the BFQ scheduling
+ * algorithm.
  */
 static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 {
 	struct bfq_data *bfqd = data->q->elevator->elevator_data;
+	struct bfq_io_cq *bic = data->icq ? icq_to_bic(data->icq) : NULL;
+	struct bfq_queue *bfqq = bic ? bic_to_bfqq(bic, op_is_sync(op)) : NULL;
+	int depth;
+	unsigned limit = data->q->nr_requests;
+
+	/* Sync reads have full depth available */
+	if (op_is_sync(op) && !op_is_write(op)) {
+		depth = 0;
+	} else {
+		depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+		limit = (limit * depth) >> bfqd->full_depth_shift;
+	}
 
-	if (op_is_sync(op) && !op_is_write(op))
-		return;
-
-	data->shallow_depth =
-		bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+	/*
+	 * Does queue (or any parent entity) exceed number of requests that
+	 * should be available to it? Heavily limit depth so that it cannot
+	 * consume more available requests and thus starve other entities.
+	 */
+	if (bfqq && bfqq_request_over_limit(bfqq, limit))
+		depth = 1;
 
 	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
-			__func__, bfqd->wr_busy_queues, op_is_sync(op),
-			data->shallow_depth);
+		__func__, bfqd->wr_busy_queues, op_is_sync(op), depth);
+	if (depth)
+		data->shallow_depth = depth;
 }
 
 static struct bfq_queue *
@@ -6851,10 +6959,8 @@  void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
  * See the comments on bfq_limit_depth for the purpose of
  * the depths set in the function. Return minimum shallow depth we'll use.
  */
-static unsigned int bfq_update_depths(struct bfq_data *bfqd,
-				      struct sbitmap_queue *bt)
+static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 {
-	unsigned int i, j, min_shallow = UINT_MAX;
 	unsigned int depth = 1U << bt->sb.shift;
 
 	bfqd->full_depth_shift = bt->sb.shift;
@@ -6888,22 +6994,15 @@  static unsigned int bfq_update_depths(struct bfq_data *bfqd,
 	bfqd->word_depths[1][0] = max((depth * 3) >> 4, 1U);
 	/* no more than ~37% of tags for sync writes (~20% extra tags) */
 	bfqd->word_depths[1][1] = max((depth * 6) >> 4, 1U);
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < 2; j++)
-			min_shallow = min(min_shallow, bfqd->word_depths[i][j]);
-
-	return min_shallow;
 }
 
 static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 	struct blk_mq_tags *tags = hctx->sched_tags;
-	unsigned int min_shallow;
 
-	min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
+	bfq_update_depths(bfqd, tags->bitmap_tags);
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, 1);
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)