Message ID: 20221004151748.293388-1-longman@redhat.com
Series: blk-cgroup: Optimize blkcg_rstat_flush()
On 10/6/22 06:11, Hillf Danton wrote:
> On 4 Oct 2022 11:17:48 -0400 Waiman Long <longman@redhat.com>
>> For a system with many CPUs and block devices, the time to do
>> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
>> can be especially problematic as interrupts are disabled during the
>> flush. It was reported that it might take seconds to complete in some
>> extreme cases, leading to hard lockup messages.
>>
>> As it is likely that not all the percpu blkg_iostat_set's have been
>> updated since the last flush, those stale blkg_iostat_set's don't need
>> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
>> by keeping a lockless list of recently updated blkg_iostat_set's in a
>> newly added percpu blkcg->lhead pointer.
>>
>> The blkg_iostat_set is added to a sentinel lockless list on the update
>> side in blk_cgroup_bio_start(). It is removed from the sentinel
>> lockless list when flushed in blkcg_rstat_flush(). Due to racing, it
>> is possible that blkg_iostat_set's in the lockless list may have no
>> new IO stats to be flushed, but that is OK.
>
> So it is likely that another flag, updated when a bis is added to or
> deleted from the llist, can cut 1/3 off without raising the risk of
> overcomplicating your patch.
>
>> struct blkg_iostat_set {
>> 	struct u64_stats_sync	sync;
>> +	struct llist_node	lnode;
>> +	struct blkcg_gq		*blkg;
> +	atomic_t		queued;
>
>> 	struct blkg_iostat	cur;
>> 	struct blkg_iostat	last;
>> };

Yes, by introducing a flag to record the lockless-list state, it is
possible to just use the current llist implementation. Maybe I can
rework it for now without the sentinel variant and post a separate
llist patch for that later on.

Cheers,
Longman
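
To make the flag idea above concrete, here is a minimal, untested sketch
of how an atomic queued flag could gate llist insertion, assuming the
blkg_iostat_set fields from the quoted hunk. blkcg_queue_bis() and
blkcg_flush_bis() are hypothetical helper names used only for
illustration; llist_add(), llist_del_all() and
llist_for_each_entry_safe() are the stock <linux/llist.h> primitives.

#include <linux/atomic.h>
#include <linux/llist.h>

/*
 * Update side (e.g. from blk_cgroup_bio_start()): only the 0 -> 1
 * transition of ->queued wins the right to put the bis on the per-cpu
 * lockless list, so each bis appears at most once between flushes.
 */
static void blkcg_queue_bis(struct blkg_iostat_set *bis,
			    struct llist_head *lhead)
{
	if (atomic_cmpxchg(&bis->queued, 0, 1) == 0)
		llist_add(&bis->lnode, lhead);
}

/*
 * Flush side (e.g. from blkcg_rstat_flush()): detach the whole list in
 * one atomic exchange, then walk it. Clearing ->queued before reading
 * the stats lets a racing updater requeue the bis; at worst it is
 * flushed again with nothing new, which the patch already tolerates.
 */
static void blkcg_flush_bis(struct llist_head *lhead)
{
	struct llist_node *lnode = llist_del_all(lhead);
	struct blkg_iostat_set *bis, *next;

	llist_for_each_entry_safe(bis, next, lnode, lnode) {
		atomic_set(&bis->queued, 0);
		/* ... fold bis->cur - bis->last into the parent stats ... */
	}
}

With the flag in place, an entry never has to stay resident on the list,
so the plain llist_add()/llist_del_all() pair suffices and the sentinel
llist variant is no longer needed.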