[for-6.2/block,V3,2/2] block: Change the granularity of io ticks from ms to ns

Message ID 20221221040506.1174644-2-gulam.mohamed@oracle.com (mailing list archive)
Series: [for-6.2/block,V3,1/2] block: Data type conversion for IO accounting

Commit Message

Gulam Mohamed Dec. 21, 2022, 4:05 a.m. UTC
Problem Description
===================
The "iostat" user-space utility was showing %util as 100% for disks
whose latencies are below one milli-second, i.e. in the range of
micro-seconds and lower.

Root Cause
==========
IO accounting in the block layer is currently done by updating
io_ticks in jiffies, which has milli-second granularity. Because of
this, for devices with IO latencies below one milli-second, each IO is
accounted as one milli-second even though it completes in
micro-seconds. This caused the iostat command to report %util as 100%,
which is incorrect.
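
As a rough illustration (assuming HZ=1000, i.e. 1ms jiffies): a disk
serving 1000 IOs per second at ~50 micro-seconds each is actually busy
for only 1000 * 50us = 50ms per second, i.e. ~5% utilization. But
because io_ticks advances in whole jiffies, every 1ms window that sees
an IO is charged a full milli-second, so io_ticks can grow by close to
1000ms per second and iostat computes %util as ~100%.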

Recreating the issue
====================
Setup
-----
Devices: 24 NVMe devices
Model number: 4610 (Intel)

fio
---
[global]
bs=4K
iodepth=1
direct=1
ioengine=libaio
group_reporting
time_based
runtime=100
thinktime=1ms
numjobs=1
name=raw-write
rw=randrw
ignore_error=EIO:EIO
[job1]
filename=/dev/nvme0n1

iostat o/p
----------

Device   %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme3n1   0.00    0.05    0.00  75.38     0.50     0.00   0.00 100.00

Device   %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme3n1   0.00    0.05    0.00  74.74     0.50     0.00   0.00 100.00

Solution
========
Use ktime_get_ns() to update the disk_stats io_ticks, so that io_ticks
is updated with nano-second resolution at every IO start and end.
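
For context, io_ticks is advanced by update_io_ticks(): whenever an IO
starts or ends, it charges the time elapsed since the last recorded
stamp, in whatever unit 'now' is supplied in. A simplified sketch of
its shape around this series (types as converted by patch 1/2; the
partition roll-up is omitted):

void update_io_ticks(struct block_device *part, u64 now, bool end)
{
	u64 stamp = READ_ONCE(part->bd_stamp);

	/* only the updater that wins the cmpxchg charges this window */
	if (unlikely(time_after64(now, stamp)) &&
	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
		/* whole gap on IO end, a single tick on IO start */
		__part_stat_add(part, io_ticks, end ? now - stamp : 1);
}

Because this logic is unit-agnostic, the series can switch the stamp
source from jiffies to ktime_get_ns() without changing the accounting
itself.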

Issues using ktime
==================

Using ktime_get_ns() has a performance overhead, as ktime reads the
clock every time it is called. Jens Axboe ran t/io_uring on the
following test environment, which showed a large performance drop.

Devices
-------
SSDs: P5800X
No of devices: 24

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

With the above environment and the ktime patch, there was a
performance drop of ~25% relative to iostats disabled, and ~19%
relative to the current iostats-enabled mode. This performance drop is
too high.

Suggestion from Jens Axboe
==========================
Jens Axboe suggested splitting the bigger patch into two as follows:

1. In the first patch, change all the types from unsigned long to u64;
   that can be done while retaining jiffies.

2. In the second patch, add an iostats == 2 setting, which enables the
   higher resolution mode using ktime. We'd still default to 1, the
   lower granularity mode with iostats enabled.

Fix details
===========
1. Use ktime_get_ns() to get the current time in nano-seconds when
   updating the io_ticks start and end time stamps in the block layer
   for IO accounting.

2. Add a new setting '2' for the iostats sysfs variable, i.e.
   /sys/block/<device-name>/queue/iostats, to enable IO accounting with
   nano-second (ktime) granularity. This setting should be enabled only
   when iostats are needed in high resolution mode, as it carries a
   large performance cost.

3. The settings 0 and 1 keep their earlier meaning: disable, and enable
   IO accounting with milli-second granularity (jiffies).

Testing
=======
Ran the t/io_uring command with the following setup:

Devices
-------
SSDs: P4610
No of devices: 8

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

io_uring o/p
------------
iostat=0, with patch: Maximum IOPS=10.09M
iostat=1, with patch: Maximum IOPS=9.84M
iostat=2, with patch: Maximum IOPS=9.48M

That is, iostat=2 costs about 3.7% versus iostat=1 and about 6% versus
iostat=0 -- far smaller than the ~19-25% drop seen when ktime was used
unconditionally.

Changes from V2 to V3
=====================
1. Changed all the required variables' data types to u64, as the first
   patch
2. Created a new setting '2' for iostats in sysfs, in this patch
3. Changed the code to use ktime values when iostats=2, in this patch

Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
---
 block/blk-core.c                  | 26 +++++++++++++++++----
 block/blk-mq.c                    |  4 ++--
 block/blk-sysfs.c                 | 39 ++++++++++++++++++++++++++++++-
 block/genhd.c                     | 18 ++++++++++----
 drivers/block/drbd/drbd_debugfs.c | 12 ++++++++--
 drivers/block/zram/zram_drv.c     |  3 ++-
 drivers/md/dm.c                   | 13 +++++++++--
 include/linux/blkdev.h            |  4 ++++
 8 files changed, 103 insertions(+), 16 deletions(-)

Comments

kernel test robot Dec. 21, 2022, 4:09 p.m. UTC | #1
Hi Gulam,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on song-md/md-next linus/master next-20221220]
[cannot apply to device-mapper-dm/for-next v6.1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gulam-Mohamed/block-Data-type-conversion-for-IO-accounting/20221221-121052
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20221221040506.1174644-2-gulam.mohamed%40oracle.com
patch subject: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns
config: x86_64-rhel-8.3-func
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/afff703cb450ec3806aa1a27406d21fc4eaa1f43
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Gulam-Mohamed/block-Data-type-conversion-for-IO-accounting/20221221-121052
        git checkout afff703cb450ec3806aa1a27406d21fc4eaa1f43
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/md/dm.c: In function 'dm_io_acct':
>> drivers/md/dm.c:507:13: error: redefinition of 'start_time'
     507 |         u64 start_time = (blk_queue_precise_io_stat(q) ?
         |             ^~~~~~~~~~
   drivers/md/dm.c:501:13: note: previous definition of 'start_time' with type 'u64' {aka 'long long unsigned int'}
     501 |         u64 start_time = io->start_time;
         |             ^~~~~~~~~~


vim +/start_time +507 drivers/md/dm.c

   497	
   498	static void dm_io_acct(struct dm_io *io, bool end)
   499	{
   500		struct dm_stats_aux *stats_aux = &io->stats_aux;
   501		u64 start_time = io->start_time;
   502		struct mapped_device *md = io->md;
   503		struct bio *bio = io->orig_bio;
   504		unsigned int sectors;
   505		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
   506	
 > 507		u64 start_time = (blk_queue_precise_io_stat(q) ?
   508					nsecs_to_jiffies(io->start_time) :
   509					io->start_time);
   510	
   511		/*
   512		 * If REQ_PREFLUSH set, don't account payload, it will be
   513		 * submitted (and accounted) after this flush completes.
   514		 */
   515		if (bio_is_flush_with_data(bio))
   516			sectors = 0;
   517		else if (likely(!(dm_io_flagged(io, DM_IO_WAS_SPLIT))))
   518			sectors = bio_sectors(bio);
   519		else
   520			sectors = io->sectors;
   521	
   522		if (!end)
   523			bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
   524					   start_time);
   525		else
   526			bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
   527	
   528		if (static_branch_unlikely(&stats_enabled) &&
   529		    unlikely(dm_stats_used(&md->stats))) {
   530			sector_t sector;
   531	
   532			if (likely(!dm_io_flagged(io, DM_IO_WAS_SPLIT)))
   533				sector = bio->bi_iter.bi_sector;
   534			else
   535				sector = bio_end_sector(bio) - io->sector_offset;
   536	
   537			dm_stats_account_io(&md->stats, bio_data_dir(bio),
   538					    sector, sectors,
   539					    end, start_time, stats_aux);
   540		}
   541	}
   542
Keith Busch Dec. 21, 2022, 5:09 p.m. UTC | #2
On Wed, Dec 21, 2022 at 04:05:06AM +0000, Gulam Mohamed wrote:
> +u64  blk_get_iostat_ticks(struct request_queue *q)
> +{
> +       return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +
>  void update_io_ticks(struct block_device *part, u64 now, bool end)
>  {
>  	u64 stamp;
> @@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
>  u64 bio_start_io_acct(struct bio *bio)
>  {
>  	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
> -				  bio_op(bio), jiffies);
> +				  bio_op(bio),
> +				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
>  }
>  EXPORT_SYMBOL_GPL(bio_start_io_acct);
>  
>  void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>  		      u64 start_time)
>  {
> +	u64 now;
> +	u64 duration;
> +	struct request_queue *q = bdev_get_queue(bdev);
>  	const int sgrp = op_stat_group(op);
> -	u64 now = READ_ONCE(jiffies);
> -	u64 duration = (unsigned long)now -(unsigned long) start_time;
> +	now = blk_get_iostat_ticks(q);;

I don't think you can rely on the blk_queue_precise_io_stat() flag in
the completion side. The user can toggle this with data in flight, which
means the completion may use different tick units than the start. I
think you'll need to add a flag to the request in the start accounting
side to know which method to use for the completion.
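
One rough sketch of this suggestion: latch the unit on the request when
start accounting runs, so completion uses the same clock even if the
sysfs knob is flipped mid-flight. The RQF_PRECISE_STAT name and bit
choice are hypothetical, not part of the posted patch:

/* hypothetical flag; would need a free rq_flags bit */
#define RQF_PRECISE_STAT	((__force req_flags_t)(1 << 23))

static void __blk_account_io_start(struct request *rq)
{
	if (blk_queue_precise_io_stat(rq->q))
		rq->rq_flags |= RQF_PRECISE_STAT;	/* latch the unit */
	part_stat_lock();
	update_io_ticks(rq->part, (rq->rq_flags & RQF_PRECISE_STAT) ?
			ktime_get_ns() : jiffies, false);
	part_stat_unlock();
}

static void __blk_account_io_done(struct request *req, u64 now)
{
	/* decide from the latched flag, not the live queue flag */
	u64 ticks = (req->rq_flags & RQF_PRECISE_STAT) ?
			ktime_get_ns() : jiffies;

	part_stat_lock();
	update_io_ticks(req->part, ticks, true);
	/* ios/nsecs accounting as before */
	part_stat_unlock();
}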

> @@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	struct disk_stats stat;
>  	unsigned int inflight;
> +	u64 stat_ioticks;
>  
>  	if (queue_is_mq(q))
>  		inflight = blk_mq_in_flight(q, bdev);
> @@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
>  
>  	if (inflight) {
>  		part_stat_lock();
> -		update_io_ticks(bdev, jiffies, true);
> +		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
>  		part_stat_unlock();
>  	}
>  	part_stat_read_all(bdev, &stat);
> +	stat_ioticks = (blk_queue_precise_io_stat(q) ?
> +				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
> +				jiffies_to_msecs(stat.io_ticks));


With the user able to change the precision at will, I think these
io_ticks need to have a consistent unit size. Mixing jiffies and nsecs
is going to create garbage stats output. Could existing io_ticks using
jiffies be converted to jiffies_to_nsecs() so that you only have one
unit to consider?
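
One way to get a single unit (a sketch, not from the posted series):
keep io_ticks in nano-seconds unconditionally and derive the coarse
value from jiffies when precise mode is off, so readers of the stat
never have to guess the unit:

u64 blk_get_iostat_ticks(struct request_queue *q)
{
	/* always nano-seconds; coarse mode still only advances
	 * once per jiffy, it is just expressed in ns */
	return blk_queue_precise_io_stat(q) ?
		ktime_get_ns() : jiffies_to_nsecs(jiffies);
}

part_stat_show() and diskstats_show() could then unconditionally report
div_u64(stat.io_ticks, NSEC_PER_MSEC), no matter how the flag is
toggled.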
kernel test robot Dec. 21, 2022, 10:54 p.m. UTC | #3
Hi Gulam,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on song-md/md-next linus/master next-20221220]
[cannot apply to device-mapper-dm/for-next v6.1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gulam-Mohamed/block-Data-type-conversion-for-IO-accounting/20221221-121052
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20221221040506.1174644-2-gulam.mohamed%40oracle.com
patch subject: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns
config: i386-randconfig-a014-20221219
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/afff703cb450ec3806aa1a27406d21fc4eaa1f43
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Gulam-Mohamed/block-Data-type-conversion-for-IO-accounting/20221221-121052
        git checkout afff703cb450ec3806aa1a27406d21fc4eaa1f43
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/md/dm.c:507:6: error: redefinition of 'start_time'
           u64 start_time = (blk_queue_precise_io_stat(q) ?
               ^
   drivers/md/dm.c:501:6: note: previous definition is here
           u64 start_time = io->start_time;
               ^
   1 error generated.


vim +/start_time +507 drivers/md/dm.c

   497	
   498	static void dm_io_acct(struct dm_io *io, bool end)
   499	{
   500		struct dm_stats_aux *stats_aux = &io->stats_aux;
   501		u64 start_time = io->start_time;
   502		struct mapped_device *md = io->md;
   503		struct bio *bio = io->orig_bio;
   504		unsigned int sectors;
   505		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
   506	
 > 507		u64 start_time = (blk_queue_precise_io_stat(q) ?
   508					nsecs_to_jiffies(io->start_time) :
   509					io->start_time);
   510	
   511		/*
   512		 * If REQ_PREFLUSH set, don't account payload, it will be
   513		 * submitted (and accounted) after this flush completes.
   514		 */
   515		if (bio_is_flush_with_data(bio))
   516			sectors = 0;
   517		else if (likely(!(dm_io_flagged(io, DM_IO_WAS_SPLIT))))
   518			sectors = bio_sectors(bio);
   519		else
   520			sectors = io->sectors;
   521	
   522		if (!end)
   523			bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
   524					   start_time);
   525		else
   526			bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
   527	
   528		if (static_branch_unlikely(&stats_enabled) &&
   529		    unlikely(dm_stats_used(&md->stats))) {
   530			sector_t sector;
   531	
   532			if (likely(!dm_io_flagged(io, DM_IO_WAS_SPLIT)))
   533				sector = bio->bi_iter.bi_sector;
   534			else
   535				sector = bio_end_sector(bio) - io->sector_offset;
   536	
   537			dm_stats_account_io(&md->stats, bio_data_dir(bio),
   538					    sector, sectors,
   539					    end, start_time, stats_aux);
   540		}
   541	}
   542
Gulam Mohamed Dec. 23, 2022, 2:47 p.m. UTC | #4
Hi Keith,

   Thanks for reviewing this request. Can you please see my inline comments?

Regards,
Gulam Mohamed.

-----Original Message-----
From: Keith Busch <kbusch@kernel.org> 
Sent: Wednesday, December 21, 2022 10:39 PM
To: Gulam Mohamed <gulam.mohamed@oracle.com>
Cc: linux-block@vger.kernel.org; axboe@kernel.dk; philipp.reisner@linbit.com; lars.ellenberg@linbit.com; christoph.boehmwalder@linbit.com; minchan@kernel.org; ngupta@vflare.org; senozhatsky@chromium.org; colyli@suse.de; kent.overstreet@gmail.com; agk@redhat.com; snitzer@kernel.org; dm-devel@redhat.com; song@kernel.org; dan.j.williams@intel.com; vishal.l.verma@intel.com; dave.jiang@intel.com; ira.weiny@intel.com; Junxiao Bi <junxiao.bi@oracle.com>; Martin Petersen <martin.petersen@oracle.com>; kch@nvidia.com; drbd-dev@lists.linbit.com; linux-kernel@vger.kernel.org; linux-bcache@vger.kernel.org; linux-raid@vger.kernel.org; nvdimm@lists.linux.dev; Konrad Wilk <konrad.wilk@oracle.com>; Joe Jin <joe.jin@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>; Shminderjit Singh <shminderjit.singh@oracle.com>
Subject: Re: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns

On Wed, Dec 21, 2022 at 04:05:06AM +0000, Gulam Mohamed wrote:
> +u64  blk_get_iostat_ticks(struct request_queue *q) {
> +       return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : 
> +jiffies); } EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +
>  void update_io_ticks(struct block_device *part, u64 now, bool end)  {
>  	u64 stamp;
> @@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
>  u64 bio_start_io_acct(struct bio *bio)  {
>  	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
> -				  bio_op(bio), jiffies);
> +				  bio_op(bio),
> +				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
>  }
>  EXPORT_SYMBOL_GPL(bio_start_io_acct);
>  
>  void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>  		      u64 start_time)
>  {
> +	u64 now;
> +	u64 duration;
> +	struct request_queue *q = bdev_get_queue(bdev);
>  	const int sgrp = op_stat_group(op);
> -	u64 now = READ_ONCE(jiffies);
> -	u64 duration = (unsigned long)now -(unsigned long) start_time;
> +	now = blk_get_iostat_ticks(q);;

I don't think you can rely on the blk_queue_precise_io_stat() flag in the completion side. The user can toggle this with data in flight, which means the completion may use different tick units than the start. I think you'll need to add a flag to the request in the start accounting side to know which method to use for the completion.

[GULAM]: As per my understanding, this may work for a single
request_queue implementation. But request-based accounting may be a
problem with multi-queue, as there is a separate queue for each CPU
while the time stamp being recorded for the block device is a global
one. Also, the issue you mentioned, where start and end accounting can
update the ticks in different units for in-flight IOs, only exists for
a short while around the toggle. So even if it works for MQ, I am
trying to understand how feasible it is to make this request-based
change for an issue that may exist for just a moment. Can you please
correct me if I am wrong, and expand on your suggestion so that I can
understand it properly?

> @@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	struct disk_stats stat;
>  	unsigned int inflight;
> +	u64 stat_ioticks;
>  
>  	if (queue_is_mq(q))
>  		inflight = blk_mq_in_flight(q, bdev); @@ -959,10 +960,13 @@ ssize_t 
> part_stat_show(struct device *dev,
>  
>  	if (inflight) {
>  		part_stat_lock();
> -		update_io_ticks(bdev, jiffies, true);
> +		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
>  		part_stat_unlock();
>  	}
>  	part_stat_read_all(bdev, &stat);
> +	stat_ioticks = (blk_queue_precise_io_stat(q) ?
> +				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
> +				jiffies_to_msecs(stat.io_ticks));


With the user able to change the precision at will, I think these io_ticks need to have a consistent unit size. Mixing jiffies and nsecs is going to create garbage stats output. Could existing io_ticks using jiffies be converted to jiffies_to_nsecs() so that you only have one unit to consider?
[GULAM]: I am not sure that will work, as jiffies_to_nsecs() just
multiplies by the jiffy period (1000000 ns at HZ=1000) to convert
jiffies to nano-seconds, so the converted values would still only have
jiffy resolution.
kernel test robot Dec. 24, 2022, 1:26 a.m. UTC | #5
Hi Gulam,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on song-md/md-next linus/master next-20221220]
[cannot apply to device-mapper-dm/for-next v6.1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gulam-Mohamed/block-Data-type-conversion-for-IO-accounting/20221221-121052
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20221221040506.1174644-2-gulam.mohamed%40oracle.com
patch subject: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns
config: loongarch-randconfig-c44-20221218
compiler: loongarch64-linux-gcc (GCC) 12.1.0

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

cocci warnings: (new ones prefixed by >>)
>> block/blk-core.c:995:31-32: Unneeded semicolon
Mike Snitzer Jan. 17, 2023, 5:52 p.m. UTC | #6
[comments/questions inlined below]

On Tue, Dec 20 2022 at 11:05P -0500,
Gulam Mohamed <gulam.mohamed@oracle.com> wrote:

<snip>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5670032fe932..0b5e4eb909a5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -927,6 +927,18 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>  }
>  EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>  
> +/*
> + * Get the time based upon the available granularity for io accounting
> + * If the resolution mode is set to precise (2) i.e
> + * (/sys/block/<device>/queue/iostats = 2), then this will return time
> + * in nano-seconds else this will return time in jiffies
> + */
> +u64  blk_get_iostat_ticks(struct request_queue *q)
> +{
> +       return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +

Would prefer to see blk_get_iostat_ticks tagged with 'static inline'
and moved to blkdev.h (obviously below blk_queue_precise_io_stat).

Also, just a general comment: I've historically been a bigger (ab)user
of jump_labels to avoid excess branching costs per IO, e.g. see commit
442761fd2b29 ("dm: conditionally enable branching for less used features").
It could be there is an opportunity for a follow-on patch that
leverages them?  Or should that just be left to each driver to decide
to do with its own resources (rather than have block core provide that)?

(if nothing in the system ever uses QUEUE_FLAG_PRECISE_IO_STAT then
it'd be nice to avoid needless branching).
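
For illustration, a minimal sketch of that jump_label approach (the
precise_iostat_key name is hypothetical):

DEFINE_STATIC_KEY_FALSE(precise_iostat_key);

/* queue_iostats_store(): on entering/leaving precise mode */
static_branch_inc(&precise_iostat_key);	/* val == 2 */
static_branch_dec(&precise_iostat_key);	/* leaving val == 2 */

u64 blk_get_iostat_ticks(struct request_queue *q)
{
	/* patched out to a plain jiffies read until some queue
	 * actually enables precise stats */
	if (static_branch_unlikely(&precise_iostat_key) &&
	    blk_queue_precise_io_stat(q))
		return ktime_get_ns();
	return jiffies;
}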

<snip>

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 011a85ea40da..1bb58a0b8cd1 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -482,7 +482,11 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
>  
>  u64 dm_start_time_ns_from_clone(struct bio *bio)
>  {
> -	return jiffies_to_nsecs(clone_to_tio(bio)->io->start_time);
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +	u64 start_time_ns = (blk_queue_precise_io_stat(q) ?
> +				clone_to_tio(bio)->io->start_time :
> +				jiffies_to_nsecs(clone_to_tio(bio)->io->start_time));
> +	return start_time_ns;
>  }
>  EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
>  
> @@ -498,6 +502,11 @@ static void dm_io_acct(struct dm_io *io, bool end)
>  	struct mapped_device *md = io->md;
>  	struct bio *bio = io->orig_bio;
>  	unsigned int sectors;
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> +	u64 start_time = (blk_queue_precise_io_stat(q) ?
> +				nsecs_to_jiffies(io->start_time) :
> +				io->start_time);
>  
>  	/*
>  	 * If REQ_PREFLUSH set, don't account payload, it will be
> @@ -589,7 +598,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>  	io->orig_bio = bio;
>  	io->md = md;
>  	spin_lock_init(&io->lock);
> -	io->start_time = jiffies;
> +	io->start_time = blk_get_iostat_ticks(bio->bi_bdev->bd_queue);
>  	io->flags = 0;
>  
>  	if (static_branch_unlikely(&stats_enabled))

The above seems correct, by checking the top-level DM device's
disposition on QUEUE_FLAG_PRECISE_IO_STAT, but I'm now wondering
if there should be code that ensures consistency across stacked
devices (which DM is the biggest creator/consumer of)?

dm_table_set_restrictions() deals with such inconsistencies at DM
device creation time (so basically: if any device in the DM table has
QUEUE_FLAG_PRECISE_IO_STAT set then the top-level DM device should
inherit that queue flag). A user could still override it but at least
the default will be sane initially.
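
For illustration, a sketch of that inheritance, modeled on the other
per-device-attribute checks in drivers/md/dm-table.c (the
device_precise_io_stat callback name is hypothetical):

static int device_precise_io_stat(struct dm_target *ti, struct dm_dev *dev,
				  sector_t start, sector_t len, void *data)
{
	return blk_queue_precise_io_stat(bdev_get_queue(dev->bdev));
}

/* in dm_table_set_restrictions(): */
if (dm_table_any_dev_attr(t, device_precise_io_stat, NULL))
	blk_queue_flag_set(QUEUE_FLAG_PRECISE_IO_STAT, q);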

Mike

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 5670032fe932..0b5e4eb909a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,6 +927,18 @@  int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 }
 EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
 
+/*
+ * Get the time based upon the available granularity for io accounting
+ * If the resolution mode is set to precise (2) i.e
+ * (/sys/block/<device>/queue/iostats = 2), then this will return time
+ * in nano-seconds else this will return time in jiffies
+ */
+u64  blk_get_iostat_ticks(struct request_queue *q)
+{
+       return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
+}
+EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
+
 void update_io_ticks(struct block_device *part, u64 now, bool end)
 {
 	u64 stamp;
@@ -968,20 +980,26 @@  EXPORT_SYMBOL(bdev_start_io_acct);
 u64 bio_start_io_acct(struct bio *bio)
 {
 	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
-				  bio_op(bio), jiffies);
+				  bio_op(bio),
+				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
 }
 EXPORT_SYMBOL_GPL(bio_start_io_acct);
 
 void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 		      u64 start_time)
 {
+	u64 now;
+	u64 duration;
+	struct request_queue *q = bdev_get_queue(bdev);
 	const int sgrp = op_stat_group(op);
-	u64 now = READ_ONCE(jiffies);
-	u64 duration = (unsigned long)now -(unsigned long) start_time;
+	now = blk_get_iostat_ticks(q);;
+	duration = (unsigned long)now -(unsigned long) start_time;
 
 	part_stat_lock();
 	update_io_ticks(bdev, now, true);
-	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
+	part_stat_add(bdev, nsecs[sgrp],
+		(blk_queue_precise_io_stat(q) ? duration :
+		 jiffies_to_nsecs(duration)));
 	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
 	part_stat_unlock();
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e6b3ccd4989..5318bf87f17e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -975,7 +975,7 @@  static void __blk_account_io_done(struct request *req, u64 now)
 	const int sgrp = op_stat_group(req_op(req));
 
 	part_stat_lock();
-	update_io_ticks(req->part, jiffies, true);
+	update_io_ticks(req->part, blk_get_iostat_ticks(req->q), true);
 	part_stat_inc(req->part, ios[sgrp]);
 	part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
 	part_stat_unlock();
@@ -1007,7 +1007,7 @@  static void __blk_account_io_start(struct request *rq)
 		rq->part = rq->q->disk->part0;
 
 	part_stat_lock();
-	update_io_ticks(rq->part, jiffies, false);
+	update_io_ticks(rq->part, blk_get_iostat_ticks(rq->q), false);
 	part_stat_unlock();
 }
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93d9e9c9a6ea..e7959782ce59 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -307,7 +307,6 @@  queue_##name##_store(struct request_queue *q, const char *page, size_t count) \
 
 QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
 QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
-QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
@@ -363,6 +362,44 @@  static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_iostats_show(struct request_queue *q, char *page)
+{
+       bool iostat = blk_queue_io_stat(q);
+       bool precise_iostat = blk_queue_precise_io_stat(q);
+
+       return queue_var_show(iostat << precise_iostat, page);
+}
+
+static ssize_t queue_iostats_store(struct request_queue *q, const char *page,
+                                  size_t count)
+{
+       unsigned long val;
+       ssize_t ret = -EINVAL;
+
+       ret = queue_var_store(&val, page, count);
+
+       if (ret < 0)
+               return ret;
+
+       if (val == 2) {
+               blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+               blk_queue_flag_set(QUEUE_FLAG_PRECISE_IO_STAT, q);
+               q->disk->part0->bd_stamp = 0;
+       }
+       else if (val == 1) {
+               blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+               blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+               q->disk->part0->bd_stamp = 0;
+       }
+       else if (val == 0) {
+               blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, q);
+               blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+               q->disk->part0->bd_stamp = 0;
+       }
+
+       return ret;
+}
+
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
diff --git a/block/genhd.c b/block/genhd.c
index 03a96d6473e1..d034219a4683 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -951,6 +951,7 @@  ssize_t part_stat_show(struct device *dev,
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct disk_stats stat;
 	unsigned int inflight;
+	u64 stat_ioticks;
 
 	if (queue_is_mq(q))
 		inflight = blk_mq_in_flight(q, bdev);
@@ -959,10 +960,13 @@  ssize_t part_stat_show(struct device *dev,
 
 	if (inflight) {
 		part_stat_lock();
-		update_io_ticks(bdev, jiffies, true);
+		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
 		part_stat_unlock();
 	}
 	part_stat_read_all(bdev, &stat);
+	stat_ioticks = (blk_queue_precise_io_stat(q) ?
+				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+				jiffies_to_msecs(stat.io_ticks));
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
@@ -979,7 +983,7 @@  ssize_t part_stat_show(struct device *dev,
 		(unsigned long long)stat.sectors[STAT_WRITE],
 		(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
 		inflight,
-		jiffies_to_msecs(stat.io_ticks),
+		(unsigned int)stat_ioticks,
 		(unsigned int)div_u64(stat.nsecs[STAT_READ] +
 				      stat.nsecs[STAT_WRITE] +
 				      stat.nsecs[STAT_DISCARD] +
@@ -1217,6 +1221,8 @@  static int diskstats_show(struct seq_file *seqf, void *v)
 	unsigned int inflight;
 	struct disk_stats stat;
 	unsigned long idx;
+	struct request_queue *q;
+	u64 stat_ioticks;
 
 	/*
 	if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1235,12 +1241,16 @@  static int diskstats_show(struct seq_file *seqf, void *v)
 		else
 			inflight = part_in_flight(hd);
 
+		q = bdev_get_queue(hd);
 		if (inflight) {
 			part_stat_lock();
-			update_io_ticks(hd, jiffies, true);
+			update_io_ticks(hd, blk_get_iostat_ticks(q), true);
 			part_stat_unlock();
 		}
 		part_stat_read_all(hd, &stat);
+		stat_ioticks = (blk_queue_precise_io_stat(q) ?
+				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+				jiffies_to_msecs(stat.io_ticks));
 		seq_printf(seqf, "%4d %7d %pg "
 			   "%lu %lu %lu %u "
 			   "%lu %lu %lu %u "
@@ -1260,7 +1270,7 @@  static int diskstats_show(struct seq_file *seqf, void *v)
 			   (unsigned int)div_u64(stat.nsecs[STAT_WRITE],
 							NSEC_PER_MSEC),
 			   inflight,
-			   jiffies_to_msecs(stat.io_ticks),
+			   (unsigned int)stat_ioticks,
 			   (unsigned int)div_u64(stat.nsecs[STAT_READ] +
 						 stat.nsecs[STAT_WRITE] +
 						 stat.nsecs[STAT_DISCARD] +
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index a72c096aa5b1..af193bcc4f3a 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -97,6 +97,12 @@  static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 {
 	/* change anything here, fixup header below! */
 	unsigned int s = req->rq_state;
+	unsigned long start_time;
+	struct request_queue *q = req->device->rq_queue;
+
+	start_time = (blk_queue_precise_io_stat(q) ?
+			nsecs_to_jiffies(req->start_jif) :
+			req->start_jif);
 
 #define RQ_HDR_1 "epoch\tsector\tsize\trw"
 	seq_printf(m, "0x%x\t%llu\t%u\t%s",
@@ -105,7 +111,7 @@  static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 		(s & RQ_WRITE) ? "W" : "R");
 
 #define RQ_HDR_2 "\tstart\tin AL\tsubmit"
-	seq_printf(m, "\t%d", jiffies_to_msecs(now - req->start_jif));
+	seq_printf(m, "\t%d", jiffies_to_msecs(now - start_time));
 	seq_print_age_or_dash(m, s & RQ_IN_ACT_LOG, now - req->in_actlog_jif);
 	seq_print_age_or_dash(m, s & RQ_LOCAL_PENDING, now - req->pre_submit_jif);
 
@@ -171,7 +177,9 @@  static void seq_print_waiting_for_AL(struct seq_file *m, struct drbd_resource *r
 			/* if the oldest request does not wait for the activity log
 			 * it is not interesting for us here */
 			if (req && !(req->rq_state & RQ_IN_ACT_LOG))
-				jif = req->start_jif;
+				jif = (blk_queue_precise_io_stat(device->rq_queue) ?
+						nsecs_to_jiffies(req->start_jif):
+						req->start_jif);
 			else
 				req = NULL;
 			spin_unlock_irq(&device->resource->req_lock);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index da28eb83e6ed..8e22487de7de 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1663,6 +1663,7 @@  static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	struct zram *zram;
 	struct bio_vec bv;
 	u64 start_time;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
@@ -1682,7 +1683,7 @@  static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_offset = 0;
 
 	start_time = bdev_start_io_acct(bdev->bd_disk->part0,
-			SECTORS_PER_PAGE, op, jiffies);
+			SECTORS_PER_PAGE, op, blk_get_iostat_ticks(q));
 	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
 	bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
 out:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 011a85ea40da..1bb58a0b8cd1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -482,7 +482,11 @@  static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
 
 u64 dm_start_time_ns_from_clone(struct bio *bio)
 {
-	return jiffies_to_nsecs(clone_to_tio(bio)->io->start_time);
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	u64 start_time_ns = (blk_queue_precise_io_stat(q) ?
+				clone_to_tio(bio)->io->start_time :
+				jiffies_to_nsecs(clone_to_tio(bio)->io->start_time));
+	return start_time_ns;
 }
 EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
 
@@ -498,6 +502,11 @@  static void dm_io_acct(struct dm_io *io, bool end)
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
 	unsigned int sectors;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	u64 start_time = (blk_queue_precise_io_stat(q) ?
+				nsecs_to_jiffies(io->start_time) :
+				io->start_time);
 
 	/*
 	 * If REQ_PREFLUSH set, don't account payload, it will be
@@ -589,7 +598,7 @@  static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->orig_bio = bio;
 	io->md = md;
 	spin_lock_init(&io->lock);
-	io->start_time = jiffies;
+	io->start_time = blk_get_iostat_ticks(bio->bi_bdev->bd_queue);
 	io->flags = 0;
 
 	if (static_branch_unlikely(&stats_enabled))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ca94d690d292..0543a536c6e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,7 @@  struct request_queue {
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_PRECISE_IO_STAT 32   /* To enable precise io accounting */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -578,6 +579,7 @@  struct request_queue {
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
+u64  blk_get_iostat_ticks(struct request_queue *q);
 
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
@@ -589,6 +591,8 @@  bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_stable_writes(q) \
 	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
+#define blk_queue_precise_io_stat(q)   \
+	test_bit(QUEUE_FLAG_PRECISE_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
 	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)