Message ID | 20241004140514.20209-1-paulluselinux@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Series | md: raid1 mixed RW performance improvement |
Context | Check | Description |
---|---|---
mdraidci/vmtest-md-6_12-PR | success | PR summary |
mdraidci/vmtest-md-6_12-VM_Test-0 | success | Logs for per-patch-testing |
Hi,

On 2024/10/04 22:05, Paul Luse wrote:
> While working on read_balance() earlier this year I realized that
> for nonrot media the primary disk selection criterion for RAID1 reads
> was to choose the disk with the least pending IO.
[...]
> +static inline void mirror_rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
> +{
> +	atomic_dec(&rdev->nr_reads_pending);
> +	rdev_dec_pending(rdev, mddev);

I know that this will cause more code changes, will it make more sense
just to split writes and reads? With the following APIs:

rdev_inc_pending(rdev, rw);
rdev_dec_pending(rdev, rw);
rdev_pending(rdev);
rdev_rw_pending(rdev, rw);

Thanks,
Kuai
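For reference, a minimal userspace sketch of what such split read/write
pending counters could look like is below. The helper names follow the
suggestion above, but the rw-indexed counter array, the mock struct, and the
C11 atomics are illustrative assumptions only; they are not part of this
series or of the existing md code.

/* Hypothetical sketch of split per-direction pending counters, modeled in
 * userspace with C11 atomics. The real md code would use atomic_t inside
 * struct md_rdev; this only illustrates the shape of the proposed helpers.
 */
#include <stdatomic.h>
#include <stdio.h>

enum { RDEV_READ = 0, RDEV_WRITE = 1 };

struct md_rdev_mock {
	atomic_uint nr_pending[2];	/* [RDEV_READ], [RDEV_WRITE] */
};

static void rdev_inc_pending(struct md_rdev_mock *rdev, int rw)
{
	atomic_fetch_add(&rdev->nr_pending[rw], 1);
}

static void rdev_dec_pending(struct md_rdev_mock *rdev, int rw)
{
	atomic_fetch_sub(&rdev->nr_pending[rw], 1);
}

static unsigned int rdev_rw_pending(struct md_rdev_mock *rdev, int rw)
{
	return atomic_load(&rdev->nr_pending[rw]);
}

static unsigned int rdev_pending(struct md_rdev_mock *rdev)
{
	return rdev_rw_pending(rdev, RDEV_READ) +
	       rdev_rw_pending(rdev, RDEV_WRITE);
}

int main(void)
{
	struct md_rdev_mock rdev = { .nr_pending = { 0, 0 } };

	/* One read and two writes in flight. */
	rdev_inc_pending(&rdev, RDEV_READ);
	rdev_inc_pending(&rdev, RDEV_WRITE);
	rdev_inc_pending(&rdev, RDEV_WRITE);
	printf("reads=%u writes=%u total=%u\n",
	       rdev_rw_pending(&rdev, RDEV_READ),
	       rdev_rw_pending(&rdev, RDEV_WRITE),
	       rdev_pending(&rdev));
	rdev_dec_pending(&rdev, RDEV_WRITE);
	return 0;
}

With counters kept per direction, the read-balance code could weight reads
and writes separately without the extra nr_reads_pending counter this patch
adds, at the cost of touching every inc/dec site in the write path.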
On 10/7/24 8:20 PM, Yu Kuai wrote:
> Hi,
>
> On 2024/10/04 22:05, Paul Luse wrote:
>> While working on read_balance() earlier this year I realized that
>> for nonrot media the primary disk selection criterion for RAID1 reads
>> was to choose the disk with the least pending IO.
[...]
>> +static inline void mirror_rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>> +{
>> +	atomic_dec(&rdev->nr_reads_pending);
>> +	rdev_dec_pending(rdev, mddev);
>
> I know that this will cause more code changes, will it make more sense
> just to split writes and reads? With the following APIs:
>
> rdev_inc_pending(rdev, rw);
> rdev_dec_pending(rdev, rw);
> rdev_pending(rdev);
> rdev_rw_pending(rdev, rw);
>
> Thanks,
> Kuai

Thanks, Kuai. Yes, I started that way and, as mentioned in the commit
message, the write path will require more testing and closer review as
there are many inc/dec calls compared to the read path. I will work on
it though, thanks!

-Paul
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5d2e6bd58e4d..1a1040ec3c4a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -162,6 +162,9 @@ struct md_rdev {
 					 */
 	};
 
+	atomic_t	nr_reads_pending;	/* tracks only mirrored reads pending
+						 * to support a performance optimization
+						 */
 	atomic_t	nr_pending;	/* number of pending requests.
 					 * only maintained for arrays that
 					 * support hot removal
@@ -923,6 +926,12 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 	}
 }
 
+static inline void mirror_rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
+{
+	atomic_dec(&rdev->nr_reads_pending);
+	rdev_dec_pending(rdev, mddev);
+}
+
 extern const struct md_cluster_operations *md_cluster_ops;
 static inline int mddev_is_clustered(struct mddev *mddev)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f55c8e67d059..5315b46d2cca 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -394,7 +394,7 @@ static void raid1_end_read_request(struct bio *bio)
 
 	if (uptodate) {
 		raid_end_bio_io(r1_bio);
-		rdev_dec_pending(rdev, conf->mddev);
+		mirror_rdev_dec_pending(rdev, conf->mddev);
 	} else {
 		/*
 		 * oops, read error:
@@ -584,6 +584,7 @@ static void update_read_sectors(struct r1conf *conf, int disk,
 	struct raid1_info *info = &conf->mirrors[disk];
 
 	atomic_inc(&info->rdev->nr_pending);
+	atomic_inc(&info->rdev->nr_reads_pending);
 	if (info->next_seq_sect != this_sector)
 		info->seq_start = this_sector;
 	info->next_seq_sect = this_sector + len;
@@ -760,9 +761,11 @@ struct read_balance_ctl {
 	int readable_disks;
 };
 
+#define WRITE_WEIGHT 2
 static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 {
 	int disk;
+	int nonrot = READ_ONCE(conf->nonrot_disks);
 	struct read_balance_ctl ctl = {
 		.closest_dist_disk	= -1,
 		.closest_dist		= MaxSector,
@@ -774,7 +777,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 	for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
 		struct md_rdev *rdev;
 		sector_t dist;
-		unsigned int pending;
+		unsigned int total_pending, reads_pending;
 
 		if (r1_bio->bios[disk] == IO_BLOCKED)
 			continue;
@@ -787,7 +790,21 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 		if (ctl.readable_disks++ == 1)
 			set_bit(R1BIO_FailFast, &r1_bio->state);
 
-		pending = atomic_read(&rdev->nr_pending);
+		total_pending = atomic_read(&rdev->nr_pending);
+		if (nonrot) {
+			/* for nonrot we weigh writes slightly heavier than
+			 * reads when deciding disk based on pending IOs as
+			 * writes typically take longer
+			 */
+			reads_pending = atomic_read(&rdev->nr_reads_pending);
+			if (total_pending > reads_pending) {
+				int writes;
+
+				writes = total_pending - reads_pending;
+				writes += (writes >> WRITE_WEIGHT);
+				total_pending = writes + reads_pending;
+			}
+		}
 		dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
 
 		/* Don't change to another disk for sequential reads */
@@ -799,7 +816,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 			 * Add 'pending' to avoid choosing this disk if
 			 * there is other idle disk.
 			 */
-			pending++;
+			total_pending++;
 			/*
 			 * If there is no other idle disk, this disk
 			 * will be chosen.
@@ -807,8 +824,8 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 			ctl.sequential_disk = disk;
 		}
 
-		if (ctl.min_pending > pending) {
-			ctl.min_pending = pending;
+		if (ctl.min_pending > total_pending) {
+			ctl.min_pending = total_pending;
 			ctl.min_pending_disk = disk;
 		}
 
@@ -831,8 +848,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 	 * disk is rotational, which might/might not be optimal for raids with
	 * mixed ratation/non-rotational disks depending on workload.
	 */
-	if (ctl.min_pending_disk != -1 &&
-	    (READ_ONCE(conf->nonrot_disks) || ctl.min_pending == 0))
+	if (ctl.min_pending_disk != -1 && (nonrot || ctl.min_pending == 0))
 		return ctl.min_pending_disk;
 	else
 		return ctl.closest_dist_disk;
@@ -2622,7 +2638,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	}
 
-	rdev_dec_pending(rdev, conf->mddev);
+	mirror_rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
 	bio = r1_bio->master_bio;
While working on read_balance() earlier this year I realized that
for nonrot media the primary disk selection criterion for RAID1 reads
was to choose the disk with the least pending IO. At the same time it
was noticed that this rarely worked out to a 50/50 split between
the disks. Initially I looked at a round-robin scheme; however, this
proved to be too complex when taking concurrency into account.

That led to this patch. Writes typically take longer than
reads for nonrot media, so choosing the disk with the least pending
IO without knowing the mix of outstanding IO, reads vs writes, is not
optimal.

This patch takes a very simple implementation approach to place more
weight on writes vs reads when selecting a disk to read from. Based
on empirical testing of NVMe and SATA drives from multiple vendors (some
data included below), I found that weighting writes by 1.25x and rounding
gave the best overall performance improvement. The boost varies by
drive, as do most drive-dependent performance optimizations. Kioxia
gets the least benefit while Samsung gets the most. I also confirmed
no impact on pure read cases (or writes, of course). I left the total
pending counter in place and simply added one specific to reads; there
was no reason to count writes separately, especially given the additional
complexity in the write path for tracking pending IO.

The fewer writes that are outstanding, the less positive impact this
patch has, so the math works out well. Here are the non-weighted
and weighted values for outstanding writes. The first column
is the unweighted value and the second is what is used with this patch.
Until there are at least 4 pending, there is no change. The more pending,
the more the value is weighted, which is perfect for how the drives behave.

 1   1
 2   2
 3   3
 4   5
 5   6
 6   7
 7   8
 8  10
 9  11
10  12
11  13
12  15
13  16
14  17
15  18
16  20

Below is performance data for the patch, 3 different NVMe drives
and one SATA drive.

WD SATA: https://photos.app.goo.gl/1smadpDEzgLaa5G48
WD NVMe: https://photos.app.goo.gl/YkTTcYfU8Yc8XWA58
Samsung NVMe: https://photos.app.goo.gl/F6MvEfmbGtRyPUFz6
Kioxia NVMe: https://photos.app.goo.gl/BAEhi8hUwsdTyj9y5

Signed-off-by: Paul Luse <paulluselinux@gmail.com>
---
 drivers/md/md.h    |  9 +++++++++
 drivers/md/raid1.c | 34 +++++++++++++++++++++++++---------
 2 files changed, 34 insertions(+), 9 deletions(-)
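To make the weighting concrete, the small standalone program below (a
userspace sketch for illustration, not kernel code from the patch) applies
the same writes + (writes >> WRITE_WEIGHT) adjustment that choose_best_rdev()
uses with WRITE_WEIGHT == 2, and reproduces the unweighted/weighted table
above assuming no reads are outstanding.

/* Standalone illustration of the write weighting used in choose_best_rdev().
 * With WRITE_WEIGHT == 2, pending writes are scaled by roughly 1.25x
 * (writes + writes/4, truncated), while pending reads count as-is.
 */
#include <stdio.h>

#define WRITE_WEIGHT 2

static unsigned int weighted_pending(unsigned int reads, unsigned int writes)
{
	writes += (writes >> WRITE_WEIGHT);	/* ~1.25x, rounded down */
	return reads + writes;
}

int main(void)
{
	/* Reproduce the table from the commit message: no reads pending,
	 * outstanding writes from 1 to 16.
	 */
	for (unsigned int writes = 1; writes <= 16; writes++)
		printf("%2u %2u\n", writes, weighted_pending(0, writes));
	return 0;
}

Because the weight is a right shift by two, the adjustment is roughly +25%
rounded down, so it only starts to matter once at least four writes are
pending, which matches the table and the behavior described above.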