From patchwork Fri Oct 4 14:05:14 2024
X-Patchwork-Submitter: Paul E Luse
X-Patchwork-Id: 13822443
From: Paul Luse
To: linux-raid@vger.kernel.org
Cc: paul.e.luse@intel.com, Paul Luse
Subject: [PATCH] md: raid1 mixed RW performance improvement
Date: Fri, 4 Oct 2024 07:05:14 -0700
Message-ID: <20241004140514.20209-1-paulluselinux@gmail.com>
X-Mailer: git-send-email 2.46.0
Precedence: bulk
X-Mailing-List: linux-raid@vger.kernel.org
MIME-Version: 1.0

While working on read_balance() earlier this year I realized that for
nonrot media the primary disk selection criterion for RAID1 reads was
to choose the disk with the least pending IO. At the same time I
noticed that this rarely worked out to a 50/50 split between the
disks. Initially I looked at a round-robin scheme; however, that
proved too complex once concurrency was taken into account. That led
to this patch.

Writes typically take longer than reads on nonrot media, so choosing
the disk with the least pending IO without knowing the mix of
outstanding IO (reads vs writes) is not optimal. This patch takes a
very simple implementation approach to place more weight on writes
than reads when selecting a disk to read from. Based on empirical
testing of NVMe and SATA drives from multiple vendors (some data
included below), I found that weighting writes by 1.25x (rounded down)
gave the best overall performance improvement. The boost varies by
drive, as it does for most drive-dependent performance optimizations:
Kioxia gets the least benefit while Samsung gets the most. I also
confirmed there is no impact on pure read cases (or pure writes, of
course).

I left the total pending counter in place and simply added one
specific to reads; there was no reason to count writes separately,
especially given the additional complexity that would add to the
write path for tracking pending IO. The fewer writes that are
outstanding, the less positive impact this patch has, so the math
works out well. Here are the unweighted and weighted values for
outstanding writes; the first column is the unweighted value and the
second is what is used with this patch. With fewer than 4 writes
pending there is no change, and the more that are pending the more
the value is weighted, which matches how the drives behave.

  unweighted  weighted
       1         1
       2         2
       3         3
       4         5
       5         6
       6         7
       7         8
       8        10
       9        11
      10        12
      11        13
      12        15
      13        16
      14        17
      15        18
      16        20

Below is performance data for the patch on 3 different NVMe drives
and one SATA drive.

WD SATA:      https://photos.app.goo.gl/1smadpDEzgLaa5G48
WD NVMe:      https://photos.app.goo.gl/YkTTcYfU8Yc8XWA58
Samsung NVMe: https://photos.app.goo.gl/F6MvEfmbGtRyPUFz6
Kioxia NVMe:  https://photos.app.goo.gl/BAEhi8hUwsdTyj9y5

Signed-off-by: Paul Luse
---
 drivers/md/md.h    |  9 +++++++++
 drivers/md/raid1.c | 34 +++++++++++++++++++++++++---------
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5d2e6bd58e4d..1a1040ec3c4a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -162,6 +162,9 @@ struct md_rdev {
                                          */
        };
 
+       atomic_t        nr_reads_pending;       /* tracks only mirrored reads pending
+                                                * to support a performance optimization
+                                                */
        atomic_t        nr_pending;     /* number of pending requests.
                                         * only maintained for arrays that
                                         * support hot removal
@@ -923,6 +926,12 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
        }
 }
 
+static inline void mirror_rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
+{
+       atomic_dec(&rdev->nr_reads_pending);
+       rdev_dec_pending(rdev, mddev);
+}
+
 extern const struct md_cluster_operations *md_cluster_ops;
 static inline int mddev_is_clustered(struct mddev *mddev)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f55c8e67d059..5315b46d2cca 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -394,7 +394,7 @@ static void raid1_end_read_request(struct bio *bio)
 
        if (uptodate) {
                raid_end_bio_io(r1_bio);
-               rdev_dec_pending(rdev, conf->mddev);
+               mirror_rdev_dec_pending(rdev, conf->mddev);
        } else {
                /*
                 * oops, read error:
@@ -584,6 +584,7 @@ static void update_read_sectors(struct r1conf *conf, int disk,
        struct raid1_info *info = &conf->mirrors[disk];
 
        atomic_inc(&info->rdev->nr_pending);
+       atomic_inc(&info->rdev->nr_reads_pending);
        if (info->next_seq_sect != this_sector)
                info->seq_start = this_sector;
        info->next_seq_sect = this_sector + len;
@@ -760,9 +761,11 @@ struct read_balance_ctl {
        int readable_disks;
 };
 
+#define WRITE_WEIGHT 2
 static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
 {
        int disk;
+       int nonrot = READ_ONCE(conf->nonrot_disks);
        struct read_balance_ctl ctl = {
                .closest_dist_disk      = -1,
                .closest_dist           = MaxSector,
@@ -774,7 +777,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
        for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
                struct md_rdev *rdev;
                sector_t dist;
-               unsigned int pending;
+               unsigned int total_pending, reads_pending;
 
                if (r1_bio->bios[disk] == IO_BLOCKED)
                        continue;
@@ -787,7 +790,21 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
                if (ctl.readable_disks++ == 1)
                        set_bit(R1BIO_FailFast, &r1_bio->state);
 
-               pending = atomic_read(&rdev->nr_pending);
+               total_pending = atomic_read(&rdev->nr_pending);
+               if (nonrot) {
+                       /* for nonrot we weigh writes slightly heavier than
+                        * reads when deciding disk based on pending IOs as
+                        * writes typically take longer
+                        */
+                       reads_pending = atomic_read(&rdev->nr_reads_pending);
+                       if (total_pending > reads_pending) {
+                               int writes;
+
+                               writes = total_pending - reads_pending;
+                               writes += (writes >> WRITE_WEIGHT);
+                               total_pending = writes + reads_pending;
+                       }
+               }
                dist = abs(r1_bio->sector -
                           conf->mirrors[disk].head_position);
                /* Don't change to another disk for sequential reads */
@@ -799,7 +816,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
                         * Add 'pending' to avoid choosing this disk if
                         * there is other idle disk.
                         */
-                       pending++;
+                       total_pending++;
                        /*
                         * If there is no other idle disk, this disk
                         * will be chosen.
@@ -807,8 +824,8 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
                        ctl.sequential_disk = disk;
                }
 
-               if (ctl.min_pending > pending) {
-                       ctl.min_pending = pending;
+               if (ctl.min_pending > total_pending) {
+                       ctl.min_pending = total_pending;
                        ctl.min_pending_disk = disk;
                }
 
@@ -831,8 +848,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
         * disk is rotational, which might/might not be optimal for raids with
         * mixed ratation/non-rotational disks depending on workload.
         */
-       if (ctl.min_pending_disk != -1 &&
-           (READ_ONCE(conf->nonrot_disks) || ctl.min_pending == 0))
+       if (ctl.min_pending_disk != -1 && (nonrot || ctl.min_pending == 0))
                return ctl.min_pending_disk;
        else
                return ctl.closest_dist_disk;
@@ -2622,7 +2638,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
                r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
        }
 
-       rdev_dec_pending(rdev, conf->mddev);
+       mirror_rdev_dec_pending(rdev, conf->mddev);
        sector = r1_bio->sector;
        bio = r1_bio->master_bio;
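
Not part of the patch, but for anyone who wants to sanity-check the
unweighted/weighted table in the commit message: the small userspace
sketch below mirrors the arithmetic choose_best_rdev() now applies.
WRITE_WEIGHT and the shift are taken verbatim from the patch; the
helper name weighted_pending() and the test harness are made up here
purely for illustration.

/*
 * Illustrative sketch only -- reproduces the weighting math from
 * choose_best_rdev() in userspace so the table above can be verified.
 */
#include <stdio.h>

#define WRITE_WEIGHT 2	/* same shift as the patch: writes * 1.25, rounded down */

/* hypothetical helper mirroring the in-kernel calculation */
static unsigned int weighted_pending(unsigned int total_pending,
				     unsigned int reads_pending)
{
	if (total_pending > reads_pending) {
		unsigned int writes = total_pending - reads_pending;

		writes += (writes >> WRITE_WEIGHT);
		total_pending = writes + reads_pending;
	}
	return total_pending;
}

int main(void)
{
	unsigned int writes;

	/* reads_pending == 0, so total_pending is all writes, as in the table */
	for (writes = 1; writes <= 16; writes++)
		printf("%2u %2u\n", writes, weighted_pending(writes, 0));
	return 0;
}

Running this prints the same sixteen value pairs listed in the commit
message, e.g. 4 -> 5 and 16 -> 20.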