From: Coly Li
To: Avi Kivity, NeilBrown
Cc: linux-raid@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: raid0 vs. mkfs
Date: Wed, 7 Dec 2016 19:50:33 +0800

On 2016/11/30 6:45 AM, Avi Kivity wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
[snip]
>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>>
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging. This
>> benefits all callers.
>
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
>
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
>
> Generating millions of requests only to merge them again is
> inefficient. It happens in an edge case (TRIM of the entirety of a very
> large RAID), but it already caused one user to believe the system
> failed. I think the system should be more robust than that.
Neil,

As I understand it, when raid0_make_request() receives a large discard bio,
for example one which discards chunks 1 to 24 of a raid0 device built from
4 SSDs, the bio is split and issued to the SSDs in the following layout:

  SSD1: C1,C5,C9,C13,C17,C21
  SSD2: C2,C6,C10,C14,C18,C22
  SSD3: C3,C7,C11,C15,C19,C23
  SSD4: C4,C8,C12,C16,C20,C24

The current raid0 code calls generic_make_request() 24 times, once for each
split bio. But the final layout of every split bio can be calculated up
front, so all of them can be combined into four large per-SSD bios:

  bio1 (on SSD1): C{1,5,9,13,17,21}
  bio2 (on SSD2): C{2,6,10,14,18,22}
  bio3 (on SSD3): C{3,7,11,15,19,23}
  bio4 (on SSD4): C{4,8,12,16,20,24}

Now only 4 calls to generic_make_request() are needed.

Rebuilding the per-device discard bios is more efficient in the raid0 code
than in the block layer, for several reasons that I know of:
- There is the splice timeout: the block layer cannot merge all of the
  split bios into one large bio before it expires.
- Rebuilding the per-device bios in raid0 takes only a little arithmetic,
  while the block layer merges requests on the queue with list operations,
  which is slower.
- The raid0 code knows its own on-disk layout, so rebuilding per-device
  bios is possible there; the block layer has no idea of the raid0 layout
  and can only do request merging.

Avi,

I have composed a prototype patch. The code is not simple; indeed it is
quite complicated, IMHO.

I did a little research: some NVMe SSDs support a whole-device-size
DISCARD, and I observe that mkfs.xfs sends a DISCARD bio the size of the
whole raid0 device down to the block layer. But raid0_make_request() only
receives 512KB DISCARD bios, because block/blk-lib.c:__blkdev_issue_discard()
splits the original large bio into 512KB pieces; the limitation comes from
q->limits.discard_granularity. At the moment I do not know why
q->limits.discard_granularity is 512KB even when the underlying SSD
supports a whole-device-size discard.

We also need to fix q->limits.discard_granularity, otherwise
block/blk-lib.c:__blkdev_issue_discard() still runs an inefficient loop
that splits the original large discard bio into smaller ones and sends
them to the raid0 code via next_bio().

I also CC this email to linux-block@vger.kernel.org to ask for help. My
question is: if an NVMe SSD supports a whole-device-size DISCARD, is
q->limits.discard_granularity still necessary?
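To make the per-SSD layout above concrete, here is a small user-space
sketch. It is an illustration only, not code from the patch below; it
assumes a single-zone raid0 with equal-size members, and NDISKS, NCHUNKS
and CHUNK_SECTS are example values, not values taken from the patch.

#include <stdio.h>

#define NDISKS		4
#define NCHUNKS		24
#define CHUNK_SECTS	1024	/* assumed 512KB chunk, in 512-byte sectors */

int main(void)
{
	unsigned long long start[NDISKS] = { 0 }, len[NDISKS] = { 0 };
	int n, d;

	/* Chunk Cn lives on disk (n - 1) % NDISKS at per-disk chunk offset
	 * (n - 1) / NDISKS, so consecutive chunks on one disk are adjacent
	 * and a discard of chunks 1..24 collapses into one range per disk.
	 */
	for (n = 1; n <= NCHUNKS; n++) {
		unsigned long long sect;

		d = (n - 1) % NDISKS;
		sect = (unsigned long long)((n - 1) / NDISKS) * CHUNK_SECTS;
		if (len[d] == 0)
			start[d] = sect;
		len[d] += CHUNK_SECTS;
	}

	for (d = 0; d < NDISKS; d++)
		printf("SSD%d: one discard bio, sector %llu, %llu sectors\n",
		       d + 1, start[d], len[d]);
	return 0;
}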
Here I also attach my prototype patch as a proof of concept; it is runnable
with Linux 4.9-rc7. (A back-of-the-envelope check of the call counts quoted
in the changelog is appended after the patch.)

Coly

Subject: optimization for large size DISCARD bio by per-device bios

This is a very early prototype; it still needs more block layer code
modification to make it work.

The current upstream raid0_make_request() handles a TRIM/DISCARD bio only
chunk by chunk, which means that for a large raid0 device built from SSDs,
generic_make_request() may be called millions of times for the split bios.
This patch tries to combine the small bios into one large bio per real
device when they are contiguous on that device, and then to send each
combined bio to the underlying device with a single call to
generic_make_request().

For example, when mkfs.xfs trims a raid0 device built from 4 x 3TB NVMe
SSDs, the current upstream raid0_make_request() calls generic_make_request()
5.7 million times; with this patch only 4 calls to generic_make_request()
are required.

This patch does not work in the real world yet, because
block/blk-lib.c:__blkdev_issue_discard() splits the original large bio into
smaller ones due to the restriction of discard_granularity. If some day
SSDs support a whole-device-sized discard_granularity, it will be very
interesting then...

The basic idea is: when raid0_make_request() receives a large discard bio,
for example one which discards chunks 1 to 24 of a raid0 device built from
4 SSDs, the bio is split and issued to the SSDs in the following layout:

  SSD1: C1,C5,C9,C13,C17,C21
  SSD2: C2,C6,C10,C14,C18,C22
  SSD3: C3,C7,C11,C15,C19,C23
  SSD4: C4,C8,C12,C16,C20,C24

The current raid0 code calls generic_make_request() 24 times, once for each
split bio. But the final layout of every split bio can be calculated up
front, so all of them can be combined into four large per-SSD bios:

  bio1 (on SSD1): C{1,5,9,13,17,21}
  bio2 (on SSD2): C{2,6,10,14,18,22}
  bio3 (on SSD3): C{3,7,11,15,19,23}
  bio4 (on SSD4): C{4,8,12,16,20,24}

Now only 4 calls to generic_make_request() are needed.

The code is not simple; I need more time to write text to explain how it
works. Currently you can treat it as a proof of concept.

Signed-off-by: Coly Li
---
 drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..73c5fac 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -452,6 +452,225 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t bi_sector;
+	unsigned long sectors;
+	struct md_rdev *rdev;
+};
+
+static void handle_discard_request(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+
+	/* If the bio does not exceed a chunk boundary,
+	 * simply handle it here.
+	 */
+	if (sectors >= bio_sectors(bio)) {
+		bio->bi_bdev = first_rdev->bdev;
+		bio->bi_iter.bi_sector = sector + first_zone->dev_start +
+			first_rdev->data_offset;
+		if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(bio);
+		else
+			generic_make_request(bio);
+		return;
+	}
+
+	/* The bio is large enough to be split, allocate recs first */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+			"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	/* Restore due to sector_div */
+	sector = bio->bi_iter.bi_sector;
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+	BUG_ON(recs[first_rdev_idx].rdev != first_rdev);
+
+	/* recs[first_rdev_idx] is initialized with 'sectors'; we need to
+	 * handle the remaining sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* The bio may not be chunk size aligned, so the split bio on the
+	 * first rdev may not be chunk size aligned either. But the split
+	 * bios on the remaining rdevs must be chunk size aligned, and
+	 * aligned to a rounded-down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		? sector & (~(chunk_sects - 1))
+		: chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+		if (sectors <= chunk_sects) {
+			recs[rdev_idx].sectors = sectors;
+			goto issue;
+		}
+		recs[rdev_idx].sectors = chunk_sects;
+		sectors -= chunk_sects;
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into the next zone, we
+		 * won't handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors = 1;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* If .sectors is not initialized (== 0), it indicates
+			 * that .bi_sector is not initialized either. We set
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors = 1;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors)
+				grow_sectors += rested_sectors;
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (add_rested_sectors)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered requests layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+			recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+		recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +682,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_request(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;
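As a postscript to the changelog numbers above: the "5.7 million vs. 4
calls" figure can be checked with a trivial user-space calculation. This
is an illustration only; the 3TB member size and the 2MB chunk size are
assumptions chosen to match the order of magnitude, not values taken from
the patch.

#include <stdio.h>

int main(void)
{
	/* assumed geometry: 4 x 3TB members, 2MB raid0 chunk */
	unsigned long long member_bytes = 3000ULL * 1000 * 1000 * 1000;
	unsigned long long chunk_bytes = 2ULL * 1024 * 1024;
	int ndisks = 4;

	unsigned long long array_bytes = member_bytes * ndisks;

	/* current code: one generic_make_request() per chunk-sized split */
	printf("per-chunk split bios: %llu\n", array_bytes / chunk_bytes);
	/* with the patch: one combined discard bio per member device */
	printf("per-device bios:      %d\n", ndisks);
	return 0;
}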