raid0 vs. mkfs

On 2016/11/30 上午6:45, Avi Kivity wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
[snip]

>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
> 
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
> 
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
> 
> Generating millions of requests only to merge them again is
> inefficient.  It happens in an edge case (TRIM of the entirety of a very
> large RAID), but it already caused on user to believe the system
> failed.  I think the system should be more robust than that.

Neil,

As my understand, if a large discard bio received by
raid0_make_request(), for example it requests to discard chunk 1 to 24
on a raid0 device built by 4 SSDs. This large discard bio will be split
and written to each SSD as the following layout,

SSD1: C1,C5,C9,C13,C17,C21
SSD2: C2,C6,C10,C14,C18,C22
SSD3: C3,C7,C11,C15,C19,C23
SSD4: C4,C8,C12,C16,C20,C24

Current raid0 code will call generic_make_request() for 24 times for
each split bio. But it is possible to calculate the final layout of each
split bio, so we can combine all the bios into four per-SSD large bio,
like this,

bio1 (on SSD1): C{1,5,9,13,17,21}
bio2 (on SSD2): C{2,6,10,14,18,22}
bio3 (on SSD3): C{3,7,11,15,19,23}
bio4 (on SSD4): C{4,8,12,16,20,24}

Now we only need to call generic_make_request() for 4 times. Rebuild the
per-device discard bios is more efficient in raid0 code then in block
layer. There are some reasons that I know,
- there are splice timeout, block layer cannot merge all split bio into
one large bio before time out.
- rebuilt per-device bio in raid0 is just by a few calculation, block
layer does merge on queue with list operations, it is slower.
- raid0 code knows its on disk layout, so rebuild per-device bio is
possible here. block layer has no idea on raid0 layout, it can only do
request merge.

Avi,

I compose a prototype patch, the code is not simple, indeed it is quite
complicated IMHO.

I do a little research, some NVMe SSDs support whole device size
DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
bio to block layer. But raid0_make_request() only receives 512KB size
DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
original large bio into 512KB small bios, the limitation is from
q->limits.discard_granularity.

At this moment, I don't know why a q->limits.discard_granularity is
512KB even the underlying SSD supports whole device size discard. We
also need to fix q->limits.discard_granularity, otherwise
block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop
to split the original large discard bio into smaller ones and sends them
to raid0 code by next_bio().

I also CC this email to linux-block@vger.kernel.org to ask for help. My
question is, if a NVMe SSD supports whole-device-size DISCARD, is
q->limits.discard_granularity still necessary ?

Here I also attach my prototype patch as a proof of concept, it is
runnable with Linux 4.9-rc7.

Coly
Subject: optimization for large size DISCARD bio by per-device bios 

This is a very early prototype, still needs more block layer code
modification to make it work.

Current upstream raid0_make_request() only handles TRIM/DISCARD bio by
chunk size, it meams for large raid0 device built by SSDs will call
million times generic_make_request() for the split bio. This patch
tries to combine small bios into large one if they are on same real
device and continuous on this real device, then send the combined large
bio to underlying device by single call to generic_make_request().

For example, use mkfs.xfs to trim a raid0 device built with 4 x 3TB
NVMeSSD, current upstream raid0_make_request() will call
generic_make_request() 5.7 million times, with this patch only 4 calls
to generic_make_request() is required.

This patch won't work in real world, because in block/blk-lib.c:
__blkdev_issue_discard() the original large bio will be split into
smaller ones by restriction of discard_granularity.

If some day SSD supports whole device sized discard_granularity, it
will be very interesting then...

The basic idea is, if a large discard bio received
by raid0_make_request(), for example it requests to discard chunk 1
to 24 on a raid0 device built by 4 SSDs. This large discard bio will
be split and written to each SSD as the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
Current raid0 code will call generic_make_request() for 24 times for
each split bio. But it is possible to calculate the final layout of
each split bio, so we can combine all the bios into four per-SSD large
bio, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() for 4 times.

The code is not simple, I need more time to write text to complain how
it works. Currently you can treat it as a proof of concept.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

raid0 vs. mkfs

Commit Message

Comments

Patch