From patchwork Sun Feb 15 10:40:08 2015
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 5829911
Message-ID: <54E07788.8060907@plexistor.com>
Date: Sun, 15 Feb 2015 12:40:08 +0200
From: Boaz Harrosh
To: "Roger C. Pao (Enmotus)", "linux-nvdimm@lists.01.org", hugh.daschbach@enmotus.com
References: <54DCA0FC.80305@plexistor.com> <54DD586C.3020303@enmotus.com>
In-Reply-To: <54DD586C.3020303@enmotus.com>
Subject: Re: [Linux-nvdimm] One liner patch for prd to avoid "kernel BUG at drivers/block/pmem.c:176!"
List-Id: "Linux-nvdimm developer list."
On 02/13/2015 03:50 AM, Roger C. Pao (Enmotus) wrote:
>
> <> wrote:
>
> The biovec can present a size greater than PAGE_SIZE if an I/O
> buffer contains physically contiguous pages. This may be unusual for
> user-space pages, as the virtual-to-physical memory map gets fragmented.
> But for buffers allocated by the kernel with kmalloc, physical memory
> will be contiguous.
>
> Even if a single I/O request does not contain two contiguous pages,
> the block layer may merge two requests that have contiguous pages. It
> will then attempt to coalesce biovecs. You probably won't see that if
> you avoid the I/O scheduler by capturing requests at make_request. But
> it is still a good idea to declare your device's segment-size limitation
> with blk_queue_max_segment_size. There are a couple of drivers in
> drivers/block/ that do just that to limit segments to PAGE_SIZE.
>

Yes, only I cannot find this code in the block layer, except of course,
as I said, in blk_rq_map_sg(), which only the SCSI layer uses (and maybe
some other block drivers). In fact I know that the kernel goes to great
pains to call split_huge_page() when it needs to swap anonymous huge
pages through the block layer.

The one thing that looks very weird in what you are saying is the
end_io API: it is called per bio, and each bio has a biovec of
single pages. If the block layer goes on a coalescing spree, that would
be unexpected for a caller who expects a notification per bio per page,
for example to call put_page() or unlock_page() on each page. So the
block layer must present back pages one-to-one with the submitted pages.

Are you saying that the block layer has code to coalesce contiguous
pages and then un-coalesce them before the call to end_io? Where is
that information stored? (This is exactly what happens with
blk_rq_map_sg(): the coalescing is done on the extra allocated sg_list,
and the bio is left as-is for end_io time.) I'm not saying it is not
there, only that I cannot find it.

> We caught this because the Enmotus tiering engine issues rather
> large I/O requests to buffers that were allocated with kmalloc. It is
> fairly common for the tiering engine to allocate I/O buffers of 64KB or
> greater. If the underlying block device supports it, we will submit a
> bio with a biovec mapping many contiguous pages. The entire buffer will
> possibly be mapped by a single biovec.

OK, so you have a special kernel module that maps these biovecs not
through the use of bio_add_page(). What API do you use? Do you poke the
biovec members directly?

> The tiering engine uses max_segment_size to determine how to build
> its biovec list.

That said, I don't mind supporting this in pmem. Do you know of an API,
parallel to the currently used kmap_atomic(page), that will return a
valid pointer for more than a PAGE_SIZE?

So OK, I could easily use page_address() plus page_to_pfn()/pfn_to_page(),
something like the sketch below, but that is only good for 64bit. Now we
are already very limited on 32bit platforms anyway (sub-giga-size
NvDIMM), and I do not mind at all limiting pmem.c to compile only on
64bit.
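Something like this completely untested sketch is what I have in mind
(64bit only; pmem_kmap_segment() is a made-up name, not an existing
kernel API):

#include <linux/kconfig.h>
#include <linux/mm.h>
#include <linux/bug.h>

/*
 * Untested sketch: with a flat kernel direct mapping and no highmem,
 * the virtual address of the first page of a pfn-contiguous bvec is
 * valid across the whole multi-page segment.
 */
static inline void *pmem_kmap_segment(struct page *page, unsigned int off)
{
	/* Only correct where all of memory is direct-mapped */
	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));

	return page_address(page) + off;
}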
And actually, on 64bit, kmap_atomic(page) will give me a valid virtual
pointer for larger than PAGE_SIZE, since, as you say, these pages are
pfn-contiguous and the kernel mapping is flat. But what are the APIs
that we are supposed to use with such page pointers that actually point
to more than one page? I do not want to have to crash and then fix this
code for ARCHs other than x86_64.

> Thanks for considering the change.
>
> Hugh
>

Here below is a simple patch for pmem to support such pages on x86_64.
NOTE: This is based on, and is only good for, my tree, with the
simplified code. All we need to do is remove the BUG_ON. [Which is
badly placed anyway, because the limiting condition is the call to
kmap_atomic(page), which is not in the function below but in
pmem_do_bvec(), so the BUG_ON should have been there.]

---
Thanks
Boaz

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 5eced12..26d61da 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -120,7 +120,6 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
-		BUG_ON(len > PAGE_SIZE);
 		pmem_do_bvec(pmem, bvec.bv_page, len, bvec.bv_offset,
 			     rw, sector);
 		sector += len >> SECTOR_SHIFT;
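BTW: if we wanted to go the other way and keep the one-page-per-segment
assumption, that would also be a one-liner, at queue-init time, per your
blk_queue_max_segment_size suggestion. Untested, and pmem->pmem_queue is
the queue field name in my tree:

	/*
	 * Declare the limitation instead of lifting it, so the block
	 * layer never hands us a segment larger than PAGE_SIZE. This
	 * would go next to the blk_queue_make_request() call in the
	 * queue setup path.
	 */
	blk_queue_max_segment_size(pmem->pmem_queue, PAGE_SIZE);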