From patchwork Sun Feb 15 10:40:08 2015
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 5829911
Message-ID: <54E07788.8060907@plexistor.com>
Date: Sun, 15 Feb 2015 12:40:08 +0200
From: Boaz Harrosh
To: "Roger C. Pao (Enmotus)", "linux-nvdimm@lists.01.org", hugh.daschbach@enmotus.com
References: <54DCA0FC.80305@plexistor.com> <54DD586C.3020303@enmotus.com>
In-Reply-To: <54DD586C.3020303@enmotus.com>
Subject: Re: [Linux-nvdimm] One liner patch for prd to avoid "kernel BUG at drivers/block/pmem.c:176!"
List-Id: "Linux-nvdimm developer list."
On 02/13/2015 03:50 AM, Roger C. Pao (Enmotus) wrote:
>
> <> wrote:
>
> The biovec can present a size greater than PAGE_SIZE if an I/O
> buffer contains physically contiguous pages. This may be unusual for
> user-space pages, as the virtual-to-physical memory map gets fragmented.
> But for buffers allocated by the kernel with kmalloc, physical memory
> will be contiguous.
>
> Even if a single I/O request does not contain two contiguous pages,
> the block layer may merge two requests that have contiguous pages. It
> will then attempt to coalesce biovecs. You probably won't see that if
> you avoid the I/O scheduler by capturing requests at make_request. But
> it is still a good idea to declare your device's segment-size limitation
> with blk_queue_max_segment_size. There are a couple of drivers in
> drivers/block/ that do just that to limit segments to PAGE_SIZE.
>

Yes, only I cannot find this code in the block layer, except of course,
as I said, in blk_rq_map_sg(), which only the SCSI layer uses (and maybe
some other block drivers). In fact I know that the kernel goes to great
pains to call split_huge_page() when it needs to swap anonymous huge
pages through the block layer.

The one thing that looks very weird in what you are saying is the
end_io API: it is called per bio, and each bio has a biovec of
single pages. If the block layer goes on a coalescing spree, that would
be unexpected for a caller who expects a notification per bio per page,
for example to call put_page() or unlock_page() on each page. So the
block layer must present back pages one-to-one with the submitted pages.

Are you saying that the block layer has code to coalesce contiguous
pages and then un-coalesce them before the call to end_io? Where is
that information stored? (This is exactly what happens with
blk_rq_map_sg(): the coalescing is done on the extra allocated sg_list,
and the bio is left as-is for end_io time.) I'm not saying it is not
there, only that I cannot find it.

> We caught this because the Enmotus tiering engine issues rather
> large I/O requests to buffers that were allocated with kmalloc. It is
> fairly common for the tiering engine to allocate I/O buffers of 64KB or
> greater. If the underlying block device supports it, we will submit a
> bio with a biovec mapping many contiguous pages. The entire buffer will
> possibly be mapped by a single biovec.

OK, so you have a special kernel module that maps these biovecs not
through the use of bio_add_page(). What API do you use? Do you poke the
biovec members directly?

> The tiering engine uses max_segment_size to determine how to build
> its biovec list.

That said, I don't mind supporting this in pmem. Do you know of an API,
parallel to the currently used kmap_atomic(page), that will return a
valid pointer for more than a PAGE_SIZE?

So OK, I could easily use page_address() plus page_to_pfn()/pfn_to_page(),
something like the sketch below, but that is only good for 64bit. Now we
are already very limited on 32bit platforms anyway (sub-giga-size
NvDIMM), and I do not mind at all limiting pmem.c to compile only on
64bit.
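Something like this completely untested sketch is what I have in mind
(64bit only; pmem_kmap_segment() is a made-up name, not an existing
kernel API):

#include <linux/kconfig.h>
#include <linux/mm.h>
#include <linux/bug.h>

/*
 * Untested sketch: with a flat kernel direct mapping and no highmem,
 * the virtual address of the first page of a pfn-contiguous bvec is
 * valid across the whole multi-page segment.
 */
static inline void *pmem_kmap_segment(struct page *page, unsigned int off)
{
	/* Only correct where all of memory is direct-mapped */
	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));

	return page_address(page) + off;
}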
And actually, on 64bit, kmap_atomic(page) will give me a valid virtual
pointer for larger than PAGE_SIZE, since, as you say, these pages are
pfn-contiguous and the kernel mapping is flat. But what are the APIs
that we are supposed to use with such page pointers that actually point
to more than one page? I do not want to have to crash and then fix this
code for ARCHs other than x86_64.

> Thanks for considering the change.
>
> Hugh
>

Here below is a simple patch for pmem to support such pages on x86_64.
NOTE: This is based on, and is only good for, my tree, with the
simplified code. All we need to do is remove the BUG_ON. [Which is
badly placed anyway, because the limiting condition is the call to
kmap_atomic(page), which is not in the function below but in
pmem_do_bvec(), so the BUG_ON should have been there.]

---
Thanks
Boaz

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 5eced12..26d61da 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -120,7 +120,6 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
-		BUG_ON(len > PAGE_SIZE);
 		pmem_do_bvec(pmem, bvec.bv_page, len, bvec.bv_offset,
 			     rw, sector);
 		sector += len >> SECTOR_SHIFT;
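BTW: if we wanted to go the other way and keep the one-page-per-segment
assumption, that would also be a one-liner, at queue-init time, per your
blk_queue_max_segment_size suggestion. Untested, and pmem->pmem_queue is
the queue field name in my tree:

	/*
	 * Declare the limitation instead of lifting it, so the block
	 * layer never hands us a segment larger than PAGE_SIZE. This
	 * would go next to the blk_queue_make_request() call in the
	 * queue setup path.
	 */
	blk_queue_max_segment_size(pmem->pmem_queue, PAGE_SIZE);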