From patchwork Thu Nov 10 17:00:37 2016
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 9421627
From: Jens Axboe
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH/RFC] mm: don't cap request size based on read-ahead setting
Message-ID: <7d8739c2-09ea-8c1f-cef7-9b8b40766c6a@kernel.dk>
Date: Thu, 10 Nov 2016 10:00:37 -0700

Hi,

We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level. Turns out it is read-ahead capping
the request size, since we use 128K as the default setting. This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting.

To make matters more confusing, there's an odd interaction with the
fadvise hint setting. If we tell the kernel we're doing sequential IO
on this file descriptor, we can get twice the read-ahead size. But if
we tell the kernel that we are doing random IO, hence disabling
read-ahead, we do get nice 256K requests at the lower level. An
application developer will be, rightfully, scratching his head at this
point, wondering wtf is going on. A good one will dive into the kernel
source, and silently weep.

This patch introduces a bdi hint, io_pages. This is the soft max IO
size for the lower level; I've hooked it up to the block device
settings here. Read-ahead is modified to issue the maximum of the user
request size and the read-ahead max size, but capped to the max
request size on the device side. The latter cap avoids reading ahead
too much if the application asks for a huge read. With this patch, the
kernel behaves like the application expects.
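
In page units, the sizing works out to the clamp sketched below. This
is a standalone illustration, not part of the patch; the helper name
ra_max_pages and the constants are mine (4K pages, the 128K read-ahead
default, and a 1280K soft device limit):

#include <stdio.h>

/*
 * Standalone sketch of the sizing logic the patch adds to
 * ondemand_readahead(), in PAGE_SIZE units: take the max of the
 * read-ahead setting and the request size, then clamp to the soft
 * device limit.
 */
static unsigned long ra_max_pages(unsigned long ra_pages,
				  unsigned long req_size,
				  unsigned long io_pages)
{
	unsigned long max = ra_pages > req_size ? ra_pages : req_size;

	/* mirrors min_not_zero(): io_pages == 0 means no device hint */
	if (io_pages && io_pages < max)
		max = io_pages;
	return max;
}

int main(void)
{
	/* 256K read vs 128K read-ahead -> 64 pages (256K), not 32 */
	printf("%lu\n", ra_max_pages(32, 64, 320));
	/* small read -> the usual 32-page (128K) read-ahead window */
	printf("%lu\n", ra_max_pages(32, 1, 320));
	return 0;
}

So a 256K read now produces 256K requests, while small reads keep the
existing 128K read-ahead behavior.
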
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae122843..65f16cf4f850 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
 	max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
 	max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
 	limits->max_sectors = max_sectors;
+	q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9);
 }
 EXPORT_SYMBOL(blk_queue_max_hw_sectors);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c5439a..ea374e820775 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,7 @@ queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 
 	spin_lock_irq(q->queue_lock);
 	q->limits.max_sectors = max_sectors_kb << 1;
+	q->backing_dev_info.io_pages = max_sectors_kb >> (PAGE_SHIFT - 10);
 	spin_unlock_irq(q->queue_lock);
 
 	return ret;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27d5483..b8144b2d59ce 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -136,6 +136,7 @@ struct bdi_writeback {
 struct backing_dev_info {
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
+	unsigned long io_pages;	/* max allowed IO size */
 	unsigned int capabilities; /* Device capabilities */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
diff --git a/mm/readahead.c b/mm/readahead.c
index c8a955b1297e..49515238cdb1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -369,10 +369,18 @@ ondemand_readahead(struct address_space *mapping,
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
-	unsigned long max = ra->ra_pages;
+	unsigned long max_pages;
 	pgoff_t prev_offset;
 
 	/*
+	 * Use the max of the read-ahead pages setting and the requested IO
+	 * size, and then the min of that and the soft IO size for the
+	 * underlying device.
+	 */
+	max_pages = max_t(unsigned long, ra->ra_pages, req_size);
+	max_pages = min_not_zero(inode_to_bdi(mapping->host)->io_pages, max_pages);
+
+	/*
 	 * start of file
 	 */
 	if (!offset)
@@ -385,7 +393,7 @@ ondemand_readahead(struct address_space *mapping,
 	if ((offset == (ra->start + ra->size - ra->async_size) ||
 	     offset == (ra->start + ra->size))) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max);
+		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -400,16 +408,16 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = page_cache_next_hole(mapping, offset + 1, max);
+		start = page_cache_next_hole(mapping, offset + 1, max_pages);
 		rcu_read_unlock();
 
-		if (!start || start - offset > max)
+		if (!start || start - offset > max_pages)
 			return 0;
 
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
 		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max);
+		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -417,7 +425,7 @@ ondemand_readahead(struct address_space *mapping,
 	/*
 	 * oversize read
 	 */
-	if (req_size > max)
+	if (req_size > max_pages)
 		goto initial_readahead;
 
 	/*
@@ -433,7 +441,7 @@ ondemand_readahead(struct address_space *mapping,
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */
-	if (try_context_readahead(mapping, ra, offset, req_size, max))
+	if (try_context_readahead(mapping, ra, offset, req_size, max_pages))
 		goto readit;
 
 	/*
@@ -444,7 +452,7 @@ ondemand_readahead(struct address_space *mapping,
 
 initial_readahead:
 	ra->start = offset;
-	ra->size = get_init_ra_size(req_size, max);
+	ra->size = get_init_ra_size(req_size, max_pages);
 	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
 
 readit:
@@ -454,7 +462,7 @@ ondemand_readahead(struct address_space *mapping,
 	 * the resulted next readahead window into the current one.
 	 */
 	if (offset == ra->start && ra->size == ra->async_size) {
-		ra->async_size = get_next_ra_size(ra, max);
+		ra->async_size = get_next_ra_size(ra, max_pages);
 		ra->size += ra->async_size;
 	}
 
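
For reference, a minimal userspace sketch of the workload described
above (a hypothetical test program, not part of the patch; run it
against a large uncached file and watch the request sizes at the
device with e.g. blktrace):

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE	(256 * 1024)	/* 256K reads, as in the report */

int main(int argc, char **argv)
{
	ssize_t ret;
	char *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Toggle the hint to see the interaction described above:
	 * RANDOM disables read-ahead, SEQUENTIAL doubles the window.
	 */
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	buf = malloc(BUF_SIZE);
	if (!buf) {
		close(fd);
		return 1;
	}

	while ((ret = read(fd, buf, BUF_SIZE)) > 0)
		;

	free(buf);
	close(fd);
	return ret < 0;
}

Pre-patch, the POSIX_FADV_RANDOM variant is, counterintuitively, the
one that yields full 256K requests; with the patch applied, 256K reads
see 256K requests regardless of the hint.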