From patchwork Fri Feb 15 11:13:06 2019
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 10814545
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Theodore Ts'o, Omar Sandoval, Sagi Grimberg,
	Dave Chinner, Kent Overstreet, Mike Snitzer, dm-devel@redhat.com,
	Alexander Viro, linux-fsdevel@vger.kernel.org,
	linux-raid@vger.kernel.org, David Sterba, linux-btrfs@vger.kernel.org,
	"Darrick J. Wong", linux-xfs@vger.kernel.org, Gao Xiang,
	Christoph Hellwig, linux-ext4@vger.kernel.org, Coly Li,
	linux-bcache@vger.kernel.org, Boaz Harrosh, Bob Peterson,
	cluster-devel@redhat.com, Ming Lei
Subject: [PATCH V15 00/18] block: support multi-page bvec
Date: Fri, 15 Feb 2019 19:13:06 +0800
Message-Id: <20190215111324.30129-1-ming.lei@redhat.com>

Hi,

This patchset brings multi-page bvecs into the block layer:

1) what is a multi-page bvec?

A multi-page bvec means that one 'struct bio_vec' can hold multiple
physically contiguous pages, instead of the single page the Linux
kernel has used for a long time.

2) why are multi-page bvecs introduced?

Kent proposed the idea first [1]. As system RAM becomes much bigger
than before, and huge pages, transparent huge pages and memory
compaction are widely used, it is now fairly common to see I/O from
filesystems over physically contiguous pages. From the block layer's
point of view, it isn't necessary to store every intermediate page in
the bvec; it is enough to store just the physically contiguous
'segment' in each io vector.

Also, huge pages are being brought to filesystems and swap [2][6], and
we can do I/O on a huge page each time [3], which requires that one
bio can transfer at least one huge page at a time. It turns out that
simply enlarging BIO_MAX_PAGES is not flexible enough [3][5], while
multi-page bvecs fit this case very well. As we saw, if
CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be configured much
bigger, such as 512, which would require at least two 4K pages just
for holding the bvec table.
With multi-page bvecs:

- Inside the block layer, both bio splitting and sg mapping become
  more efficient than before, by traversing the physically contiguous
  'segment' instead of each page.

- Segment handling in the block layer can be improved much further in
  the future, since converting a multi-page bvec into segments becomes
  quite easy. For example, we might store the segment in each bvec
  directly one day.

- bio size can be increased, which should improve some high-bandwidth
  I/O cases in theory [4].

- There is an opportunity to improve the memory footprint of bvecs in
  the future.

3) how are multi-page bvecs implemented in this patchset?

Patches 1 ~ 3 prepare for supporting multi-page bvecs.

Patches 4 ~ 14 implement multi-page bvecs in the block layer:

- put all the tricks into the bvec/bio/rq iterators; as long as
  drivers and filesystems use these standard iterators, they work
  unchanged with multi-page bvecs

- introduce bio_for_each_bvec() to iterate over multi-page bvecs for
  splitting bios and mapping sg

- keep the current bio_for_each_segment*() to iterate over single-page
  bvecs and make sure current users won't be broken; especially, the
  conversion to this new helper prototype is done in the single patch
  21, given it is basically a mechanical conversion

- deal with iomap & xfs's sub-pagesize io vecs in patch 13

- enable multi-page bvecs in patch 14

Patch 15 redefines BIO_MAX_PAGES as 256.

Patch 16 documents the usage of the bio iterator helpers.

Patches 17 ~ 18 kill NO_SG_MERGE.

These patches can be found in the following git tree:

	git: https://github.com/ming1/linux.git v5.0-blk_mp_bvec_v14

Lots of tests (blktests, xfstests, ltp io, ...) have been run with
this patchset, and no regression was observed.

Thanks to Christoph for reviewing the early versions and providing
very good suggestions, such as introducing bio_init_with_vec_table()
and removing other unnecessary helpers for cleanup.

Thanks to Christoph and Omar for reviewing V10/V11/V12 and providing
lots of helpful comments.
V15:
	- rename bio_for_each_mp_bvec/rq_for_each_mp_bvec as
	  bio_for_each_bvec/rq_for_each_bvec, as suggested by Christoph,
	  so the mp_bvec name is only used by the bvec helpers

V14:
	- drop the patch (patch 4 in V13) renaming bvec helpers, as
	  suggested by Jens
	- use mp_bvec_* as the multi-page bvec helper name
	- fix one build issue caused by a missing conversion of
	  bio_for_each_segment_all in fs/gfs2
	- fix one 32-bit arch specific issue caused by segment boundary
	  mask overflow

V13:
	- rebase on v5.0-rc2
	- address Omar's comment on patch 1 of V12 by using V11's approach
	- rename one local variable in patch 15 as suggested by Christoph

V12:
	- deal with non-cluster via max segment size & segment boundary limit
	- rename the bvec helpers
	- revert the new change on bvec_iter_advance() in V11
	- introduce rq_for_each_bvec()
	- use a simpler check for enabling multi-page bvecs
	- fix Documentation change

V11:
	- address most of the reviews from Omar and Christoph
	- rename mp_bvec_* as segment_* helpers
	- remove the 'mp' parameter from bvec_iter_advance() and related
	  helpers
	- clean up bvec_split_segs() and blk_bio_segment_split(), remove
	  unnecessary checks
	- simplify bvec_last_segment()
	- drop bio_pages_all()
	- introduce dedicated functions/file for handling non-cluster bios,
	  to avoid checking queue cluster before adding a page to a bio
	- introduce bio_try_merge_segment() to simplify the iomap/xfs page
	  accounting code
	- fix Documentation change

V10:
	- no code change at all, just add more people and lists to the
	  patches' CC lists, as suggested by Christoph and Dave Chinner

V9:
	- fix a regression on iomap's sub-pagesize io vecs, covered by
	  patch 13

V8:
	- remove the prepare patches, which are all merged into Linus' tree
	- rebase on for-4.21/block
	- address comments on V7
	- add the patches killing NO_SG_MERGE

V7:
	- include Christoph and Mike's bio_clone_bioset() patches, which
	  are actually prepare patches for multi-page bvecs
	- address Christoph's comments

V6:
	- avoid introducing lots of renaming, following Jens' suggestion
	  of using the name 'chunk' for the multi-page io vector
	- include Christoph's three prepare patches
	- decrease stack usage when using bio_for_each_chunk_segment_all()
	- address Kent's comment

V5:
	- remove some of the prepare patches, which have been merged already
	- add bio_clone_seg_bioset() to fix DM's bio clone, which was
	  introduced by 18a25da84354c6b (dm: ensure bio submission follows
	  a depth-first tree walk)
	- rebase on the latest block for-v4.18

V4:
	- rename bio_for_each_segment*() as bio_for_each_page*(), rename
	  bio_segments() as bio_pages(), rename rq_for_each_segment() as
	  rq_for_each_pages(), because these helpers never return a real
	  segment; they always return a single-page bvec
	- introduce segment_for_each_page_all()
	- introduce new bio_for_each_segment*()/rq_for_each_segment()/
	  bio_segments() for returning real multi-page segments
	- rewrite segment_last_page()
	- rename the bvec iterator helpers as suggested by Christoph
	- replace comments with applied bio helpers as suggested by Christoph
	- document usage of the bio iterator helpers
	- redefine BIO_MAX_PAGES as 256 so that the biggest bvec table
	  fits in a 4K page
	- move bio_alloc_pages() into bcache as suggested by Christoph

V3:
	- rebase on v4.13-rc3 with for-next of the block tree
	- run more xfstests: xfs/ext4 over NVMe, SATA, DM(linear),
	  MD(raid1), and no regressions were triggered
	- add Reviewed-by on some btrfs patches
	- remove two MD patches because both are already merged into
	  Linus' tree

V2:
	- the bvec table direct access in raid has been cleaned up, so
	  the NO_MP flag is dropped
	- rebase on Neil Brown's recent changes on the bio and bounce code
	- reorganize the patchset

V1:
	- against v4.10-rc1; some cleanups from V0 are in -linus already
	- handle queue_virt_boundary() in the mp bvec change and make
	  NVMe happy
	- further BTRFS cleanup
	- remove QUEUE_FLAG_SPLIT_MP
	- rename the two new helpers of bio_for_each_segment_all()
	- fix the bounce conversion
	- address comments on V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2],
https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=149569484500007&r=1&w=2
[6], http://marc.info/?t=149820215300004&r=1&w=2

Christoph Hellwig (1):
  btrfs: look at bi_size for repair decisions

Ming Lei (17):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  block: use bio_for_each_bvec() to compute multi-page bvec count
  block: use bio_for_each_bvec() to map sg
  block: introduce mp_bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: use mp_bvec_last_segment to get bio's last page
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
    bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
    bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 Documentation/block/biovecs.txt   |  25 +++++
 block/bio.c                       |  49 ++++++---
 block/blk-merge.c                 | 210 +++++++++++++++++++++++++-------------
 block/blk-mq-debugfs.c            |   2 -
 block/blk-mq.c                    |   3 -
 block/bounce.c                    |   6 +-
 drivers/block/loop.c              |  22 ++--
 drivers/block/nbd.c               |   2 +-
 drivers/block/rbd.c               |   2 +-
 drivers/block/skd_main.c          |   1 -
 drivers/block/xen-blkfront.c      |   2 +-
 drivers/md/bcache/btree.c         |   3 +-
 drivers/md/bcache/util.c          |   6 +-
 drivers/md/dm-crypt.c             |   3 +-
 drivers/md/dm-rq.c                |   2 +-
 drivers/md/dm-table.c             |  13 ---
 drivers/md/raid1.c                |   3 +-
 drivers/mmc/core/queue.c          |   3 +-
 drivers/scsi/scsi_lib.c           |   2 +-
 drivers/staging/erofs/data.c      |   3 +-
 drivers/staging/erofs/unzip_vle.c |   3 +-
 fs/block_dev.c                    |   6 +-
 fs/btrfs/compression.c            |   3 +-
 fs/btrfs/disk-io.c                |   3 +-
 fs/btrfs/extent_io.c              |  16 +--
 fs/btrfs/inode.c                  |   6 +-
 fs/btrfs/raid56.c                 |   3 +-
 fs/buffer.c                       |   5 +-
 fs/crypto/bio.c                   |   3 +-
 fs/direct-io.c                    |   4 +-
 fs/exofs/ore.c                    |   3 +-
 fs/exofs/ore_raid.c               |   3 +-
 fs/ext4/page-io.c                 |   3 +-
 fs/ext4/readpage.c                |   3 +-
 fs/f2fs/data.c                    |   9 +-
 fs/gfs2/lops.c                    |   9 +-
 fs/gfs2/meta_io.c                 |   3 +-
 fs/iomap.c                        |  10 +-
 fs/mpage.c                        |   3 +-
 fs/xfs/xfs_aops.c                 |   9 +-
 include/linux/bio.h               |  37 ++++---
 include/linux/blk-mq.h            |   1 -
 include/linux/blkdev.h            |   5 +-
 include/linux/bvec.h              | 106 ++++++++++++++-----
 44 files changed, 404 insertions(+), 214 deletions(-)