From patchwork Wed Jul 6 17:44:02 2022
X-Patchwork-Submitter: Mike Snitzer
X-Patchwork-Id: 12908440
From: Mike Snitzer
To: dm-devel@redhat.com
Cc: linux-block@vger.kernel.org
Subject: [5.20 PATCH v3 1/2] dm: add bio_rewind() API to DM core
Date: Wed, 6 Jul 2022 13:44:02 -0400
Message-Id: <20220706174403.79317-2-snitzer@kernel.org>
In-Reply-To: <20220706174403.79317-1-snitzer@kernel.org>
References: <20220706174403.79317-1-snitzer@kernel.org>
X-Mailing-List: linux-block@vger.kernel.org

From: Ming Lei

Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removed a similar
API for the following reasons:

```
It is pointed that bio_rewind_iter() is one very bad API[1]:

1) bio size may not be restored after rewinding

2) it causes some bogus change, such as 5151842b9d8732 (block: reset
   bi_iter.bi_done after splitting bio)

3) rewinding really makes things complicated wrt. bio splitting

4) unnecessary updating of .bi_done in fast path

[1] https://marc.info/?t=153549924200005&r=1&w=2

So this patch takes Kent's suggestion to restore one bio into its original
state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
given now bio_rewind_iter() is only used by bio integrity code.
```

However, saving off a copy of the 32-byte bio->bi_iter in case a rewind is
needed isn't efficient because it bloats per-bio-data for what is an
unlikely case. That suggestion also ignores the need to restore crypto and
integrity info.

Add a bio_rewind() API for a specific use-case that is much more narrow
than the previous, more generic, rewind code that was reverted:

1) most bios have a fixed end sector since bio split is done from the
   front of the bio. If a driver just records how many sectors lie between
   the current bio's start sector and the original bio's end sector, the
   original position can be restored. Keeping the original bio's end
   sector fixed is a _hard_ requirement for this bio_rewind() interface!

2) if a bio's end sector won't change (usually bio_trim() isn't called, or
   in the case of DM it preserves the original bio), the user can restore
   the original position by storing the sector offset from the current
   ->bi_iter.bi_sector to the bio's end sector; together with saving the
   bio size, only 8 bytes are needed to restore to the original bio (see
   the sketch below).

3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core needs
   to restore to an "original bio" which represents the current dm_io to
   be requeued (which may be a subset of the original bio). By storing the
   sector offset from the original bio's end sector and the dm_io's size,
   bio_rewind() can restore such an original bio. See commit 7dd76d1feec7
   ("dm: improve bio splitting and associated IO accounting") for more
   details on how DM does this. Leveraging this allows DM core to shift
   the need for bio cloning from bio-split time (during IO submission) to
   the less likely BLK_STS_DM_REQUEUE handling (after IO completes with
   that error).

4) Unlike the original rewind API, bio_rewind() doesn't add .bi_done to
   bvec_iter and there is no effect on the fast path.

Implement bio_rewind() by factoring out clear helpers that it calls:
bio_integrity_rewind, bio_crypt_rewind and bio_rewind_iter.

DM is able to ensure that bio_rewind() is used safely but, given the
constraint that the bio's end must never change, other hypothetical future
callers may not take the same care. So make bio_rewind() and all
supporting code local to DM to avoid risk of hypothetical abuse.
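As a purely illustrative aside (not part of this patch), the offset
arithmetic in 1) and 2) can be sketched in user-space C. struct toy_iter,
toy_rewind() and all values below are invented stand-ins for struct
bvec_iter and the dm_io bookkeeping, assuming 512-byte sectors:

```
/*
 * Toy model of the 8-byte bookkeeping described above; not kernel code.
 * The end sector of the original bio is fixed, so recording the offset
 * from a dm_io's start to that end (sector_offset) plus the dm_io's
 * length (sectors) is enough to restore the dm_io's position later.
 */
#include <assert.h>
#include <stdint.h>

struct toy_iter {               /* stands in for struct bvec_iter */
        uint64_t bi_sector;     /* current start sector (512-byte units) */
        uint32_t bi_size;       /* remaining bytes up to the fixed end sector */
};

static void toy_rewind(struct toy_iter *it, uint32_t bytes)
{
        it->bi_sector -= bytes >> 9;
        it->bi_size += bytes;
}

int main(void)
{
        /* Original bio covers sectors [1000, 1128); end sector 1128 is fixed. */
        struct toy_iter it = { .bi_sector = 1000, .bi_size = 128 << 9 };

        /* DM maps the first 32 sectors to a dm_io and records 8 bytes: */
        uint32_t sector_offset = 128;   /* end (1128) - dm_io start (1000) */
        uint32_t sectors = 32;          /* length mapped to this dm_io */

        /* After the split, the bio was trimmed to the remainder [1032, 1128). */
        it.bi_sector = 1032;
        it.bi_size = 96 << 9;

        /* Rewind from the fixed end back to the dm_io's start... */
        toy_rewind(&it, (sector_offset << 9) - it.bi_size);
        assert(it.bi_sector == 1000);

        /* ...then restore the size separately (DM uses bio_trim() for that). */
        it.bi_size = sectors << 9;
        assert(it.bi_size == (32 << 9));
        return 0;
}
```

Because the end sector never moves, sector_offset and sectors (8 bytes
total) are enough to recover both the start sector and the size of the
part being restored.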
Suggested-by: Jens Axboe
Signed-off-by: Ming Lei
Signed-off-by: Mike Snitzer
---
 drivers/md/Makefile       |   2 +-
 drivers/md/dm-core.h      |   2 +
 drivers/md/dm-io-rewind.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 drivers/md/dm-io-rewind.c

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 0454b0885b01..270f694850ec 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -5,7 +5,7 @@
 dm-mod-y += dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
             dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o \
-            dm-rq.o
+            dm-rq.o dm-io-rewind.o
 dm-multipath-y += dm-path-selector.o dm-mpath.o
 dm-historical-service-time-y += dm-ps-historical-service-time.o
 dm-io-affinity-y += dm-ps-io-affinity.o
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5d9afca0d105..5793a27b2118 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -319,4 +319,6 @@ extern atomic_t dm_global_event_nr;
 extern wait_queue_head_t dm_global_eventq;
 void dm_issue_global_event(void);
 
+void bio_rewind(struct bio *bio, unsigned bytes);
+
 #endif
diff --git a/drivers/md/dm-io-rewind.c b/drivers/md/dm-io-rewind.c
new file mode 100644
index 000000000000..fbeaa8a342ed
--- /dev/null
+++ b/drivers/md/dm-io-rewind.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2022 Red Hat, Inc.
+ */
+
+#include <linux/bio.h>
+#include <linux/blk-crypto.h>
+#include <linux/blk-integrity.h>
+
+#include "dm-core.h"
+
+static inline bool bvec_iter_rewind(const struct bio_vec *bv,
+                                    struct bvec_iter *iter,
+                                    unsigned int bytes)
+{
+        int idx;
+
+        iter->bi_size += bytes;
+        if (bytes <= iter->bi_bvec_done) {
+                iter->bi_bvec_done -= bytes;
+                return true;
+        }
+
+        bytes -= iter->bi_bvec_done;
+        idx = iter->bi_idx - 1;
+
+        while (idx >= 0 && bytes && bytes > bv[idx].bv_len) {
+                bytes -= bv[idx].bv_len;
+                idx--;
+        }
+
+        if (WARN_ONCE(idx < 0 && bytes,
+                      "Attempted to rewind iter beyond bvec's boundaries\n")) {
+                iter->bi_size -= bytes;
+                iter->bi_bvec_done = 0;
+                iter->bi_idx = 0;
+                return false;
+        }
+
+        iter->bi_idx = idx;
+        iter->bi_bvec_done = bv[idx].bv_len - bytes;
+        return true;
+}
+
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:        bio whose integrity vector to update
+ * @bytes_done: number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes corresponds to and rewinds the
+ * integrity vector accordingly.
+ */
+static void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+        struct bio_integrity_payload *bip = bio_integrity(bio);
+        struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+        unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+        bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+        bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+
+static inline void bio_integrity_rewind(struct bio *bio,
+                                        unsigned int bytes_done)
+{
+        return;
+}
+
+#endif
+
+#if defined(CONFIG_BLK_INLINE_ENCRYPTION)
+
+/* Decrements @dun by @dec, treating @dun as a multi-limb integer.
+ */
+static void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
+                                    unsigned int dec)
+{
+        int i;
+
+        for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
+                u64 prev = dun[i];
+
+                dun[i] -= dec;
+                if (dun[i] > prev)
+                        dec = 1;
+                else
+                        dec = 0;
+        }
+}
+
+static void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+        struct bio_crypt_ctx *bc = bio->bi_crypt_context;
+
+        bio_crypt_dun_decrement(bc->bc_dun,
+                                bytes >> bc->bc_key->data_unit_size_bits);
+}
+
+#else /* CONFIG_BLK_INLINE_ENCRYPTION */
+
+static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+        return;
+}
+
+#endif
+
+static inline void bio_rewind_iter(const struct bio *bio,
+                                   struct bvec_iter *iter, unsigned int bytes)
+{
+        iter->bi_sector -= bytes >> 9;
+
+        /* No advance means no rewind */
+        if (bio_no_advance_iter(bio))
+                iter->bi_size += bytes;
+        else
+                bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+}
+
+/**
+ * bio_rewind - update ->bi_iter of @bio by rewinding @bytes.
+ * @bio: bio to rewind
+ * @bytes: how many bytes to rewind
+ *
+ * WARNING:
+ * Caller must ensure that @bio has a fixed end sector, to allow
+ * rewinding from end of bio and restoring its original position.
+ * Caller is also responsible for restoring bio's size.
+ */
+void bio_rewind(struct bio *bio, unsigned bytes)
+{
+        if (bio_integrity(bio))
+                bio_integrity_rewind(bio, bytes);
+
+        if (bio_has_crypt_ctx(bio))
+                bio_crypt_rewind(bio, bytes);
+
+        bio_rewind_iter(bio, &bio->bi_iter, bytes);
+}
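As a purely illustrative aside (not part of either patch): the borrow
propagation in bio_crypt_dun_decrement() above can be checked with a small
user-space model. DUN_LIMBS and the test values are invented; only the
subtract-and-borrow logic mirrors the function:

```
/*
 * User-space check of the multi-limb borrow logic used by
 * bio_crypt_dun_decrement(); DUN_LIMBS stands in for
 * BLK_CRYPTO_DUN_ARRAY_SIZE and the values are made up.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DUN_LIMBS 4

static void dun_decrement(uint64_t dun[DUN_LIMBS], unsigned int dec)
{
        for (int i = 0; dec && i < DUN_LIMBS; i++) {
                uint64_t prev = dun[i];

                dun[i] -= dec;
                /* Wraparound in this limb means borrowing 1 from the next. */
                dec = (dun[i] > prev) ? 1 : 0;
        }
}

int main(void)
{
        /* 2^64 + 1 represented as limbs: low = 1, next = 1. */
        uint64_t dun[DUN_LIMBS] = { 1, 1, 0, 0 };

        dun_decrement(dun, 2);  /* (2^64 + 1) - 2 = 2^64 - 1 */
        assert(dun[0] == UINT64_MAX && dun[1] == 0);
        printf("low limb = %llu, high limb = %llu\n",
               (unsigned long long)dun[0], (unsigned long long)dun[1]);
        return 0;
}
```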
From patchwork Wed Jul 6 17:44:03 2022
X-Patchwork-Submitter: Mike Snitzer
X-Patchwork-Id: 12908441
From: Mike Snitzer
To: dm-devel@redhat.com
Cc: linux-block@vger.kernel.org
Subject: [5.20 PATCH v3 2/2] dm: add two stage requeue mechanism
Date: Wed, 6 Jul 2022 13:44:03 -0400
Message-Id: <20220706174403.79317-3-snitzer@kernel.org>
In-Reply-To: <20220706174403.79317-1-snitzer@kernel.org>
References: <20220706174403.79317-1-snitzer@kernel.org>
X-Mailing-List: linux-block@vger.kernel.org

From: Ming Lei

Commit 61b6e2e5321d ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io
represents split bio") reverted DM core's bio splitting back to using
bio_split()+bio_chain() because it was found that otherwise DM's
BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio completion
that would never occur.

Restore using bio_trim()+bio_inc_remaining(), as was done in commit
7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting"),
but this time with proper handling for the above scenario, which is
covered in more detail in the commit header for 61b6e2e5321d.

Solve this issue by adding a two-stage dm_io requeue mechanism that uses
the new bio_rewind() via dm_io_rewind():

1) requeue the dm_io onto the requeue_list added to struct mapped_device,
   and schedule it via the newly added requeue work. This work item just
   clones dm_io->orig_bio (which DM saves and ensures its end sector isn't
   modified). dm_io_rewind() uses the sectors and sector_offset members of
   the dm_io that are recorded relative to the end of orig_bio:
   bio_rewind()+bio_trim() are then used to make that cloned bio reflect
   the subset of the original bio that is represented by the dm_io being
   requeued (see the sketch below for the list handling).

2) the second-stage requeue is the same as the original requeue, but
   io->orig_bio points to the new cloned bio (which matches the requeued
   dm_io as described above).

This allows DM core to shift the need for bio cloning from bio-split time
(during IO submission) to the less likely BLK_STS_DM_REQUEUE handling
(after IO completes with that error).
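As a purely illustrative aside (not DM code): the first-stage handling in
1) boils down to pushing the dm_io onto a singly linked list under a lock
and letting a work handler detach and walk that list. A user-space model
of that pattern, with invented names (toy_io, toy_md) and a pthread mutex
standing in for the kernel's spinlock and workqueue:

```
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_io {
        int id;
        struct toy_io *next;
};

struct toy_md {
        pthread_mutex_t lock;
        struct toy_io *requeue_list;
};

static void toy_requeue_add(struct toy_md *md, struct toy_io *io)
{
        pthread_mutex_lock(&md->lock);
        io->next = md->requeue_list;    /* push-front, like dm_requeue_add_io() */
        md->requeue_list = io;
        pthread_mutex_unlock(&md->lock);
}

static void toy_requeue_work(struct toy_md *md)
{
        pthread_mutex_lock(&md->lock);
        struct toy_io *io = md->requeue_list;   /* detach the whole list */
        md->requeue_list = NULL;
        pthread_mutex_unlock(&md->lock);

        while (io) {
                struct toy_io *next = io->next;

                io->next = NULL;
                printf("second-stage requeue of io %d\n", io->id);
                free(io);
                io = next;
        }
}

int main(void)
{
        struct toy_md md = { .lock = PTHREAD_MUTEX_INITIALIZER };

        for (int i = 0; i < 3; i++) {
                struct toy_io *io = calloc(1, sizeof(*io));

                io->id = i;
                toy_requeue_add(&md, io);
        }
        toy_requeue_work(&md);
        return 0;
}
```

The real handler additionally calls dm_io_rewind() on each entry before
completing it, as the diff below shows.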
Signed-off-by: Ming Lei
Signed-off-by: Mike Snitzer
---
 drivers/md/dm-core.h      |  13 ++++-
 drivers/md/dm-io-rewind.c |  25 +++++++-
 drivers/md/dm.c           | 121 +++++++++++++++++++++++++++++++++-----------
 3 files changed, 129 insertions(+), 30 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5793a27b2118..5a3fe5897b1e 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -22,6 +22,8 @@
 #define DM_RESERVED_MAX_IOS 1024
 
+struct dm_io;
+
 struct dm_kobject_holder {
         struct kobject kobj;
         struct completion completion;
@@ -91,6 +93,14 @@ struct mapped_device {
         spinlock_t deferred_lock;
         struct bio_list deferred;
 
+        /*
+         * requeue work context is needed for cloning one new bio
+         * to represent the dm_io to be requeued, since each
+         * dm_io may point to the original bio from FS.
+         */
+        struct work_struct requeue_work;
+        struct dm_io *requeue_list;
+
         void *interface_ptr;
 
         /*
@@ -275,7 +285,6 @@ struct dm_io {
         atomic_t io_count;
         struct mapped_device *md;
-        struct bio *split_bio;
         /* The three fields represent mapped part of original bio */
         struct bio *orig_bio;
         unsigned int sector_offset; /* offset to end of orig_bio */
@@ -319,6 +328,6 @@ extern atomic_t dm_global_event_nr;
 extern wait_queue_head_t dm_global_eventq;
 void dm_issue_global_event(void);
 
-void bio_rewind(struct bio *bio, unsigned bytes);
+void dm_io_rewind(struct dm_io *io, struct bio_set *bs);
 
 #endif
diff --git a/drivers/md/dm-io-rewind.c b/drivers/md/dm-io-rewind.c
index fbeaa8a342ed..3ba7162f85fa 100644
--- a/drivers/md/dm-io-rewind.c
+++ b/drivers/md/dm-io-rewind.c
@@ -131,7 +131,7 @@ static inline void bio_rewind_iter(const struct bio *bio,
  * rewinding from end of bio and restoring its original position.
  * Caller is also responsible for restoring bio's size.
  */
-void bio_rewind(struct bio *bio, unsigned bytes)
+static void bio_rewind(struct bio *bio, unsigned bytes)
 {
         if (bio_integrity(bio))
                 bio_integrity_rewind(bio, bytes);
@@ -141,3 +141,26 @@ void bio_rewind(struct bio *bio, unsigned bytes)
 
         bio_rewind_iter(bio, &bio->bi_iter, bytes);
 }
+
+void dm_io_rewind(struct dm_io *io, struct bio_set *bs)
+{
+        struct bio *orig = io->orig_bio;
+        struct bio *new_orig = bio_alloc_clone(orig->bi_bdev, orig,
+                                               GFP_NOIO, bs);
+        /*
+         * bio_rewind can restore to previous position since the end
+         * sector is fixed for original bio, but we still need to
+         * restore bio's size manually (using io->sectors).
+         */
+        bio_rewind(new_orig, ((io->sector_offset << 9) -
+                              orig->bi_iter.bi_size));
+        bio_trim(new_orig, 0, io->sectors);
+
+        bio_chain(new_orig, orig);
+        /*
+         * __bi_remaining was increased (by dm_split_and_process_bio),
+         * so must drop the one added in bio_chain.
+         */
+        atomic_dec(&orig->__bi_remaining);
+        io->orig_bio = new_orig;
+}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c987f9ad24a4..fa6839141118 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -590,7 +590,6 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
         atomic_set(&io->io_count, 2);
         this_cpu_inc(*md->pending_io);
         io->orig_bio = bio;
-        io->split_bio = NULL;
         io->md = md;
         spin_lock_init(&io->lock);
         io->start_time = jiffies;
@@ -880,13 +879,35 @@ static int __noflush_suspending(struct mapped_device *md)
         return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
+static void dm_requeue_add_io(struct dm_io *io, bool first_stage)
+{
+        struct mapped_device *md = io->md;
+
+        if (first_stage) {
+                struct dm_io *next = md->requeue_list;
+
+                md->requeue_list = io;
+                io->next = next;
+        } else {
+                bio_list_add_head(&md->deferred, io->orig_bio);
+        }
+}
+
+static void dm_kick_requeue(struct mapped_device *md, bool first_stage)
+{
+        if (first_stage)
+                queue_work(md->wq, &md->requeue_work);
+        else
+                queue_work(md->wq, &md->work);
+}
+
 /*
  * Return true if the dm_io's original bio is requeued.
  * io->status is updated with error if requeue disallowed.
  */
-static bool dm_handle_requeue(struct dm_io *io)
+static bool dm_handle_requeue(struct dm_io *io, bool first_stage)
 {
-        struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+        struct bio *bio = io->orig_bio;
         bool handle_requeue = (io->status == BLK_STS_DM_REQUEUE);
         bool handle_polled_eagain = ((io->status == BLK_STS_AGAIN) &&
                                      (bio->bi_opf & REQ_POLLED));
@@ -912,8 +933,8 @@ static bool dm_handle_requeue(struct dm_io *io)
                 spin_lock_irqsave(&md->deferred_lock, flags);
                 if ((__noflush_suspending(md) &&
                      !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
-                    handle_polled_eagain) {
-                        bio_list_add_head(&md->deferred, bio);
+                    handle_polled_eagain || first_stage) {
+                        dm_requeue_add_io(io, first_stage);
                         requeued = true;
                 } else {
                         /*
@@ -926,19 +947,21 @@ static bool dm_handle_requeue(struct dm_io *io)
         }
 
         if (requeued)
-                queue_work(md->wq, &md->work);
+                dm_kick_requeue(md, first_stage);
 
         return requeued;
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void __dm_io_complete(struct dm_io *io, bool first_stage)
 {
-        struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+        struct bio *bio = io->orig_bio;
         struct mapped_device *md = io->md;
         blk_status_t io_error;
         bool requeued;
 
-        requeued = dm_handle_requeue(io);
+        requeued = dm_handle_requeue(io, first_stage);
+        if (requeued && first_stage)
+                return;
 
         io_error = io->status;
         if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -978,6 +1001,58 @@ static void dm_io_complete(struct dm_io *io)
         }
 }
 
+static void dm_wq_requeue_work(struct work_struct *work)
+{
+        struct mapped_device *md = container_of(work, struct mapped_device,
+                                                requeue_work);
+        unsigned long flags;
+        struct dm_io *io;
+
+        /* reuse deferred lock to simplify dm_handle_requeue */
+        spin_lock_irqsave(&md->deferred_lock, flags);
+        io = md->requeue_list;
+        md->requeue_list = NULL;
+        spin_unlock_irqrestore(&md->deferred_lock, flags);
+
+        while (io) {
+                struct dm_io *next = io->next;
+
+                dm_io_rewind(io, &md->queue->bio_split);
+
+                io->next = NULL;
+                __dm_io_complete(io, false);
+                io = next;
+        }
+}
+
+/*
+ * Two staged requeue:
+ *
+ * 1) io->orig_bio points to the real original bio, and the part mapped to
+ *    this io must be requeued, instead of other parts of the original bio.
+ *
+ * 2) io->orig_bio points to new cloned bio which matches the requeued dm_io.
+ */
+static void dm_io_complete(struct dm_io *io)
+{
+        bool first_requeue;
+
+        /*
+         * Only dm_io that has been split needs two stage requeue, otherwise
+         * we may run into long bio clone chain during suspend and OOM could
+         * be triggered.
+         *
+         * Also flush data dm_io won't be marked as DM_IO_WAS_SPLIT, so they
+         * also aren't handled via the first stage requeue.
+         */
+        if (dm_io_flagged(io, DM_IO_WAS_SPLIT))
+                first_requeue = true;
+        else
+                first_requeue = false;
+
+        __dm_io_complete(io, first_requeue);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1256,6 +1331,7 @@ static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
 void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
 {
         struct dm_target_io *tio = clone_to_tio(bio);
+        struct dm_io *io = tio->io;
         unsigned bio_sectors = bio_sectors(bio);
 
         BUG_ON(dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
@@ -1271,8 +1347,9 @@ void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
          * __split_and_process_bio() may have already saved mapped part
          * for accounting but it is being reduced so update accordingly.
          */
-        dm_io_set_flag(tio->io, DM_IO_WAS_SPLIT);
-        tio->io->sectors = n_sectors;
+        dm_io_set_flag(io, DM_IO_WAS_SPLIT);
+        io->sectors = n_sectors;
+        io->sector_offset = bio_sectors(io->orig_bio);
 }
 EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
 
@@ -1395,17 +1472,7 @@ static void setup_split_accounting(struct clone_info *ci, unsigned len)
                  */
                 dm_io_set_flag(io, DM_IO_WAS_SPLIT);
                 io->sectors = len;
-        }
-
-        if (static_branch_unlikely(&stats_enabled) &&
-            unlikely(dm_stats_used(&io->md->stats))) {
-                /*
-                 * Save bi_sector in terms of its offset from end of
-                 * original bio, only needed for DM-stats' benefit.
-                 * - saved regardless of whether split needed so that
-                 *   dm_accept_partial_bio() doesn't need to.
-                 */
-                io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
+                io->sector_offset = bio_sectors(ci->bio);
         }
 }
 
@@ -1705,11 +1772,9 @@ static void dm_split_and_process_bio(struct mapped_device *md,
          * Remainder must be passed to submit_bio_noacct() so it gets handled
          * *after* bios already submitted have been completely processed.
          */
-        WARN_ON_ONCE(!dm_io_flagged(io, DM_IO_WAS_SPLIT));
-        io->split_bio = bio_split(bio, io->sectors, GFP_NOIO,
-                                  &md->queue->bio_split);
-        bio_chain(io->split_bio, bio);
-        trace_block_split(io->split_bio, bio->bi_iter.bi_sector);
+        bio_trim(bio, io->sectors, ci.sector_count);
+        trace_block_split(bio, bio->bi_iter.bi_sector);
+        bio_inc_remaining(bio);
         submit_bio_noacct(bio);
 out:
         /*
@@ -1985,9 +2050,11 @@ static struct mapped_device *alloc_dev(int minor)
         init_waitqueue_head(&md->wait);
         INIT_WORK(&md->work, dm_wq_work);
+        INIT_WORK(&md->requeue_work, dm_wq_requeue_work);
         init_waitqueue_head(&md->eventq);
         init_completion(&md->kobj_holder.completion);
 
+        md->requeue_list = NULL;
         md->swap_bios = get_swap_bios();
         sema_init(&md->swap_bios_semaphore, md->swap_bios);
         mutex_init(&md->swap_bios_lock);