Message ID: alpine.LRH.2.02.1708030938110.21391@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State: New, archived
On Thu, Aug 03, 2017 at 10:10:55AM -0400, Mikulas Patocka wrote:
> That dm-crypt commit that uses bio integrity payload came 3 months before
> 7c20f11680a441df09de7235206f70115fbf6290 and it was already present in
> 4.12.

And on its own that isn't an argument if your usage is indeed wrong, and
that's why we are having this discussion.

> The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that
> bio->bi_end_io is not changed and bio_integrity_endio is called directly
> from bio_endio if the bio has an integrity payload. The unintended
> consequence of this change is that when a request with an integrity
> payload is finished and bio_endio is called for each cloned bio, the
> verification routine bio_integrity_verify_fn is called for every level
> in the stack (it used to be called only for the lowest level in 4.12).
> At different levels in the stack, the bio may have different bi_sector
> values, so the repeated verification can't succeed.

We can simply add another bio flag to get back to the previous behavior.
That being said, the right thing to do is to eventually verify it at the
top of the stack, and not at the bottom. I can cook up a patch for that.

> I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be
> reverted and it should be reworked for the next merge window, so that it
> calls bio_integrity_verify_fn only for the lowest level.

And with this you will just reintroduce the memory leak for on-stack bios
we've fixed.

I'll take a look at not calling the verify function for every level, which
is wrong, and in the meantime we can discuss the uses and abuses of the
bio integrity code by dm in the separate subthread.
On Sat, 5 Aug 2017, Christoph Hellwig wrote:

> On Thu, Aug 03, 2017 at 10:10:55AM -0400, Mikulas Patocka wrote:
> > That dm-crypt commit that uses bio integrity payload came 3 months before
> > 7c20f11680a441df09de7235206f70115fbf6290 and it was already present in
> > 4.12.
>
> And on its own that isn't an argument if your usage is indeed wrong,
> and that's why we are having this discussion.
>
> > The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that
> > bio->bi_end_io is not changed and bio_integrity_endio is called directly
> > from bio_endio if the bio has an integrity payload. The unintended
> > consequence of this change is that when a request with an integrity
> > payload is finished and bio_endio is called for each cloned bio, the
> > verification routine bio_integrity_verify_fn is called for every level
> > in the stack (it used to be called only for the lowest level in 4.12).
> > At different levels in the stack, the bio may have different bi_sector
> > values, so the repeated verification can't succeed.
>
> We can simply add another bio flag to get back to the previous behavior.
> That being said, the right thing to do is to eventually verify it at the
> top of the stack, and not at the bottom. I can cook up a patch for that.

The sector number in the integrity tag must match the physical sector
number, so it must be verified at the bottom. If the sector number in the
integrity tag matched the logical sector number in the logical volume, it
would create a lot of problems - for example, it would be impossible to
move the logical volume, and it would be impossible to copy the full
physical volume.

> > I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be
> > reverted and it should be reworked for the next merge window, so that it
> > calls bio_integrity_verify_fn only for the lowest level.
>
> And with this you will just reintroduce the memory leak for
> on-stack bios we've fixed.
>
> I'll take a look at not calling the verify function for every level,
> which is wrong, and in the meantime we can discuss the uses and abuses
> of the bio integrity code by dm in the separate subthread.

It would be useful if dm-crypt could put the authenticated tag directly
into the integrity data. But for that usage, the lowest-level physical
driver must not generate or verify the integrity data - it must just read
and write them.

Mikulas
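[What Mikulas describes amounts to a pass-through integrity profile. A
minimal sketch, assuming a custom blk_integrity_profile whose name and
hooks are illustrative rather than in-tree code:

#include <linux/blkdev.h>

/* Hypothetical pass-through profile: the block layer neither generates
 * nor verifies the per-sector tag bytes, it only carries them to and
 * from the media.  dm-crypt alone computes and checks the auth tag.
 */
static blk_status_t opaque_tag_nop(struct blk_integrity_iter *iter)
{
        return BLK_STS_OK;      /* leave the tag bytes untouched */
}

static const struct blk_integrity_profile dm_opaque_tag_profile = {
        .name           = "DM-OPAQUE-TAG",      /* hypothetical name */
        .generate_fn    = opaque_tag_nop,       /* don't synthesize tags */
        .verify_fn      = opaque_tag_nop,       /* don't check tags either */
};

The point of the nop hooks is that only the profile owner, i.e. the layer
holding the keys, is in a position to interpret the tag bytes at all.]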
Christoph,

> We can simply add another bio flag to get back to the previous behavior.
> That being said, the right thing to do is to eventually verify it at the
> top of the stack, and not at the bottom. I can cook up a patch for that.

Yeah, the original code was careful about only adding the verification
hook to the top bio.

A bio flag is probably the path of least resistance. It already exists,
actually: BIP_BLOCK_INTEGRITY. But we'll need to make sure it gets masked
out when a bio is cloned. And then we can key off of that and REQ_OP_READ
in the endio function.

I prefer that approach to reverting Christoph's commit.
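[A rough sketch of the check Martin proposes; the helper name is made up,
while bip_flags, BIP_BLOCK_INTEGRITY and REQ_OP_READ are the existing
identifiers:

#include <linux/bio.h>

/* Only verify PI for successful reads whose integrity payload was
 * attached by the block layer itself (bio_integrity_prep()), i.e. the
 * top bio - clones would have BIP_BLOCK_INTEGRITY masked out.
 */
static bool bio_should_verify_integrity(struct bio *bio)
{
        struct bio_integrity_payload *bip = bio_integrity(bio);

        return bip && (bip->bip_flags & BIP_BLOCK_INTEGRITY) &&
               bio_op(bio) == REQ_OP_READ && !bio->bi_status;
}]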
Mikulas,

> The sector number in the integrity tag must match the physical sector
> number, so it must be verified at the bottom.

The ref tag seed matches the submitter block number (typically the block
layer sector for the top device) and is remapped to and from the LBA by
the SCSI disk driver or HBA firmware.

So the verification is designed to be done by the top-level entity that
attaches the protection information to the bio. In this case
bio_integrity_prep().
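[As an illustration of that contract (a hypothetical checker, not the sd
driver): the ref tags in the PI buffer count up from the submitter's
seed, one per protection interval, so the bottom of the stack can verify
them against whatever seed arrived with the bio and then rewrite them to
target LBAs:

#include <linux/t10-pi.h>

/* Hypothetical bottom-of-stack check: ref tags are numbered from the
 * submitter's seed, not from LBA 0.
 */
static bool ref_tags_match_seed(const struct t10_pi_tuple *pi,
                                u32 seed, unsigned int intervals)
{
        unsigned int i;

        for (i = 0; i < intervals; i++)
                if (be32_to_cpu(pi[i].ref_tag) != seed + i)
                        return false;
        return true;
}]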
On Sat, 5 Aug 2017, Martin K. Petersen wrote:

> Mikulas,
>
> > The sector number in the integrity tag must match the physical sector
> > number, so it must be verified at the bottom.
>
> The ref tag seed matches the submitter block number (typically the block
> layer sector for the top device) and is remapped to and from the LBA by
> the SCSI disk driver or HBA firmware.
>
> So the verification is designed to be done by the top-level entity that
> attaches the protection information to the bio. In this case
> bio_integrity_prep().

bio_integrity_prep is called by blk_queue_bio - that is the lowest-level
physical disk, not the top level. It is not called by device mapper.
Documentation/block/data-integrity.txt says that bio_integrity_prep() can
be called by an integrity-aware filesystem, but it seems that no such
filesystem exists.

If you create the integrity tag at or above the device mapper level, you
will run into problems, because the same device can be accessed both
through device mapper and directly through the physical volume /dev/sd*.
Integrity tags created at the device mapper level will contain device
mapper's logical sector number, and that sector number won't match when
the device is accessed directly via /dev/sd*.

> --
> Martin K. Petersen	Oracle Linux Engineering

Mikulas
Mikulas,

> If you create the integrity tag at or above the device mapper level, you
> will run into problems, because the same device can be accessed both
> through device mapper and directly through the physical volume /dev/sd*.
> Integrity tags created at the device mapper level will contain device
> mapper's logical sector number, and that sector number won't match when
> the device is accessed directly via /dev/sd*.

For writes, the bip seed value is adjusted every time a bio is cloned,
sliced and diced as it traverses partitioning/MD/DM. Then, at the bottom
of the stack, the ref tag values in the protection information buffer are
verified against the adjusted seed value and remapped according to the
starting logical block number. The reverse takes place for reads.

This is much faster than remapping the actual protection buffer values
every time the I/O transitions from one address space to another. In
addition, some HBA hardware allows us to program the PI engine with the
seed value, so the submitter-value-to-LBA conversion can be done on the
fly in hardware.
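[In code terms, the adjustment Martin describes is an O(1) move of the
seed; a minimal sketch with a hypothetical helper (bip_iter.bi_sector is
the real seed field):

#include <linux/bio.h>

/* Hypothetical remap helper: a stacking driver that shifts a bio by
 * 'offset' sectors only moves the seed.  The ref tags stored in the PI
 * buffer are never rewritten here; they are reconciled against the
 * final seed at the bottom of the stack.
 */
static void bip_remap_seed(struct bio *bio, sector_t offset)
{
        struct bio_integrity_payload *bip = bio_integrity(bio);

        if (bip)
                bip->bip_iter.bi_sector += offset;
}]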
Hi,

On 08/07/2017 05:48 PM, Martin K. Petersen wrote:
>
>> If you create the integrity tag at or above the device mapper level,
>> you will run into problems, because the same device can be accessed
>> both through device mapper and directly through the physical volume
>> /dev/sd*. Integrity tags created at the device mapper level will
>> contain device mapper's logical sector number, and that sector number
>> won't match when the device is accessed directly via /dev/sd*.
>
> For writes, the bip seed value is adjusted every time a bio is cloned,
> sliced and diced as it traverses partitioning/MD/DM. Then, at the bottom
> of the stack, the ref tag values in the protection information buffer
> are verified against the adjusted seed value and remapped according to
> the starting logical block number. The reverse takes place for reads.
>
> This is much faster than remapping the actual protection buffer values
> every time the I/O transitions from one address space to another. In
> addition, some HBA hardware allows us to program the PI engine with the
> seed value, so the submitter-value-to-LBA conversion can be done on the
> fly in hardware.

Yes, this is how enterprise storage works and how PI was designed.

For authenticated encryption with additional data (AEAD), we have to
authenticate the sector number as well (as part of the additional data -
the part that is authenticated but not encrypted). Unfortunately, the
whole design cannot work this way if the AEAD is implemented at some
higher layer. (A higher layer has the big advantage that you do not need
to trust the underlying storage stack, because you will detect even
intentional tampering with the data.)

So only the layer that owns the keys (here dm-crypt) can calculate or
verify the auth tags (and then decrypt the data). Nobody else can "remap"
the sector in the tag - without the key you cannot recalculate the auth
tags.

In other words, we just treat the whole additional data (delivered in
bio-integrity) as an "application tag" (in DIF terminology).

Anyway, could we please fix the 4.13 kernel now so that it does not crash
with this dm-integrity use? This is urgent for people who already use
dm-integrity from 4.12.

As for the follow-up discussion, I would really welcome seeing what
exactly is wrong, because I think we used the bio-integrity interface
through the existing API and according to the documentation. We are just,
unfortunately, the first user with our own profile (different from the
ones defined in T10). And I think it can also help people who will try to
implement a bio-integrity-aware filesystem in the future.

Thanks,
Milan
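[A conceptual sketch of the AEAD framing Milan describes; field names and
sizes here are illustrative, not dm-crypt's on-disk format:

#include <linux/types.h>

/* Illustrative per-sector AEAD framing (not dm-crypt code):
 *
 *   AAD        = sector number (+ IV)  - authenticated, not encrypted
 *   ciphertext = encrypted sector data - authenticated and encrypted
 *   auth tag   = carried via bio-integrity as an opaque application tag
 *
 * Because the sector number is covered by the tag, nothing below the
 * key-holding layer can "remap" a sector and recompute a valid tag.
 */
struct aead_sector_layout {
        __le64 sector;          /* AAD: bound into the tag */
        u8 iv[16];              /* AAD: per-sector IV */
        u8 *ciphertext;         /* encrypted payload */
        u8 auth_tag[16];        /* stored in the bip, opaque to lower layers */
};]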
On Sat, Aug 05, 2017 at 04:19:30PM -0400, Martin K. Petersen wrote:
> A bio flag is probably the path of least resistance. It already exists,
> actually: BIP_BLOCK_INTEGRITY. But we'll need to make sure it gets
> masked out when a bio is cloned. And then we can key off of that and
> REQ_OP_READ in the endio function.

bio_integrity_clone doesn't even clone it - it allocates a new bip and
then only copies over bip_vec, bip_vcnt and bip_iter.

I'll prepare a patch for this and will send it out together with the
original hotfix for the verify_fn, which also needs a trivial line-length
fix anyway.
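[For reference, 4.13-era bio_integrity_clone() looks roughly like this
(paraphrased from block/bio-integrity.c, error handling trimmed); note
that bip_flags is simply never propagated to the clone:

#include <linux/bio.h>

/* Paraphrase: only the vector, the vector count and the iterator
 * (which contains the seed) are copied to the clone's new bip.
 */
int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
                        gfp_t gfp_mask)
{
        struct bio_integrity_payload *bip_src = bio_integrity(bio_src);
        struct bio_integrity_payload *bip;

        bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt);
        if (IS_ERR(bip))
                return PTR_ERR(bip);

        memcpy(bip->bip_vec, bip_src->bip_vec,
               bip_src->bip_vcnt * sizeof(struct bio_vec));
        bip->bip_vcnt = bip_src->bip_vcnt;
        bip->bip_iter = bip_src->bip_iter;
        /* bip_src->bip_flags is not copied */

        return 0;
}]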
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f733beab6ca2..8df4eb103ba9 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-static void bio_integrity_free(struct bio *bio)
+void bio_integrity_free(struct bio *bio)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@ static void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
-	bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -326,6 +326,12 @@ bool bio_integrity_prep(struct bio *bio)
 		offset = 0;
 	}
 
+	/* Install custom I/O completion handler if read verify is enabled */
+	if (bio_data_dir(bio) == READ) {
+		bip->bip_end_io = bio->bi_end_io;
+		bio->bi_end_io = bio_integrity_endio;
+	}
+
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE) {
 		bio_integrity_process(bio, &bio->bi_iter,
@@ -369,12 +375,13 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 		bio->bi_status = BLK_STS_IOERR;
 	}
 
-	bio_integrity_free(bio);
+	/* Restore original bio completion handler */
+	bio->bi_end_io = bip->bip_end_io;
 	bio_endio(bio);
 }
 
 /**
- * __bio_integrity_endio - Integrity I/O completion function
+ * bio_integrity_endio - Integrity I/O completion function
  * @bio:	Protected bio
  * @error:	Pointer to errno
  *
@@ -385,19 +392,27 @@ static void bio_integrity_verify_fn(struct work_struct *work)
  * in process context.  This function postpones completion
  * accordingly.
  */
-bool __bio_integrity_endio(struct bio *bio)
+void bio_integrity_endio(struct bio *bio)
 {
-	if (bio_op(bio) == REQ_OP_READ && !bio->bi_status) {
-		struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
 
-		INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
-		queue_work(kintegrityd_wq, &bip->bip_work);
-		return false;
+	BUG_ON(bip->bip_bio != bio);
+
+	/* In case of an I/O error there is no point in verifying the
+	 * integrity metadata.  Restore original bio end_io handler
+	 * and run it.
+	 */
+	if (bio->bi_status) {
+		bio->bi_end_io = bip->bip_end_io;
+		bio_endio(bio);
+
+		return;
 	}
 
-	bio_integrity_free(bio);
-	return true;
+	INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
+	queue_work(kintegrityd_wq, &bip->bip_work);
 }
+EXPORT_SYMBOL(bio_integrity_endio);
 
 /**
  * bio_integrity_advance - Advance integrity vector
diff --git a/block/bio.c b/block/bio.c
index 9cabf5d0be20..a6b225324a61 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,9 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 void bio_uninit(struct bio *bio)
 {
 	bio_disassociate_task(bio);
+
+	if (bio_integrity(bio))
+		bio_integrity_free(bio);
 }
 EXPORT_SYMBOL(bio_uninit);
 
@@ -1810,8 +1813,6 @@ void bio_endio(struct bio *bio)
 again:
 	if (!bio_remaining_done(bio))
 		return;
-	if (!bio_integrity_endio(bio))
-		return;
 
 	/*
 	 * Need to have a real endio function for chained bios, otherwise
diff --git a/block/blk.h b/block/blk.h
index 3a3d715bd725..01ebb8185f6b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -81,21 +81,10 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
-bool __bio_integrity_endio(struct bio *);
-static inline bool bio_integrity_endio(struct bio *bio)
-{
-	if (bio_integrity(bio))
-		return __bio_integrity_endio(bio);
-	return true;
-}
 #else
 static inline void blk_flush_integrity(void)
 {
 }
-static inline bool bio_integrity_endio(struct bio *bio)
-{
-	return true;
-}
 #endif
 
 void blk_timeout_work(struct work_struct *work);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7b1cf4ba0902..1eba19580185 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -320,6 +320,8 @@ struct bio_integrity_payload {
 
 	struct bvec_iter	bip_iter;
 
+	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
+
 	unsigned short		bip_slab;	/* slab the bip came from */
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
@@ -737,8 +739,10 @@ struct biovec_slab {
 	bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
+extern void bio_integrity_free(struct bio *);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
+extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
@@ -768,6 +772,11 @@ static inline bool bio_integrity_prep(struct bio *bio)
 	return true;
 }
 
+static inline void bio_integrity_free(struct bio *bio)
+{
+	return;
+}
+
 static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 				      gfp_t gfp_mask)
 {