Message ID: alpine.LRH.2.02.2011301217010.17848@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State: New, archived
Series: [v3] dm-crypt: fix deadlock when swapping to encrypted device
I am the first to admit I don't understand all the subtleties of the block layer, but I'm worried about a deadlock in close analogy to the one described in bio_alloc_bioset: https://elixir.bootlin.com/linux/latest/source/block/bio.c#L463 Hopefully y'all can clear up my understanding...

Specifically, if a segment of a target blocks in its map function, I believe that risks deadlock due to memory congestion. Such an action blocks the rest of the IO being submitted by the current thread, and the other IO currently being submitted may potentially be to other segments which could make progress and free memory.

As an example, suppose something submits a bio to a device; the device submits three child bios in its map / submit_bio function, some with their own data allocated, to different devices, which add up to all the reclaimable memory on the system. The first of these bios gets blocked by the target (perhaps because it needs memory, or is at capacity). Because submit_bio_noacct() traverses bio submittals depth-first, the other two child bios will not be submitted to their (different) devices until the first of these bios finishes being submitted; yet were those other two bios allowed to make progress, they might complete and free memory. (Admittedly, they might need to make further memory allocations to make progress. But in theory the mempool / bio_set reserve should allow forward progress on these other bios, I think.)

Even the limit this patch imposes only reduces, not eliminates, the problem. Consider a machine where fewer than 32768 bios exhaust the available memory; additional IOs will be accepted, but will block trying to allocate memory; that memory may be necessary to service the requests already in progress, causing a similar starvation of other memory-requiring work on the machine. It may be unlikely on a well-tuned machine for there to be less memory than dm-crypt needs to service 32768 IOs, but it's still possible.
I think it would be safer to stash excess bios on a list for future work, and, when congestion support exists again (I don't think it currently does, but I haven't kept good track), perhaps use that mechanism to signal when the device is at capacity. But I am probably being too paranoid and missing some subtlety above.

Less major, and hopefully clearer, thoughts:

dm-crypt already has a concept of the max number of pages allocatable for currently active IOs -- specifically DM_CRYPT_MEMORY_PERCENT per cpu. If you're trying to scale by the amount of memory on the system, perhaps going off DM_CRYPT_MEMORY_PERCENT is safer. I'm somewhat baffled why that mechanism isn't enough for the observed problem, tbh.

Perhaps it would be better to add a maximum-allocatable-objects mechanism, if it's safe, to mempool, dm, or bioset. If it were in bio_alloc_bioset, it could take advantage of the same rescuer workqueue to keep from blocking, potentially.

Your patch format doesn't work when I try to apply the patches via git apply patchfile for testing, while most patches do... Not sure if it's you or me, but it seems most patches from git send-email / git format-patch have some stats about inserts/deletes after the --- marker, and that marker seems missing from your messages. (Also, traditionally, I think the changes between patch versions go after the --- marker, so they don't end up in the commit's change description.)

Thanks!

John Dorminy

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hi

On Mon, 30 Nov 2020, John Dorminy wrote:

> I am the first to admit I don't understand all the subtleties of the
> block layer, but I'm worried about a deadlock in close analogy to that
> described in bio_alloc_bioset.
> https://elixir.bootlin.com/linux/latest/source/block/bio.c#L463
> Hopefully y'all can clear up my understanding...
>
> Specifically, if a segment of a target blocks in its map function, I
> believe that risks deadlock due to memory congestion. Such an action
> blocks the rest of the IO being submitted by the current thread, and
> the other IO currently being submitted may potentially be to other
> segments which may make progress and free memory.

All the request-based drivers block in their map function if there are more than 128 requests in flight (tunable in /sys/block/*/queue/nr_requests). So many of the device mapper targets block as well - for example, the linear target just passes a bio down to an underlying device, so it will block if the underlying device has more than 128 requests in flight.

If you don't block, kswapd will just send another bio - and that is the problem.

> As an example, suppose something submits a bio to a device; the device
> submits three child bios in its map / submit_bio function, some with
> their own data allocated, to different devices, which add up to all
> the reclaimable memory on the system. The first of these bios gets
> blocked by the target (perhaps because it needs memory, or is at
> capacity). Because submit_bio_noacct() traverses bio submittals
> depth-first, the other two child bios will not be submitted to their
> (different) devices until the first of these bios finishes being
> submitted; although were those other two bios allowed to make
> progress, they might complete and free memory. (Admittedly, they might
> need to make further memory allocations to make progress. But in
> theory the mempool / bio_set reserve should allow forward progress on
> these other bios, I think.)
> Even this limit only reduces, not eliminates, the problem. Consider a
> machine where fewer than 32768 bios exhaust the available memory;
> additional IOs will be accepted, but will block trying to allocate
> memory; that memory may be necessary to service the requests already
> in progress, causing a similar starvation of other memory-requiring
> work on the machine. It may be unlikely on a well-tuned machine for
> there to be less memory than dm-crypt needs to service 32768 IOs, but
> it's still possible.
>
> I think it would be safer to stash excess bios on a list for future
> work, and, when congestion support exists again (I don't think it

But if you don't block, you'll get more bios - so the list would grow without bound. And the excessive number of bios in flight will reduce system responsiveness.

> currently does, but I haven't kept good track), perhaps use that
> mechanism to signal when the device is at capacity. But I am probably
> being too paranoid and missing some subtlety above.
>
> Less major, and hopefully clearer, thoughts:
>
> dm-crypt already has a concept of the max number of pages allocatable
> for currently active IOs -- specifically DM_CRYPT_MEMORY_PERCENT per
> cpu. If you're trying to scale by the amount of memory on the system,
> perhaps going off DM_CRYPT_MEMORY_PERCENT is safer. I'm somewhat
> baffled why that mechanism isn't enough for the observed problem, tbh.

Perhaps we could lower DM_CRYPT_MEMORY_PERCENT, or perhaps we could remove it and impose the limit with a semaphore.

> Perhaps it would be better to add a maximum-allocatable-objects
> mechanism, if it's safe, to mempool, dm, or bioset. If it were in
> bio_alloc_bioset, it could take advantage of the same rescuer
> workqueue to keep from blocking, potentially.
dm-crypt is currently doing it - see:

	ret = mempool_init(&cc->page_pool, BIO_MAX_PAGES, crypt_alloc_page, crypt_free_page, cc);

> Your patch format doesn't work when I try to apply them via git apply
> patchfile for testing, while most patches do... Not sure if it's you
> or me, but it seems most patches from git send-email / git
> format-patch have some stats about inserts/deletes after the ---
> marker, which marker seems missing from your messages. (Also,
> traditionally, I think the changes between patch versions go after the
> --- marker, so they don't go in the change description of the commit.)

I use quilt, not git, to work on my patches.

> Thanks!
>
> John Dorminy

Mikulas
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -214,16 +214,24 @@ struct crypt_config {
 	mempool_t page_pool;
 	struct bio_set bs;
+
+	int bio_limit;
+	struct semaphore bio_limit_semaphore;
+	struct mutex bio_limit_lock;
+
 	struct mutex bio_alloc_lock;
 
 	u8 *authenc_key; /* space for keys in authenc() format (if used) */
 	u8 key[];
 };
 
+#define MAX_BIOS	(16 * 1048576 / PAGE_SIZE)
 #define MIN_IOS		64
 #define MAX_TAG_SIZE	480
 #define POOL_ENTRY_SIZE	512
 
+static int bio_limit = MAX_BIOS;
+
 static DEFINE_SPINLOCK(dm_crypt_clients_lock);
 static unsigned dm_crypt_clients_n = 0;
 static volatile unsigned long dm_crypt_pages_per_client;
@@ -1713,6 +1721,8 @@ static void crypt_dec_pending(struct dm_
 	kfree(io->integrity_metadata);
 
 	base_bio->bi_status = error;
+	if (bio_data_dir(base_bio) == WRITE)
+		up(&cc->bio_limit_semaphore);
 	bio_endio(base_bio);
 }
@@ -2567,6 +2577,7 @@ static void crypt_dtr(struct dm_target *
 	kfree_sensitive(cc->cipher_auth);
 	kfree_sensitive(cc->authenc_key);
 
+	mutex_destroy(&cc->bio_limit_lock);
 	mutex_destroy(&cc->bio_alloc_lock);
 
 	/* Must zero key material before freeing */
@@ -3007,6 +3018,7 @@ static int crypt_ctr(struct dm_target *t
 	int key_size;
 	unsigned int align_mask;
 	unsigned long long tmpll;
+	int latch;
 	int ret;
 	size_t iv_size_padding, additional_req_size;
 	char dummy;
@@ -3106,6 +3118,12 @@ static int crypt_ctr(struct dm_target *t
 		goto bad;
 	}
 
+	latch = READ_ONCE(bio_limit);
+	if (unlikely(latch <= 0))
+		latch = MAX_BIOS;
+	cc->bio_limit = latch;
+	sema_init(&cc->bio_limit_semaphore, latch);
+	mutex_init(&cc->bio_limit_lock);
 	mutex_init(&cc->bio_alloc_lock);
 
 	ret = -EINVAL;
@@ -3234,6 +3252,25 @@ static int crypt_map(struct dm_target *t
 	if (unlikely(bio->bi_iter.bi_size & (cc->sector_size - 1)))
 		return DM_MAPIO_KILL;
 
+	if (bio_data_dir(bio) == WRITE) {
+		int latch = READ_ONCE(bio_limit);
+		if (unlikely(latch <= 0))
+			latch = MAX_BIOS;
+		if (unlikely(cc->bio_limit != latch)) {
+			mutex_lock(&cc->bio_limit_lock);
+			while (latch < cc->bio_limit) {
+				down(&cc->bio_limit_semaphore);
+				cc->bio_limit--;
+			}
+			while (latch > cc->bio_limit) {
+				up(&cc->bio_limit_semaphore);
+				cc->bio_limit++;
+			}
+			mutex_unlock(&cc->bio_limit_lock);
+		}
+		down(&cc->bio_limit_semaphore);
+	}
+
 	io = dm_per_bio_data(bio, cc->per_bio_data_size);
 	crypt_io_init(io, cc, bio, dm_target_offset(ti, bio->bi_iter.bi_sector));
@@ -3461,6 +3498,9 @@ static void __exit dm_crypt_exit(void)
 module_init(dm_crypt_init);
 module_exit(dm_crypt_exit);
 
+module_param_named(max_bios_in_flight, bio_limit, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(max_bios_in_flight, "maximum number of bios in flight");
+
 MODULE_AUTHOR("Jana Saout <jana@saout.de>");
 MODULE_DESCRIPTION(DM_NAME " target for transparent encryption / decryption");
 MODULE_LICENSE("GPL");