
[v3] dm-crypt: fix deadlock when swapping to encrypted device

Message ID alpine.LRH.2.02.2011301217010.17848@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived
Headers show
Series [v3] dm-crypt: fix deadlock when swapping to encrypted device | expand

Commit Message

Mikulas Patocka Nov. 30, 2020, 5:30 p.m. UTC
Hi

This is the third version of the "swapping on dm-crypt" patch. The only 
change is that MAX_BIOS is now defined to be inversely proportional to the 
page size:

#define MAX_BIOS       (16 * 1048576 / PAGE_SIZE)

I tested it on a machine with 8 GiB of RAM, swapping to both a rotational 
disk and an SSD. On the SSD, the system didn't lock up even with an 
increased number of bios, but when the limit was removed entirely, it 
triggered the OOM killer prematurely and killed the process that was 
allocating memory.

On the rotational disk, I got soft lockups at 32768 bios or more. With 
smaller values, the machine remained somewhat responsive during swapping.



From: Mikulas Patocka <mpatocka@redhat.com>

The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory to hold
the encrypted data. These allocations can exhaust all available memory,
and the result is either a deadlock or an OOM kill.

This patch limits the number of in-flight bios, so that the memory
consumed by dm-crypt is limited. If we are over the limit, we block in the
function crypt_map, so that the caller will not attempt to send more bios.

This is similar to request-based drivers - they will also block when the
number of bios is over the limit.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

John Dorminy Nov. 30, 2020, 10:29 p.m. UTC | #1
I am the first to admit I don't understand all the subtleties of the
block layer, but I'm worried about a deadlock in close analogy to that
described in bio_alloc_bioset.
https://elixir.bootlin.com/linux/latest/source/block/bio.c#L463
Hopefully y'all can clear up my understanding...

Specifically, if a segment of a target blocks in its map function, I
believe that risks deadlock due to memory congestion. Such an action
blocks the rest of the IO being submitted by the current thread, and
the other IO currently being submitted may potentially be to other
segments which may make progress and free memory.

As an example, suppose something submits a bio to a device; the device
submits three child bios in its map / submit_bio function, some with
their own data allocated, to different devices, which add up to all
the reclaimable memory on the system. The first of these bios gets
blocked by the target (perhaps because it needs memory, or is at
capacity). Because submit_bio_noacct() traverses bio submittals
depth-first, the other two child bios will not be submitted to their,
different, devices until the first of these bios finishes being
submitted; although were those other two bios allowed to make
progress, they might complete and free memory. (Admittedly, they might
need to make further memory allocations to make progress. But in
theory the mempool / bio_set reserve should allow forward progress on
these other bios, I think.).

Even this limit only reduces, not eliminates, the problem. Consider a
machine where fewer than 32768 bios exhaust the available memory;
additional IOs will be accepted, but will block trying to
allocate memory; that memory may be necessary to service the requests
already in progress, causing a similar starvation of other
memory-requiring work on the machine. It may be unlikely on a
well-tuned machine for there to be less memory than dm-crypt needs
to service 32768 IOs, but it's still possible.

I think it would be safer to stash excess bios on a list for future
work, and, when congestion support exists again (I don't think it
currently does, but I haven't kept good track), perhaps use that
mechanism to signal when the device is at capacity. But I am probably
being too paranoid and missing some subtlety above.

Less major, and hopefully clearer, thoughts:

dm-crypt already has a concept of the max number of pages allocatable
for currently active IOs -- specifically DM_CRYPT_MEMORY_PERCENT per
cpu. If you're trying to scale by amount of memory on the system,
perhaps going off DM_CRYPT_MEMORY_PERCENT is safer. I'm somewhat
baffled why that mechanism isn't enough for the observed problem, tbh.

Perhaps it would be better to add a maximum allocatable objects
mechanism, if it's safe, to mempool, dm, or bioset. If it were in
bio_alloc_bioset, it could take advantage of the same rescuer
workqueue to keep from blocking, potentially.

Your patch format doesn't work when I try to apply your patches via
git apply for testing, while most patches do apply... Not sure if it's
you or me, but most patches from git send-email / git format-patch have
some stats about inserts/deletes after the --- marker, and that marker
seems to be missing from your messages. (Also, traditionally, I think
the changes between patch versions go after the --- marker, so they
don't end up in the commit's change description.)

Thanks!

John Dorminy

Mikulas Patocka Jan. 8, 2021, 5:07 p.m. UTC | #2
Hi


On Mon, 30 Nov 2020, John Dorminy wrote:

> I am the first to admit I don't understand all the subtleties of the
> block layer, but I'm worried about a deadlock in close analogy to that
> described in bio_alloc_bioset.
> https://elixir.bootlin.com/linux/latest/source/block/bio.c#L463
> Hopefully y'all can clear up my understanding...
> 
> Specifically, if a segment of a target blocks in its map function, I
> believe that risks deadlock due to memory congestion. Such an action
> blocks the rest of the IO being submitted by the current thread, and
> the other IO currently being submitted may potentially be to other
> segments which may make progress and free memory.

All the request-based drivers block in their map function if there are more 
than 128 requests in flight (tunable via /sys/block/*/queue/nr_requests).

So, many of the device mapper targets block as well - for example, the 
linear target just passes a bio down to an underlying device - so it will 
block if the underlying device has more than 128 requests in flight.

If you don't block, kswapd will just send another bio - and that is the 
problem.

> As an example, suppose something submits a bio to a device; the device
> submits three child bios in its map / submit_bio function, some with
> their own data allocated, to different devices, which add up to all
> the reclaimable memory on the system. The first of these bios gets
> blocked by the target (perhaps because it needs memory, or is at
> capacity). Because submit_bio_noacct() traverses bio submittals
> depth-first, the other two child bios will not be submitted to their,
> different, devices until the first of these bios finishes being
> submitted; although were those other two bios allowed to make
> progress, they might complete and free memory. (Admittedly, they might
> need to make further memory allocations to make progress. But in
> theory the mempool / bio_set reserve should allow forward progress on
> these other bios, I think.).
> 
> Even this limit only reduces, not eliminates, the problem. Consider a
> machine where fewer than 32768 bios exhausts the available memory;
> additional IOs will be accepted, but will be blocked by trying to
> allocate memory; that memory may be necessary to service the requests
> already in progress, causing a similar starvation of other
> memory-requiring work on the machine. It may be less likely on a
> well-tuned machine for there to be less memory than needed for
> dm-crypt to service 32768 IOs, but it's still possible.
> 
> I think it would be safer to stash excess bios on a list for future
> work, and, when congestion support exists again (I don't think it

But if you don't block, you'll get more bios - so the list would grow 
without bound.

And the excessive amount of bios in flight will reduce system 
responsiveness.

> currently does, but I haven't kept good track), perhaps use that
> mechanism to signal when the device is at capacity. But I am probably
> being too paranoid and missing some subtlety above.
> 
> Less major, and hopefully clearer, thoughts:
> 
> dm-crypt already has a concept of the max number of pages allocatable
> for currently active IOs -- specifically DM_CRYPT_MEMORY_PERCENT per
> cpu. If you're trying to scale by amount of memory on the system,
> perhaps going off DM_CRYPT_MEMORY_PERCENT is safer. I'm somewhat
> baffled why that mechanism isn't enough for the observed problem, tbh.

Perhaps we could lower DM_CRYPT_MEMORY_PERCENT, or perhaps we could remove 
it and enforce the limit with a semaphore.

> Perhaps it would be better to add a maximum allocatable objects
> mechanism, if it's safe, to mempool, dm, or bioset. If it were in
> bio_alloc_bioset, it could take advantage of the same rescuer
> workqueue to keep from blocking, potentially.

dm-crypt is currently doing it - see "ret = mempool_init(&cc->page_pool, 
BIO_MAX_PAGES, crypt_alloc_page, crypt_free_page, cc);"

> Your patch format doesn't work when I try to apply them via git apply
> patchfile for testing, while most patches do... Not sure if it's you
> or me, but it seems most patches from git send-email / git
> format-patch have some stats about inserts/deletes after the ---
> marker, which marker seems missing from your messages. (Also,
> traditionally, I think the changes between patch versions go after the
> --- marker, so they don't go in the change description of the commit.)

I use quilt, not git, to work on my patches.

> Thanks!
> 
> John Dorminy

Mikulas


Patch

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -214,16 +214,24 @@  struct crypt_config {
 	mempool_t page_pool;
 
 	struct bio_set bs;
+
+	int bio_limit;
+	struct semaphore bio_limit_semaphore;
+	struct mutex bio_limit_lock;
+
 	struct mutex bio_alloc_lock;
 
 	u8 *authenc_key; /* space for keys in authenc() format (if used) */
 	u8 key[];
 };
 
+#define MAX_BIOS	(16 * 1048576 / PAGE_SIZE)
 #define MIN_IOS		64
 #define MAX_TAG_SIZE	480
 #define POOL_ENTRY_SIZE	512
 
+static int bio_limit = MAX_BIOS;
+
 static DEFINE_SPINLOCK(dm_crypt_clients_lock);
 static unsigned dm_crypt_clients_n = 0;
 static volatile unsigned long dm_crypt_pages_per_client;
@@ -1713,6 +1721,8 @@  static void crypt_dec_pending(struct dm_
 		kfree(io->integrity_metadata);
 
 	base_bio->bi_status = error;
+	if (bio_data_dir(base_bio) == WRITE)
+		up(&cc->bio_limit_semaphore);
 	bio_endio(base_bio);
 }
 
@@ -2567,6 +2577,7 @@  static void crypt_dtr(struct dm_target *
 	kfree_sensitive(cc->cipher_auth);
 	kfree_sensitive(cc->authenc_key);
 
+	mutex_destroy(&cc->bio_limit_lock);
 	mutex_destroy(&cc->bio_alloc_lock);
 
 	/* Must zero key material before freeing */
@@ -3007,6 +3018,7 @@  static int crypt_ctr(struct dm_target *t
 	int key_size;
 	unsigned int align_mask;
 	unsigned long long tmpll;
+	int latch;
 	int ret;
 	size_t iv_size_padding, additional_req_size;
 	char dummy;
@@ -3106,6 +3118,12 @@  static int crypt_ctr(struct dm_target *t
 		goto bad;
 	}
 
+	latch = READ_ONCE(bio_limit);
+	if (unlikely(latch <= 0))
+		latch = MAX_BIOS;
+	cc->bio_limit = latch;
+	sema_init(&cc->bio_limit_semaphore, latch);
+	mutex_init(&cc->bio_limit_lock);
 	mutex_init(&cc->bio_alloc_lock);
 
 	ret = -EINVAL;
@@ -3234,6 +3252,25 @@  static int crypt_map(struct dm_target *t
 	if (unlikely(bio->bi_iter.bi_size & (cc->sector_size - 1)))
 		return DM_MAPIO_KILL;
 
+	if (bio_data_dir(bio) == WRITE) {
+		int latch = READ_ONCE(bio_limit);
+		if (unlikely(latch <= 0))
+			latch = MAX_BIOS;
+		if (unlikely(cc->bio_limit != latch)) {
+			mutex_lock(&cc->bio_limit_lock);
+			while (latch < cc->bio_limit) {
+				down(&cc->bio_limit_semaphore);
+				cc->bio_limit--;
+			}
+			while (latch > cc->bio_limit) {
+				up(&cc->bio_limit_semaphore);
+				cc->bio_limit++;
+			}
+			mutex_unlock(&cc->bio_limit_lock);
+		}
+		down(&cc->bio_limit_semaphore);
+	}
+
 	io = dm_per_bio_data(bio, cc->per_bio_data_size);
 	crypt_io_init(io, cc, bio, dm_target_offset(ti, bio->bi_iter.bi_sector));
 
@@ -3461,6 +3498,9 @@  static void __exit dm_crypt_exit(void)
 module_init(dm_crypt_init);
 module_exit(dm_crypt_exit);
 
+module_param_named(max_bios_in_flight, bio_limit, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(max_bios_in_flight, "maximum number of bios in flight");
+
 MODULE_AUTHOR("Jana Saout <jana@saout.de>");
 MODULE_DESCRIPTION(DM_NAME " target for transparent encryption / decryption");
 MODULE_LICENSE("GPL");