diff mbox series

[v3,3/3] dmapool: Use xarray for vaddr-to-block lookup

Message ID 20241122211144.4186080-1-bjohannesmeyer@gmail.com (mailing list archive)
State New
Headers show
Series [v3,1/3] dmapool: Move pool metadata into non-DMA memory | expand

Commit Message

Brian Johannesmeyer Nov. 22, 2024, 9:11 p.m. UTC
Optimize the performance of `dma_pool_free()` by implementing an xarray to
map a `vaddr` to its corresponding `block`. This eliminates the need to
iterate through the entire `page_list` for vaddr-to-block translation,
thereby improving performance.

Performance results from the `DMAPOOL_TEST` test show the improvement.
Before the patch:
```
dmapool test: size:16   align:16   blocks:8192 time:34432
dmapool test: size:64   align:64   blocks:8192 time:62262
dmapool test: size:256  align:256  blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68   align:32   blocks:8192 time:88243
```

After the patch:
```
dmapool test: size:16   align:16   blocks:8192 time:37954
dmapool test: size:64   align:64   blocks:8192 time:40036
dmapool test: size:256  align:256  blocks:8192 time:41942
dmapool test: size:1024 align:1024 blocks:2048 time:10964
dmapool test: size:4096 align:4096 blocks:1024 time:6101
dmapool test: size:68   align:32   blocks:8192 time:41307
```

This change reduces the runtime overhead, particularly for larger block
sizes.

Co-developed-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
---
 mm/dmapool.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

Comments

Brian Johannesmeyer Nov. 22, 2024, 9:11 p.m. UTC | #1
We discovered a security-related issue in the DMA pool allocator.

V1 of our RFC was submitted to the Linux kernel security team. They
recommended submitting it to the relevant subsystem maintainers and the
hardening mailing list instead, as they did not consider this an explicit
security issue. Their rationale was that Linux implicitly assumes hardware
can be trusted.

**Threat Model**: While Linux drivers typically trust their hardware, there
may be specific drivers that do not operate under this assumption. Hence,
this threat model assumes a malicious peripheral device capable of
corrupting DMA data to exploit the kernel. In this scenario, the device
manipulates kernel-initialized data (similar to the attack described in the
Thunderclap paper [0]) to achieve arbitrary kernel memory corruption. 

**DMA pool background**. A DMA pool aims to reduce the overhead of DMA
allocations by creating a large DMA buffer --- the "pool" --- from which
smaller buffers are allocated as needed. Fundamentally, a DMA pool
functions like a heap: it is a structure composed of linked memory
"blocks", which, in this context, are DMA buffers. When a driver employs a
DMA pool, it grants the device access not only to these blocks but also to
the pointers linking them.

**Vulnerability**. Similar to traditional heap corruption vulnerabilities
--- where a malicious program corrupts heap metadata to e.g., hijack
control flow --- a malicious device may corrupt DMA pool metadata. This
corruption can trivially lead to arbitrary kernel memory corruption from
any driver that uses it. Indeed, because the DMA pool API is extensively
used, this vulnerability is not confined to a single instance. In fact,
every usage of the DMA pool API is potentially vulnerable. An exploit
proceeds with the following steps:

1. The DMA `pool` initializes its list of blocks, then points to the first
block.
2. The malicious device overwrites the first 8 bytes of the first block ---
which contain its `next_block` pointer --- to an arbitrary kernel address,
`kernel_addr`.
3. The driver makes its first call to `dma_pool_alloc()`, after which, the
pool should point to the second block. However, it instead points to
`kernel_addr`.
4. The driver again calls `dma_pool_alloc()`, which incorrectly returns
`kernel_addr`. Therefore, anytime the driver writes to this "block", it may
corrupt sensitive kernel data.

I have a PDF document that illustrates how these steps work. Please let me
know if you would like me to share it with you.

**Proposed mitigation**. To mitigate the corruption of DMA pool metadata
(i.e., the pointers linking the blocks), the metadata should be moved into
non-DMA memory, ensuring it cannot be altered by a device. I have included
a patch series that implements this change. I have tested the patches with
the `DMAPOOL_TEST` test and my own basic unit tests that ensure the DMA
pool allocator is not vulnerable.

**Performance**. I evaluated the patch set's performance by running the
`DMAPOOL_TEST` test with/without the patches applied. Here is its output
*without* the patches applied:
```
dmapool test: size:16   align:16   blocks:8192 time:11860
dmapool test: size:64   align:64   blocks:8192 time:11951
dmapool test: size:256  align:256  blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68   align:32   blocks:8192 time:12050
```

And here is its output *with* the patches applied:
```
dmapool test: size:16   align:16   blocks:8192 time:37954
dmapool test: size:64   align:64   blocks:8192 time:40036
dmapool test: size:256  align:256  blocks:8192 time:41942
dmapool test: size:1024 align:1024 blocks:2048 time:10964
dmapool test: size:4096 align:4096 blocks:1024 time:6101
dmapool test: size:68   align:32   blocks:8192 time:41307
```

The patch set results in a 2.2x--2.6x runtime overhead, as demonstrated in
the performance results. AFAICT, most of this overhead originates from the
change in dma_pool_free()'s vaddr-to-block translation. Previously, the
translation was a simple typecast, but with the patches applied, it now
requires a lookup in an xarray. As Keith noted [1], achieving baseline
performance would likely require changing the API.

**Changes**
- V2 -> V3: (i) Use an xarray for vaddr-to-block translations, which
  improves the
performance of free operations. (ii) Remove the minimum DMA block size
constraint, as it is no longer necessary.
- V1 -> V2: Submit to public mailing lists.

Thanks,

Brian Johannesmeyer

[0] Link: https://www.csl.sri.com/~neumann/ndss-iommu.pdf
[1] Link:
https://patchwork.kernel.org/project/linux-mm/cover/20241119205529.3871048-1-bjohannesmeyer@gmail.com/#26130533

Brian Johannesmeyer (3):
  dmapool: Move pool metadata into non-DMA memory
  dmapool: Use pool_find_block() in pool_block_err()
  Use xarray for efficient vaddr-to-block translation

 mm/dmapool.c | 92 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 57 insertions(+), 35 deletions(-)
diff mbox series

Patch

diff --git a/mm/dmapool.c b/mm/dmapool.c
index f2b96be25412..1cc2cc87ab93 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -35,6 +35,7 @@ 
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/wait.h>
+#include <linux/xarray.h>
 
 #ifdef CONFIG_SLUB_DEBUG_ON
 #define DMAPOOL_DEBUG 1
@@ -59,6 +60,7 @@  struct dma_pool {		/* the pool */
 	unsigned int boundary;
 	char name[32];
 	struct list_head pools;
+	struct xarray block_map;
 };
 
 struct dma_page {		/* cacheable header for 'allocation' bytes */
@@ -96,23 +98,7 @@  static DEVICE_ATTR_RO(pools);
 
 static struct dma_block *pool_find_block(struct dma_pool *pool, void *vaddr)
 {
-	struct dma_page *page;
-	size_t offset, index;
-
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (vaddr < page->vaddr)
-			continue;
-		offset = vaddr - page->vaddr;
-		if (offset >= pool->allocation)
-			continue;
-
-		index = offset / pool->size;
-		if (index >= page->blocks_per_page)
-			return NULL;
-
-		return &page->blocks[index];
-	}
-	return NULL;
+	return xa_load(&pool->block_map, (unsigned long)vaddr);
 }
 
 #ifdef DMAPOOL_DEBUG
@@ -273,6 +259,7 @@  struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	retval->boundary = boundary;
 	retval->allocation = allocation;
 	INIT_LIST_HEAD(&retval->pools);
+	xa_init(&retval->block_map);
 
 	/*
 	 * pools_lock ensures that the ->dma_pools list does not get corrupted.
@@ -324,6 +311,12 @@  static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 		block->dma = page->dma + offset;
 		block->next_block = NULL;
 
+		if (xa_err(xa_store(&pool->block_map, (unsigned long)block->vaddr,
+				    block, GFP_KERNEL))) {
+			pr_err("dma_pool: Failed to store block in xarray\n");
+			return;
+		}
+
 		if (last)
 			last->next_block = block;
 		else
@@ -385,6 +378,7 @@  void dma_pool_destroy(struct dma_pool *pool)
 	if (unlikely(!pool))
 		return;
 
+	xa_destroy(&pool->block_map);
 	mutex_lock(&pools_reg_lock);
 	mutex_lock(&pools_lock);
 	list_del(&pool->pools);