From: Tony Battersby <tonyb@cybernetics.com>
Subject: [PATCH v2 8/9] dmapool: reduce footprint in struct page
Date: Thu, 2 Aug 2018 16:01:12 -0400
Message-ID: <0ccfd31b-0a3f-9ae8-85c8-e176cd5453a9@cybernetics.com>
To: Matthew Wilcox, Christoph Hellwig, Marek Szyprowski, Sathya Prakash,
    Chaitra P B, Suganath Prabu Subramani, iommu@lists.linux-foundation.org,
    linux-mm@kvack.org, linux-scsi@vger.kernel.org,
    MPT-FusionLinux.pdl@broadcom.com

This is my attempt to shrink 'dma_free_o' and 'dma_in_use' in 'struct page'
(originally 'offset' and 'in_use' in 'struct dma_page') to 16 bits so that
it is unnecessary to use the '_mapcount' field of 'struct page'.  However,
it adds complexity and makes allocating and freeing up to 20% slower for
little gain, so I am NOT recommending that it be merged at this time.  I am
posting it just for reference in case someone finds it useful in the future.
The main difficulty is supporting archs that have PAGE_SIZE > 64 KiB, for
which a 16-bit byte offset is insufficient to cover the entire page.  So I
took the approach of converting everything from a "byte offset" into a
"block index".  That way the code can split any PAGE_SIZE into as many as
65535 blocks (one 16-bit index value is reserved for the freelist
terminator).  For example, with a PAGE_SIZE of 1 MiB, you get 65535 blocks
for 'size' <= 16.  But that introduces a lot of ugly math due to the
'boundary' checking, which makes the code slower and more complex.

I wrote a standalone program that iterates over all the combinations of
PAGE_SIZE, 'size', and 'boundary' and performs a series of consistency
checks on pool_blk_idx_to_offset(), pool_offset_to_blk_idx(), and
pool_initialize_free_block_list().  The math may be ugly, but I am pretty
sure it is correct.  (A minimal sketch of that kind of check is appended
after the patch for illustration.)

One of the nice things about this is that dma_pool_free() can do some
additional sanity checks:

*) Check that the offset of the passed-in address corresponds to a valid
   block offset.

*) With DMAPOOL_DEBUG enabled, check that the number of blocks in the
   freelist exactly matches the number that should be there.  This improves
   the debug check I added in a previous patch by adding the calculation
   for pool->blks_per_alloc.

NOT for merging.
---
--- linux/include/linux/mm_types.h.orig	2018-08-01 12:25:25.000000000 -0400
+++ linux/include/linux/mm_types.h	2018-08-01 12:25:52.000000000 -0400
@@ -156,7 +156,8 @@ struct page {
 		struct {	/* dma_pool pages */
 			struct list_head dma_list;
 			dma_addr_t dma;
-			unsigned int dma_free_o;
+			unsigned short dma_free_idx;
+			unsigned short dma_in_use;
 		};
 
 		/** @rcu_head: You can use this to free a page by RCU. */
@@ -180,8 +181,6 @@ struct page {
 
 		unsigned int active;		/* SLAB */
 		int units;			/* SLOB */
-
-		unsigned int dma_in_use;	/* dma_pool pages */
 	};
 
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
--- linux/mm/dmapool.c.orig	2018-08-02 14:02:42.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-02 14:03:31.000000000 -0400
@@ -51,16 +51,25 @@
 #define	DMAPOOL_DEBUG 1
 #endif
 
+/*
+ * This matches the type of struct page::dma_free_idx, which is 16-bit to
+ * conserve space in struct page.
+ */
+typedef unsigned short pool_idx_t;
+#define POOL_IDX_MAX	USHRT_MAX
+
 struct dma_pool {		/* the pool */
 #define POOL_FULL_IDX	0
 #define POOL_AVAIL_IDX	1
 #define POOL_N_LISTS	2
 	struct list_head page_list[POOL_N_LISTS];
 	spinlock_t lock;
-	size_t size;
 	struct device *dev;
-	size_t allocation;
-	size_t boundary;
+	unsigned int size;
+	unsigned int allocation;
+	unsigned int boundary_shift;
+	unsigned int blks_per_boundary;
+	unsigned int blks_per_alloc;
 	char name[32];
 	struct list_head pools;
 };
@@ -103,9 +112,9 @@ show_pools(struct device *dev, struct de
 		spin_unlock_irq(&pool->lock);
 
 		/* per-pool info, no real statistics yet */
-		temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n",
+		temp = scnprintf(next, size, "%-16s %4u %4u %4u %2u\n",
 				 pool->name, blocks,
-				 pages * (pool->allocation / pool->size),
+				 pages * pool->blks_per_alloc,
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -141,6 +150,7 @@ static DEVICE_ATTR(pools, 0444, show_poo
 struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 				 size_t size, size_t align, size_t boundary)
 {
+	unsigned int boundary_shift;
 	struct dma_pool *retval;
 	size_t allocation;
 	bool empty = false;
@@ -150,10 +160,10 @@ struct dma_pool *dma_pool_create(const c
 	else if (align & (align - 1))
 		return NULL;
 
-	if (size == 0)
+	if (size == 0 || size > SZ_2G)
 		return NULL;
-	else if (size < 4)
-		size = 4;
+	else if (size < sizeof(pool_idx_t))
+		size = sizeof(pool_idx_t);
 
 	if ((size % align) != 0)
 		size = ALIGN(size, align);
@@ -165,6 +175,9 @@ struct dma_pool *dma_pool_create(const c
 	else if ((boundary < size) || (boundary & (boundary - 1)))
 		return NULL;
 
+	boundary_shift = get_count_order_long(min(boundary, allocation));
+	boundary = 1U << boundary_shift;
+
 	retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev));
 	if (!retval)
 		return retval;
@@ -177,8 +190,29 @@ struct dma_pool *dma_pool_create(const c
 	INIT_LIST_HEAD(&retval->page_list[1]);
 	spin_lock_init(&retval->lock);
 	retval->size = size;
-	retval->boundary = boundary;
 	retval->allocation = allocation;
+	retval->boundary_shift = boundary_shift;
+	retval->blks_per_boundary = boundary / size;
+	retval->blks_per_alloc =
+		(allocation / boundary) * retval->blks_per_boundary +
+		(allocation % boundary) / size;
+	if (boundary >= allocation || boundary % size == 0) {
+		/*
+		 * When the blocks are packed together, an individual block
+		 * will never cross the boundary, so the boundary doesn't
+		 * matter in this case.  Enable some faster codepaths that
+		 * skip boundary calculations for a small speedup.
+		 */
+		retval->blks_per_boundary = 0;
+	}
+	if (retval->blks_per_alloc > POOL_IDX_MAX) {
+		/*
+		 * This would only affect archs with large PAGE_SIZE.  Limit
+		 * the total number of blocks per allocation to avoid
+		 * overflowing dma_in_use and dma_free_idx.
+		 */
+		retval->blks_per_alloc = POOL_IDX_MAX;
+	}
 
 	INIT_LIST_HEAD(&retval->pools);
@@ -214,20 +248,73 @@ struct dma_pool *dma_pool_create(const c
 }
 EXPORT_SYMBOL(dma_pool_create);
 
+/*
+ * Convert the index of a block of size pool->size to its offset within an
+ * allocated chunk of memory of size pool->allocation.
+ */
+static unsigned int pool_blk_idx_to_offset(struct dma_pool *pool,
+					   unsigned int blk_idx)
+{
+	unsigned int offset;
+
+	if (pool->blks_per_boundary == 0) {
+		offset = blk_idx * pool->size;
+	} else {
+		offset = ((blk_idx / pool->blks_per_boundary) <<
+			  pool->boundary_shift) +
+			 (blk_idx % pool->blks_per_boundary) * pool->size;
+	}
+	return offset;
+}
+
+/*
+ * Convert an offset within an allocated chunk of memory of size
+ * pool->allocation to the index of the possibly-smaller block of size
+ * pool->size.  If the given offset is not located at the beginning of a
+ * valid block, then the return value will be >= pool->blks_per_alloc.
+ */
+static unsigned int pool_offset_to_blk_idx(struct dma_pool *pool,
+					   unsigned int offset)
+{
+	unsigned int blk_idx;
+
+	if (pool->blks_per_boundary == 0) {
+		blk_idx = (likely(offset % pool->size == 0))
+			? (offset / pool->size)
+			: pool->blks_per_alloc;
+	} else {
+		unsigned int offset_within_boundary =
+			offset & ((1U << pool->boundary_shift) - 1);
+		unsigned int idx_within_boundary =
+			offset_within_boundary / pool->size;
+
+		if (likely(offset_within_boundary % pool->size == 0 &&
+			   idx_within_boundary < pool->blks_per_boundary)) {
+			blk_idx = (offset >> pool->boundary_shift) *
+				  pool->blks_per_boundary +
+				  idx_within_boundary;
+		} else {
+			blk_idx = pool->blks_per_alloc;
+		}
+	}
+	return blk_idx;
+}
+
 static void pool_initialize_free_block_list(struct dma_pool *pool,
 					    void *vaddr)
 {
+	unsigned int next_boundary = 1U << pool->boundary_shift;
 	unsigned int offset = 0;
-	unsigned int next_boundary = pool->boundary;
+	unsigned int i;
+
+	for (i = 0; i < pool->blks_per_alloc; i++) {
+		*(pool_idx_t *)(vaddr + offset) = (pool_idx_t) i + 1;
 
-	do {
-		unsigned int next = offset + pool->size;
-		if (unlikely((next + pool->size) > next_boundary)) {
-			next = next_boundary;
-			next_boundary += pool->boundary;
+		offset += pool->size;
+		if (unlikely((offset + pool->size) > next_boundary)) {
+			offset = next_boundary;
+			next_boundary += 1U << pool->boundary_shift;
 		}
-		*(int *)(vaddr + offset) = next;
-		offset = next;
-	} while (offset < pool->allocation);
+	}
 }
 
 static struct page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
@@ -248,7 +335,7 @@ static struct page *pool_alloc_page(stru
 
 	page = virt_to_page(vaddr);
 	page->dma = dma;
-	page->dma_free_o = 0;
+	page->dma_free_idx = 0;
 	page->dma_in_use = 0;
 
 	return page;
@@ -272,8 +359,8 @@ static void pool_free_page(struct dma_po
 	page->dma_list.next = NULL;
 	page->dma_list.prev = NULL;
 	page->dma = 0;
-	page->dma_free_o = 0;
-	page_mapcount_reset(page);	/* clear dma_in_use */
+	page->dma_free_idx = 0;
+	page->dma_in_use = 0;
 
 	if (busy) {
 		dev_err(pool->dev,
@@ -342,9 +429,10 @@ EXPORT_SYMBOL(dma_pool_destroy);
 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 		     dma_addr_t *handle)
 {
+	unsigned int blk_idx;
+	unsigned int offset;
 	unsigned long flags;
 	struct page *page;
-	size_t offset;
 	void *retval;
 	void *vaddr;
 
@@ -370,9 +458,10 @@ void *dma_pool_alloc(struct dma_pool *po
  ready:
 	vaddr = page_to_virt(page);
 	page->dma_in_use++;
-	offset = page->dma_free_o;
-	page->dma_free_o = *(int *)(vaddr + offset);
-	if (page->dma_free_o >= pool->allocation) {
+	blk_idx = page->dma_free_idx;
+	offset = pool_blk_idx_to_offset(pool, blk_idx);
+	page->dma_free_idx = *(pool_idx_t *)(vaddr + offset);
+	if (page->dma_free_idx >= pool->blks_per_alloc) {
 		/* Move page from the "available" list to the "full" list. */
 		list_del(&page->dma_list);
 		list_add(&page->dma_list, &pool->page_list[POOL_FULL_IDX]);
@@ -383,8 +472,8 @@ void *dma_pool_alloc(struct dma_pool *po
 	{
 		int i;
 		u8 *data = retval;
-		/* page->dma_free_o is stored in first 4 bytes */
-		for (i = sizeof(page->dma_free_o); i < pool->size; i++) {
+		/* a pool_idx_t is stored at the beginning of the block */
+		for (i = sizeof(pool_idx_t); i < pool->size; i++) {
 			if (data[i] == POOL_POISON_FREED)
 				continue;
 			dev_err(pool->dev,
@@ -426,6 +515,7 @@ void dma_pool_free(struct dma_pool *pool
 	struct page *page;
 	unsigned long flags;
 	unsigned int offset;
+	unsigned int blk_idx;
 
 	if (unlikely(!virt_addr_valid(vaddr))) {
 		dev_err(pool->dev,
@@ -438,21 +528,28 @@ void dma_pool_free(struct dma_pool *pool
 	offset = offset_in_page(vaddr);
 	if (unlikely((dma - page->dma) != offset)) {
+ bad_vaddr:
 		dev_err(pool->dev,
 			"dma_pool_free %s, %p (bad vaddr)/%pad (or bad dma)\n",
 			pool->name, vaddr, &dma);
 		return;
 	}
+	blk_idx = pool_offset_to_blk_idx(pool, offset);
+	if (unlikely(blk_idx >= pool->blks_per_alloc))
+		goto bad_vaddr;
+
 	spin_lock_irqsave(&pool->lock, flags);
 #ifdef	DMAPOOL_DEBUG
 	{
 		void *page_vaddr = vaddr - offset;
-		unsigned int chain = page->dma_free_o;
-		size_t total_free = 0;
+		unsigned int chain_idx = page->dma_free_idx;
+		unsigned int n_free = 0;
+
+		while (chain_idx < pool->blks_per_alloc) {
+			unsigned int chain_offset;
 
-		while (chain < pool->allocation) {
-			if (unlikely(chain == offset)) {
+			if (unlikely(chain_idx == blk_idx)) {
 				spin_unlock_irqrestore(&pool->lock, flags);
 				dev_err(pool->dev,
 					"dma_pool_free %s, dma %pad already free\n",
@@ -461,15 +558,15 @@ void dma_pool_free(struct dma_pool *pool
 			}
 
 			/*
-			 * The calculation of the number of blocks per
-			 * allocation is actually more complicated than this
-			 * because of the boundary value.  But this comparison
-			 * does not need to be exact; it just needs to prevent
-			 * an endless loop in case a buggy driver causes a
-			 * circular loop in the freelist.
+			 * A buggy driver could corrupt the freelist by
+			 * use-after-free, buffer overflow, etc.  Besides
+			 * checking for corruption, this also prevents an
+			 * endless loop in case corruption causes a circular
+			 * loop in the freelist.
 			 */
-			total_free += pool->size;
-			if (unlikely(total_free >= pool->allocation)) {
+			if (unlikely(++n_free + page->dma_in_use >
+				     pool->blks_per_alloc)) {
+ freelist_corrupt:
 				spin_unlock_irqrestore(&pool->lock, flags);
 				dev_err(pool->dev,
 					"dma_pool_free %s, freelist corrupted\n",
@@ -477,20 +574,24 @@ void dma_pool_free(struct dma_pool *pool
 				return;
 			}
 
-			chain = *(int *)(page_vaddr + chain);
+			chain_offset = pool_blk_idx_to_offset(pool, chain_idx);
+			chain_idx =
+				*(pool_idx_t *) (page_vaddr + chain_offset);
 		}
+		if (n_free + page->dma_in_use != pool->blks_per_alloc)
+			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
 
 	page->dma_in_use--;
-	if (page->dma_free_o >= pool->allocation) {
+	if (page->dma_free_idx >= pool->blks_per_alloc) {
 		/* Move page from the "full" list to the "available" list. */
 		list_del(&page->dma_list);
 		list_add(&page->dma_list, &pool->page_list[POOL_AVAIL_IDX]);
 	}
-	*(int *)vaddr = page->dma_free_o;
-	page->dma_free_o = offset;
+	*(pool_idx_t *)vaddr = page->dma_free_idx;
+	page->dma_free_idx = blk_idx;
 	/*
 	 * Resist a temptation to do
 	 *	if (!is_page_busy(page)) pool_free_page(pool, page);
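
For illustration, below is a minimal userspace sketch of the kind of
consistency check described in the commit message above.  It is NOT the
author's standalone test program; the struct and helper names here
(pool_geometry, blk_idx_to_offset, offset_to_blk_idx) and the sample
geometry in main() are invented for the example.  The arithmetic mirrors
pool_blk_idx_to_offset(), pool_offset_to_blk_idx(), and the
pool->blks_per_alloc calculation from the patch, and it checks a single
combination of 'allocation', 'size', and 'boundary' rather than iterating
over all of them.

#include <assert.h>
#include <stdio.h>

struct pool_geometry {
	unsigned int size;		/* block size in bytes */
	unsigned int allocation;	/* bytes per allocated chunk */
	unsigned int boundary_shift;	/* log2 of the boundary */
	unsigned int blks_per_boundary;	/* 0 => boundary can be ignored */
	unsigned int blks_per_alloc;	/* usable blocks per chunk */
};

/* Same math as the patch's pool_blk_idx_to_offset(). */
static unsigned int blk_idx_to_offset(const struct pool_geometry *p,
				      unsigned int blk_idx)
{
	if (p->blks_per_boundary == 0)
		return blk_idx * p->size;
	return ((blk_idx / p->blks_per_boundary) << p->boundary_shift) +
	       (blk_idx % p->blks_per_boundary) * p->size;
}

/* Same math as the patch's pool_offset_to_blk_idx(). */
static unsigned int offset_to_blk_idx(const struct pool_geometry *p,
				      unsigned int offset)
{
	unsigned int off_in_b, idx_in_b;

	if (p->blks_per_boundary == 0)
		return (offset % p->size == 0) ? offset / p->size
					       : p->blks_per_alloc;

	off_in_b = offset & ((1U << p->boundary_shift) - 1);
	idx_in_b = off_in_b / p->size;
	if (off_in_b % p->size == 0 && idx_in_b < p->blks_per_boundary)
		return (offset >> p->boundary_shift) * p->blks_per_boundary +
		       idx_in_b;
	return p->blks_per_alloc;
}

int main(void)
{
	/* Sample geometry: 64 KiB chunks, 48-byte blocks, 4 KiB boundary. */
	struct pool_geometry p = { .size = 48, .allocation = 65536 };
	const unsigned int boundary = 4096;
	unsigned int i;

	p.boundary_shift = 12;	/* log2(4096) */
	p.blks_per_boundary = boundary / p.size;
	p.blks_per_alloc = (p.allocation / boundary) * p.blks_per_boundary +
			   (p.allocation % boundary) / p.size;

	for (i = 0; i < p.blks_per_alloc; i++) {
		unsigned int off = blk_idx_to_offset(&p, i);

		/* Each block must lie entirely inside the chunk... */
		assert(off + p.size <= p.allocation);
		/* ...must not cross a power-of-two boundary... */
		assert((off >> p.boundary_shift) ==
		       ((off + p.size - 1) >> p.boundary_shift));
		/* ...and index -> offset -> index must round-trip. */
		assert(offset_to_blk_idx(&p, off) == i);
	}
	printf("checked %u blocks\n", p.blks_per_alloc);
	return 0;
}

Compiled with any standard C compiler, all of the assertions should pass
for the sample geometry (48-byte blocks give blks_per_boundary = 85 and
blks_per_alloc = 1360, which exercises the slow path because 4096 is not a
multiple of 48).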