From: Tony Battersby
Date: Thu, 2 Aug 2018 15:56:14 -0400
Subject: [PATCH v2 0/9] mpt3sas and dmapool scalability
To: Matthew Wilcox, Christoph Hellwig, Marek Szyprowski,
    Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
    iommu@lists.linux-foundation.org, linux-mm, linux-scsi,
    MPT-FusionLinux.pdl@broadcom.com
X-Patchwork-Submitter: Tony Battersby
X-Patchwork-Id: 10554085
X-Mailing-List: linux-scsi@vger.kernel.org

Major changes since v1:

*) Replaced the red-black tree with virt_to_page(), which takes us to
   O(n) instead of O(n * log n).  The mpt3sas benchmarks only improved
   a little though (18 ms -> 17 ms on alloc and 19 ms -> 15 ms on free).

*) Eliminated struct dma_page.  dmapool private data are now stored
   directly in struct page.  So this patchset will now reduce memory
   usage in addition to increasing speed.

patches #1 - #7 are for merging
patch #8 is not for merging
patch #9 is up to the maintainers of mpt3sas

---

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation.
With an LSI/Broadcom SAS 9300-8i 12Gb/s HBA and max_sgl_entries=256,
during modprobe, mpt3sas does the equivalent of:

	chain_dma_pool = dma_pool_create(size = 128);
	for (i = 0; i < 373959; i++) {
		dma_addr[i] = dma_pool_alloc(chain_dma_pool);
	}

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

	for (i = 0; i < 373959; i++) {
		dma_pool_free(chain_dma_pool, dma_addr[i]);
	}
	dma_pool_destroy(chain_dma_pool);

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n^2) complexity, although dma_pool_free() is much worse due to
implementation details.  On my system, the dma_pool_free() loop above
takes about 9 seconds to run.  Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take
~30 seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool       1      21      192      1
reply_free pool                  1       1    41728      1
reply pool                       1       1  1335296      1
sense pool                       1       1   970272      1
chain pool                  373959  386048      128  12064
reply_post_free pool            12      12   166528     12

The first 8 patches in this series improve the scalability of the DMA
pool implementation, which significantly reduces the running time of
the DMA alloc/free loops.  The last patch modifies mpt3sas to replace
chain_dma_pool with direct calls to dma_alloc_coherent() and
dma_free_coherent(), which reduces its overhead even further.

The mpt3sas patch is independent of the dmapool patches; it can be
used with or without them.  If either the dmapool patches or the
mpt3sas patch is applied, then "modprobe mpt3sas", "rmmod mpt3sas",
and system shutdown/reboot with mpt3sas loaded are significantly
faster.
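
To make the O(n^2) free behavior concrete, here is a small user-space
model.  It is only a sketch and not the kernel code: it assumes that
each free has to walk a linked list of pages to find the page that owns
the address, and that new pages are added at the head of that list.
The names sim_page, find_page_steps() and simulate_free_steps() are
made up for illustration.  The block size (128) and blocks per page
(32, i.e. 386048 / 12064) match the chain pool numbers above.

```c
#include <stdlib.h>

#define BLOCK_SIZE      128UL
#define PAGE_BYTES      4096UL
#define BLOCKS_PER_PAGE (PAGE_BYTES / BLOCK_SIZE)	/* 32 */

/* Hypothetical stand-in for a pool page: a list node plus a base address. */
struct sim_page {
	struct sim_page *next;
	unsigned long base;	/* simulated DMA address of the page */
};

/*
 * Walk the list until the page containing addr is found.
 * Returns the number of list nodes visited.
 */
static unsigned long find_page_steps(const struct sim_page *head,
				     unsigned long addr)
{
	unsigned long steps = 0;

	for (const struct sim_page *p = head; p; p = p->next) {
		steps++;
		if (addr >= p->base && addr < p->base + PAGE_BYTES)
			break;
	}
	return steps;
}

/*
 * Build a pool of npages pages (each new page is pushed on the list
 * head), then "free" every block in allocation order and return the
 * total number of list nodes visited.  Returns 0 if memory for the
 * model itself cannot be allocated.
 */
static unsigned long long simulate_free_steps(unsigned long npages)
{
	struct sim_page *pages = calloc(npages, sizeof(*pages));
	struct sim_page *head = NULL;
	unsigned long long total = 0;

	if (!pages)
		return 0;

	for (unsigned long i = 0; i < npages; i++) {
		pages[i].base = (i + 1) * PAGE_BYTES;
		pages[i].next = head;	/* head insertion */
		head = &pages[i];
	}

	/* The oldest pages sit at the tail, so early frees walk the longest. */
	for (unsigned long i = 0; i < npages * BLOCKS_PER_PAGE; i++) {
		unsigned long addr = pages[i / BLOCKS_PER_PAGE].base +
				     (i % BLOCKS_PER_PAGE) * BLOCK_SIZE;

		total += find_page_steps(head, addr);
	}

	free(pages);
	return total;
}
```

In this model simulate_free_steps(500) visits 4008000 nodes and
simulate_free_steps(1000) visits 16016000, so doubling the number of
pages roughly quadruples the work.  A lookup that goes straight from
the address to its page would instead cost one step per free.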

Here are some benchmarks (of DMA alloc/free only, not the entire
modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:         350 ms (   1x)
  dmapool patches:   17 ms (  21x)
  mpt3sas patch:      7 ms (  51x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:        8901 ms (   1x)
  dmapool patches:   15 ms ( 618x)
  mpt3sas patch:      2 ms (4245x)

Considering that LSI/Broadcom offer an out-of-tree vendor driver that
works across multiple kernel versions that won't get the dmapool
patches, it may be worth it for them to patch mpt3sas to avoid the
problem on older kernels.  The downside is that the code is a bit more
complicated.  So I leave it to their judgement whether they think it is
worth it to apply the mpt3sas patch.