From patchwork Fri Sep 16 03:35:08 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12978052
Subject: [PATCH v2 00/18] Fix the DAX-gup mistake
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Jason Gunthorpe, Jan Kara, Christoph Hellwig, "Darrick J. Wong",
 Matthew Wilcox, John Hubbard, linux-fsdevel@vger.kernel.org,
 nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org,
 linux-ext4@vger.kernel.org
Date: Thu, 15 Sep 2022 20:35:08 -0700
Message-ID: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>
User-Agent: StGit/0.18-3-g996c

Changes since v1 [1]:
- Jason rightly pointed out that the approach taken in v1 still did not
  properly handle the case of waiting for all page pins to drop to zero.
  The new approach in this set fixes that and more closely mirrors what
  happens for typical pages, details below.

[1]: https://lore.kernel.org/nvdimm/166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com/

---

Typical pages have their reference count elevated when they are
allocated and installed in the page cache, elevated again when they are
mapped into userspace, and elevated for gup. The DAX-gup mistake is that
page-references were only ever taken for gup and the device backing the
memory was only pinned (get_dev_pagemap()) at gup time. That leaves a
hole where the page is mapped for userspace access without a pin on the
device.

Rework the DAX page reference scheme to be more like typical pages. DAX
pages start life at reference count 0, elevate their reference count at
map time and gup time.
Unlike typical pages that can be safely truncated from files while they
are pinned for gup, DAX pages can only be truncated while their
reference count is 0. The device is pinned via get_dev_pagemap()
whenever a DAX page transitions from _refcount 0 -> 1, and unpinned only
after the page has made the 1 -> 0 transition and been truncated from
its host inode.

To facilitate this reference counting and synchronization, a new
dax_zap_pages() operation is introduced that runs before any truncate
event. That dax_zap_pages() operation is carried out as a side effect of
any 'break layouts' event. Effectively, dax_zap_pages() and the new
DAX_ZAP flag (in the DAX-inode i_pages entries) mimic what _mapcount
tracks for typical pages. The zap state allows the XArray to cache
page->mapping information for entries until the page _refcount drops to
zero and the page is finally truncated from the file / no longer in use.

This hackery continues the status of DAX pages as special cases in the
VM. The thought is that carrying the XArray / mapping infrastructure
forward still allows for the continuation of the page-less DAX effort.
Otherwise, the work to convert DAX pages to behave like typical
vm_normal_page() pages needs more investigation to untangle transparent
huge page assumptions.

This passes the "ndctl:dax" suite of tests from the ndctl project.

Thanks to Jason for the discussion on v1 to come up with this new
approach.
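
As an illustration only, the lifecycle described above can be modeled in
plain userspace C. This is a minimal sketch under stated assumptions, not
code from the series: struct model_pgmap, struct model_dax_page, and the
model_*() helpers are hypothetical names, the 'pinned' flag assumes at
most one device pin per page, and having the zap step report busy state
is likewise a simplification made to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dev_pagemap: just counts device pins. */
struct model_pgmap {
	int pins;
};

/* Hypothetical stand-in for a DAX page together with its i_pages entry. */
struct model_dax_page {
	struct model_pgmap *pgmap;
	int refcount;	/* starts at 0, unlike a page-cache page */
	bool pinned;	/* assumption: at most one device pin per page */
	bool zapped;	/* models the DAX_ZAP flag on the entry */
	bool in_file;	/* entry still present in the host inode */
};

/* Map time or gup time: the 0 -> 1 transition pins the device. */
static void model_page_get(struct model_dax_page *page)
{
	if (page->refcount++ == 0 && !page->pinned) {
		page->pgmap->pins++;	/* models get_dev_pagemap() */
		page->pinned = true;
	}
}

/* Drop one reference; the device pin is NOT released here. */
static void model_page_put(struct model_dax_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* 'Break layouts' step: flag the entry, report whether it is still busy. */
static bool model_zap(struct model_dax_page *page)
{
	page->zapped = true;
	return page->refcount > 0;
}

/* Truncate succeeds only for a zapped page whose refcount is 0. */
static bool model_truncate(struct model_dax_page *page)
{
	if (!page->zapped || page->refcount != 0)
		return false;
	page->in_file = false;
	if (page->pinned) {
		page->pgmap->pins--;	/* models dropping the device pin */
		page->pinned = false;
	}
	return true;
}

/* Walk one page through map, gup, zap, and truncate; returns 0 on success. */
static int model_selftest(void)
{
	struct model_pgmap pgmap = { .pins = 0 };
	struct model_dax_page page = { .pgmap = &pgmap, .in_file = true };

	model_page_get(&page);		/* mapped into userspace */
	model_page_get(&page);		/* pinned for gup */
	if (pgmap.pins != 1 || page.refcount != 2)
		return -1;

	if (!model_zap(&page))		/* still busy: two references */
		return -1;
	if (model_truncate(&page))	/* must be refused while busy */
		return -1;

	model_page_put(&page);
	model_page_put(&page);
	if (pgmap.pins != 1)		/* pin survives until truncate */
		return -1;

	if (!model_truncate(&page))	/* idle and zapped: allowed */
		return -1;
	return (pgmap.pins == 0 && !page.in_file) ? 0 : -1;
}
```

Calling model_selftest() from a main() exercises the whole sequence. The
point the model is meant to show is that the device pin outlives the
final reference drop and is released only once the idle page is
truncated.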
---

Dan Williams (18):
      fsdax: Wait on @page not @page->_refcount
      fsdax: Use dax_page_idle() to document DAX busy page checking
      fsdax: Include unmapped inodes for page-idle detection
      ext4: Add ext4_break_layouts() to the inode eviction path
      xfs: Add xfs_break_layouts() to the inode eviction path
      fsdax: Rework dax_layout_busy_page() to dax_zap_mappings()
      fsdax: Update dax_insert_entry() calling convention to return an error
      fsdax: Cleanup dax_associate_entry()
      fsdax: Rework dax_insert_entry() calling convention
      fsdax: Manage pgmap references at entry insertion and deletion
      devdax: Minor warning fixups
      devdax: Move address_space helpers to the DAX core
      dax: Prep mapping helpers for compound pages
      devdax: add PUD support to the DAX mapping infrastructure
      devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
      mm/memremap_pages: Support initializing pages to a zero reference count
      fsdax: Delete put_devmap_managed_page_refs()
      mm/gup: Drop DAX pgmap accounting


 .clang-format             |    1
 drivers/Makefile          |    2
 drivers/dax/Kconfig       |    5
 drivers/dax/Makefile      |    1
 drivers/dax/bus.c         |   15 +
 drivers/dax/dax-private.h |    2
 drivers/dax/device.c      |   74 ++-
 drivers/dax/mapping.c     | 1055 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c       |    6
 drivers/nvdimm/Kconfig    |    1
 drivers/nvdimm/pmem.c     |    2
 fs/dax.c                  | 1049 ++------------------------------------------
 fs/ext4/inode.c           |   17 +
 fs/fuse/dax.c             |    9
 fs/xfs/xfs_file.c         |   16 -
 fs/xfs/xfs_inode.c        |    7
 fs/xfs/xfs_inode.h        |    6
 fs/xfs/xfs_super.c        |   22 +
 include/linux/dax.h       |  128 ++++-
 include/linux/huge_mm.h   |   23 -
 include/linux/memremap.h  |   29 +
 include/linux/mm.h        |   30 -
 mm/gup.c                  |   89 +---
 mm/huge_memory.c          |   54 --
 mm/memremap.c             |   46 +-
 mm/page_alloc.c           |    2
 mm/swap.c                 |    2
 27 files changed, 1415 insertions(+), 1278 deletions(-)
 create mode 100644 drivers/dax/mapping.c

base-commit: 1c23f9e627a7b412978b4e852793c5e3c3efc555