From patchwork Wed Sep 18 17:52:25 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexander Duyck X-Patchwork-Id: 11151057 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5D95D196C for ; Wed, 18 Sep 2019 17:52:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 32BE521907 for ; Wed, 18 Sep 2019 17:52:33 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="qyqlBWw1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727109AbfIRRw3 (ORCPT ); Wed, 18 Sep 2019 13:52:29 -0400 Received: from mail-oi1-f195.google.com ([209.85.167.195]:44344 "EHLO mail-oi1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725899AbfIRRw3 (ORCPT ); Wed, 18 Sep 2019 13:52:29 -0400 Received: by mail-oi1-f195.google.com with SMTP id w6so311391oie.11; Wed, 18 Sep 2019 10:52:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:user-agent:mime-version :content-transfer-encoding; bh=mGxYN8fKLFN6z5avB+andfhVV4A+WX96NEccYZ4KSJ0=; b=qyqlBWw1TSg7WCQxN8G2wS400TI3RlB/WNNQy/BASpbQH7dai4KuaUPYwfRbQJqr83 tymtB88GT+36Ujs21eCz4GigdJaiPddygAeTaQj1gzZJ+74zk4+UB0saoe9kz5DZz2M6 G0BSXzcf3Q2OPAQHUYbpfqWO4wu8YWfOR9EIHVjmO3q4XxLsPDkl0HprogQDrFPkYkfP /lr+jzZfDtW7yiTnakt1uKvk8Ma3MPDNgQgzxsp30RXU8TY7OJjOFvZJ3qs41Q5uHrPA 0InuSyusYyAZKzh4u57bf27Dr8GqXXXVKWk3VzKLHtEQr4qbjklhAn9g1vtqdMpKtLI3 E6Gw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:user-agent :mime-version:content-transfer-encoding; bh=mGxYN8fKLFN6z5avB+andfhVV4A+WX96NEccYZ4KSJ0=; b=X+lisZ+OG3i2tElSwm0EwFSASDhirJjWGsg8XIeFIuNPdDHsQVhzmV14wfLURPiOyy jz417QWhtyqttHmn30DY438fu8tt+wUrRXVYZgyMn2t8Dm2YO7AV2rjlpDDICiK16pu0 5jVn/wFIOOjech5rQVuTAL7Y4QZOHypvj8co96OUdEIq4ogsYl3Q0OHUkNSDFnbXfB1W OYhJxgXxVZQFmYiMtKgmd+ys9tMI2e8ZHQ1UZ6bmwSXRCwl4pXNE9moDNoE1yH+5HTFb UnwNmd7KKYoKuAfoWXfEgy0LEuR5vTSkED7uFdG4QFE43CgSJ2nhqMN8PYssTk90Own3 EfiQ== X-Gm-Message-State: APjAAAXdGq+k5dsXO8VQeTpFzVWHEdBwoAVMTXPopHkfRIq8NRY5mNUI AKv9LZydGa15fOA7uZ2aCTw= X-Google-Smtp-Source: APXvYqx3VRfarQJCxejKo+vbLPE/nAA+u3uNVEGvbb5/nNYHkpfJs6nxS+UG8lCLMNOa7yifkgs3Cg== X-Received: by 2002:a54:4f8a:: with SMTP id g10mr3314526oiy.147.1568829148372; Wed, 18 Sep 2019 10:52:28 -0700 (PDT) Received: from localhost.localdomain ([2001:470:b:9c3:9e5c:8eff:fe4f:f2d0]) by smtp.gmail.com with ESMTPSA id r7sm1994572oih.41.2019.09.18.10.52.26 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Sep 2019 10:52:27 -0700 (PDT) Subject: [PATCH v10 0/6] mm / virtio: Provide support for unused page reporting From: Alexander Duyck To: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, mst@redhat.com, david@redhat.com, dave.hansen@intel.com, linux-kernel@vger.kernel.org, willy@infradead.org, mhocko@kernel.org, linux-mm@kvack.org, vbabka@suse.cz, akpm@linux-foundation.org, mgorman@techsingularity.net, linux-arm-kernel@lists.infradead.org, osalvador@suse.de Cc: yang.zhang.wz@gmail.com, pagupta@redhat.com, konrad.wilk@oracle.com, nitesh@redhat.com, riel@surriel.com, lcapitulino@redhat.com, wei.w.wang@intel.com, aarcange@redhat.com, pbonzini@redhat.com, dan.j.williams@intel.com, alexander.h.duyck@linux.intel.com Date: Wed, 18 Sep 2019 10:52:25 -0700 Message-ID: <20190918175109.23474.67039.stgit@localhost.localdomain> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This series provides an asynchronous means of reporting to a hypervisor that a guest page is no longer in use and can have the data associated with it dropped. To do this I have implemented functionality that allows for what I am referring to as unused page reporting. The advantage of unused page reporting is that we can support a significant amount of memory over-commit with improved performance as we can avoid having to write/read memory from swap as the VM will instead actively participate in freeing unused memory so it doesn't have to be written. The functionality for this is fairly simple. When enabled it will allocate statistics to track the number of reported pages in a given free area. When the number of free pages exceeds this value plus a high water value, currently 32, it will begin performing page reporting which consists of pulling non-reported pages off of the free lists of a given zone and placing them into a scatterlist. The scatterlist is then given to the page reporting device and it will perform the required action to make the pages "reported", in the case of virtio-balloon this results in the pages being madvised as MADV_DONTNEED. After this they are placed back on their original free list. If they are not merged in freeing an additional bit is set indicating that they are a "reported" buddy page instead of a standard buddy page. The cycle then repeats with additional non-reported pages being pulled until the free areas all consist of reported pages. In order to try and keep the time needed to find a non-reported page to a minimum we maintain a "reported_boundary" pointer. This pointer is used by the get_unreported_pages iterator to determine at what point it should resume searching for non-reported pages. In order to guarantee pages do not get past the scan I have modified add_to_free_list_tail so that it will not insert pages behind the reported_boundary. If another process needs to perform a massive manipulation of the free list, such as compaction, it can either reset a given individual boundary which will push the boundary back to the list_head, or it can clear the bit indicating the zone is actively processing which will result in the reporting process resetting all of the boundaries for a given zone. I am leaving a number of things hard-coded such as limiting the lowest order processed to pageblock_order, and have left it up to the guest to determine what the limit is on how many pages it wants to allocate to process the hints. The upper limit for this is based on the size of the queue used to store the scatterlist. I wanted to avoid gaming the performance testing for this. As far as possible gain a significant performance improvement should be visible in cases where guests are forced to write/read from swap. As such, testing it would be more of a benchmark of copying a page from swap versus just allocating a zero page. I have been verifying that the memory is being freed using memhog to allocate all the memory on the guest, and then watching /proc/meminfo to verify the host sees the memory returned after the test completes. As far as possible regressions I have focused on cases where performing the hinting would be non-optimal, such as cases where the code isn't needed as memory is not over-committed, or the functionality is not in use. I have been using the will-it-scale/page_fault1 test running with 16 vcpus and have modified it to use Transparent Huge Pages. With this I see almost no difference with the patches applied and the feature disabled. Likewise I see almost no difference with the feature enabled, but the madvise disabled in the hypervisor due to a device being assigned. With the feature fully enabled in both guest and hypervisor I see a regression between -1.86% and -8.84% versus the baseline. I found that most of the overhead was due to the page faulting/zeroing that comes as a result of the pages having been evicted from the guest. For info on earlier versions you will need to follow the links provided with the respective versions. Changes from v9: https://lore.kernel.org/lkml/20190907172225.10910.34302.stgit@localhost.localdomain/ Updated cover page Dropped per-cpu page randomization entropy patch Added "to_tail" boolean value to __free_one_page to improve readability Renamed __shuffle_pick_tail to shuffle_pick_tail, avoiding extra inline function Dropped arm64 HUGLE_TLB_ORDER movement patch since it is no longer needed Significant rewrite of page reporting functionality Updated logic to support interruptions from compaction get_unreported_page will now walk through reported sections Moved free_list manipulators out of mmzone.h and into page_alloc.c Removed page_reporting.h include from mmzone.h Split page_reporting.h between include/linux/ and mm/ Added #include " to mm/page_reporting.h Renamed page_reporting_startup/shutdown to page_reporting_register/unregister Updated comments related to virtio page poison tracking feature --- Alexander Duyck (6): mm: Adjust shuffle code to allow for future coalescing mm: Use zone and order instead of free area in free_list manipulators mm: Introduce Reported pages mm: Add device side and notifier for unused page reporting virtio-balloon: Pull page poisoning config out of free page hinting virtio-balloon: Add support for providing unused page reports to host drivers/virtio/Kconfig | 1 drivers/virtio/virtio_balloon.c | 87 ++++++++- include/linux/mmzone.h | 60 ++---- include/linux/page-flags.h | 11 + include/linux/page_reporting.h | 31 +++ include/uapi/linux/virtio_balloon.h | 1 mm/Kconfig | 11 + mm/Makefile | 1 mm/compaction.c | 5 + mm/memory_hotplug.c | 2 mm/page_alloc.c | 194 +++++++++++++++---- mm/page_reporting.c | 350 +++++++++++++++++++++++++++++++++++ mm/page_reporting.h | 224 ++++++++++++++++++++++ mm/shuffle.c | 12 + mm/shuffle.h | 6 + 15 files changed, 893 insertions(+), 103 deletions(-) create mode 100644 include/linux/page_reporting.h create mode 100644 mm/page_reporting.c create mode 100644 mm/page_reporting.h --