From patchwork Wed Sep 12 00:43:53 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 10596519
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E946B109C
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 12 Sep 2018 00:44:06 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D884029AC1
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 12 Sep 2018 00:44:06 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id CC37729AD2; Wed, 12 Sep 2018 00:44:06 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 22AE129AC1
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 12 Sep 2018 00:44:06 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 999278E0002; Tue, 11 Sep 2018 20:44:04 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 920138E0001; Tue, 11 Sep 2018 20:44:04 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 7E7C68E0002; Tue, 11 Sep 2018 20:44:04 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-pl1-f200.google.com (mail-pl1-f200.google.com
 [209.85.214.200])
	by kanga.kvack.org (Postfix) with ESMTP id 3992B8E0001
	for <linux-mm@kvack.org>; Tue, 11 Sep 2018 20:44:04 -0400 (EDT)
Received: by mail-pl1-f200.google.com with SMTP id 90-v6so101219pla.18
        for <linux-mm@kvack.org>; Tue, 11 Sep 2018 17:44:04 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:mime-version:content-transfer-encoding;
        bh=JrnyRhT8h3Qll6dmZE52fpFmr7n4wHFzv23DcnKxrXU=;
        b=foCRpGgGeuSFruWc73Zoij5MW7me4GTshwGvrz3pptdirYfGfSYwiZvvKYGmNnRkdx
         ieBOqNd5Ey9y8xNjzht2bCCtmEPkWKUXefCS20AhBa3+tELXlyjFUi/xX+antLC5tIu3
         RPYLrdxktnCJHRDBfApI25YgPPGefq26hNlJKtqBWWjXKZiLQAOJnS44jDjPOL+ot97M
         yQitYKU++6eZyVRT0LJlAc9CjM1Crk1Fj2WbWEOSQml7pC5uHZmtxiSJ2ctfm/Fgb9bb
         sMeuDhTEwuWyn8G+dt4ygovVN4M1E++qaaMLd0kNicdMsVn8vfF2GC/OTpx2hp8jvbD0
         xvrg==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.24 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
X-Gm-Message-State: APzg51D0QFDMcCDGrlNFyqVnaZ8NOFLJaJwzN38wY6ahAoY1ZExVt8SO
	CR1BC60JOi6BO1TUBUTR13FA6Q+iaAl+tN0mcr4IYeVSSAUCkcHRGFB5mu2VfC8yKYoG4UmGuFr
	jlkRDnXpxBtvfiqkrV/BaZDmFBRqNdCdGb/MT2AWYVLjKlLAIg2Et2rFvA5ks8+XsVg==
X-Received: by 2002:a17:902:8697:: with SMTP id
 g23-v6mr29925651plo.292.1536713043868;
        Tue, 11 Sep 2018 17:44:03 -0700 (PDT)
X-Google-Smtp-Source: 
 ANB0VdaXDVdHq8ulP+m7bx4hXm11Y5LATAPYdCHXcOHNASbto+TqpRNdzflpuZXuXHq0uxRsdiD+
X-Received: by 2002:a17:902:8697:: with SMTP id
 g23-v6mr29925594plo.292.1536713042603;
        Tue, 11 Sep 2018 17:44:02 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1536713042; cv=none;
        d=google.com; s=arc-20160816;
        b=dYFYzOnzT7pk34QX1xpP/IB+q1Ityr1BBlzi01VcNGwxbnNrOrcUdk/ywR8CCQNSNa
         LbHosIgjOc4bxd/AcLJ89Srvr5roZovZpJOM0aHUe4AGInsBkA23GQ1uNpMT5EEUeyB2
         jN0teGRCR/dAId+RXE/FKBJBQqYOli/5+5QIXHcivTCLOYSFxw7ggCg0hW0O9wZk/V82
         yOylOTA6FZ/obJZojdlDXf09UX8vAFv9U3FCi1fFK+5sOJIC696eEuwm3gH1+J/GEMPZ
         FhHe5D6dN8l1J6Rez4gwu01sj3pBdb2/QLvTenL2+9PFtm2BSUGc8eDU88GAUdeQSUz1
         wEzg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:from;
        bh=JrnyRhT8h3Qll6dmZE52fpFmr7n4wHFzv23DcnKxrXU=;
        b=m9ybeeggZpPLt5IFfgbLfXYX9F+YP60BcjnPaj/ZXFHJPf6ddKFbS2rK90B+7DUXLh
         ++jDtNa0jdCr2GticxHhdXpbTvw9HZxoJPeEFZ228N19atwL5wXfjZgcoSICPOOQbDk/
         GoeQcGGA05RDT3RNpgZIhsYdQnxFccToMW6XKsl+v3dcICIuf6sQZCPQA3y/INdXzxBt
         b2hXBcViln+qm1FFu4yylQHLGeFFNk3gYwilqD9TNJS8FEbOAgyV/prmVkVl9J0N78im
         aiW7s0ulIecZfoh99Hu+ldhVl95dHZyIT5ZviMtCEwaqnOVAv30cDKfIJxYf3y2ZGYxX
         XvJA==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.24 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24])
        by mx.google.com with ESMTPS id
 p21-v6si20648717plq.338.2018.09.11.17.44.02
        for <linux-mm@kvack.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Tue, 11 Sep 2018 17:44:02 -0700 (PDT)
Received-SPF: pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.24 as permitted sender) client-ip=134.134.136.24;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.24 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
  by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 11 Sep 2018 17:44:01 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.53,362,1531810800";
   d="scan'208";a="69283776"
Received: from unknown (HELO yhuang-mobile.sh.intel.com) ([10.239.198.87])
  by fmsmga007.fm.intel.com with ESMTP; 11 Sep 2018 17:43:56 -0700
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shaohua Li <shli@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Subject: [PATCH -V5 RESEND 00/21] swap: Swapout/swapin THP in one piece
Date: Wed, 12 Sep 2018 08:43:53 +0800
Message-Id: <20180912004414.22583-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.16.4
MIME-Version: 1.0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And for everyone, any comments are welcome!

This patchset is based on the 2018-09-04 head of mmotm/master.

This is the final step of THP (Transparent Huge Page) swap
optimization.  After the first and second steps, huge page splitting
is delayed from almost the first step of swapout until after swapout
has finished.  In this step, we avoid splitting the THP for swapout
and swap the THP out and in as one piece.

We tested the patchset with the swap-w-seq test case of the
vm-scalability benchmark.  The test case forks 16 processes.  Each
process allocates a large anonymous memory range and writes it from
beginning to end for 8 rounds.  The first round will swap out, while
the remaining rounds will swap in and swap out.  The test was done on
a Xeon E5 v3 system; the swap device used is a RAM-simulated PMEM
(persistent memory) device.  The test results are as follows:

            base                  optimized
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
   1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
   1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
   1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
  28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
  64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
     13.91 ±  5%     -13.8        0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

The benchmark score (bytes written per second) improved by 992.8%.
The swapout/swapin throughput improved by 1008% (from about 2.17GB/s
to 24.04GB/s).  The performance difference is huge.  In the base
kernel, the THP is swapped out and split during the first round of
writing, so in the remaining rounds there is only normal page swapin
and swapout.  In the optimized kernel, the THP is kept after the
first swapout, so THP swapin and swapout are used in the remaining
rounds.  This shows the key benefit of swapping a THP out and in as
one piece: the THP is kept instead of being split.  The meminfo
numbers confirm this: in the base kernel only 4.5% of anonymous pages
are THP during the test, while in the optimized kernel that is 96.6%.
The TLB flushing IPIs (represented as
interrupts.CAL:Function_call_interrupts) were reduced by 95.6%, while
the cycles spent in spinlocks dropped from 13.9% to 0.1%.  These are
performance benefits of THP swapout/swapin too.
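
For readers who want to re-derive the summary figures from the table,
here is a minimal sketch in Python.  It assumes that vmstat.swap.si
and vmstat.swap.so are reported in KiB/s, that the quoted swap
throughput is the sum of the two, and that the meminfo values share
one unit.  The dictionary keys and function names are made up for
this illustration; only the numbers come from the table above.

```python
# Raw figures copied from the result table above.
BASE = {"throughput": 1417897, "si": 1020489, "so": 1255093,
        "anon_huge": 1259769, "anon": 28021761}
OPTIMIZED = {"throughput": 15494673, "si": 12156349, "so": 13056114,
             "anon_huge": 24166779, "anon": 25018848}

KIB_PER_GIB = 1024 * 1024


def swap_gib_per_s(row):
    """Combined swapin + swapout rate, assuming si/so are in KiB/s."""
    return (row["si"] + row["so"]) / KIB_PER_GIB


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def thp_share(row):
    """Percentage of anonymous memory that is backed by THP."""
    return row["anon_huge"] / row["anon"] * 100.0


if __name__ == "__main__":
    print("score change: %+.1f%%" % pct_change(BASE["throughput"],
                                               OPTIMIZED["throughput"]))
    print("swap rate: %.2f -> %.2f GiB/s (%+.0f%%)" % (
        swap_gib_per_s(BASE), swap_gib_per_s(OPTIMIZED),
        pct_change(swap_gib_per_s(BASE), swap_gib_per_s(OPTIMIZED))))
    print("THP share: %.1f%% -> %.1f%%" % (thp_share(BASE),
                                           thp_share(OPTIMIZED)))
```

Under those assumptions the script reproduces the 992.8%, 2.17GB/s,
24.04GB/s, 1008%, 4.5% and 96.6% figures quoted above.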

Below is the description for all steps of THP swap optimization.

Recently, the performance of storage devices has improved so fast
that we cannot saturate the disk bandwidth with a single logical CPU
when swapping pages, even on a high-end server machine, because the
performance of storage devices has improved faster than that of a
single logical CPU.  And it seems that the trend will not change in
the near future.  On the other hand, THP is becoming more and more
popular because of increased memory sizes.  So it becomes necessary
to optimize THP swap performance.

The advantages of swapping a THP out and in as one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of once per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing the TLB, handling page faults, etc.  This will improve the
  performance of THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  It is particularly helpful for swapin, which is usually
  4k random IO.  This will improve the performance of THP swap too.

- It will help reduce memory fragmentation, especially when THP is
  heavily used by the applications.  THP-order pages will be freed
  up after THP swapout.

- It will improve THP utilization on systems with swap turned on,
  because khugepaged is quite slow to collapse normal pages into a
  THP.  After a THP is split during swapout, it will take quite a
  long time for the normal pages to collapse back into a THP after
  being swapped in.  High THP utilization helps the efficiency of
  page-based memory management too.
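
To give a feel for the batching argument in the first two items, the
following Python sketch counts per-unit operations (one swap slot
allocation, one IO request, one page fault, and so on) for a THP that
is split versus one that is kept whole.  It is an idealized count,
not a model of the kernel code; it assumes the 2M THP size mentioned
above and a 4k base page, and all names are illustrative.

```python
BASE_PAGE_SIZE = 4 * 1024        # 4k base page (assumed, x86_64)
THP_SIZE = 2 * 1024 * 1024       # 2M THP on x86_64


def subpages_per_thp(thp_size=THP_SIZE, page_size=BASE_PAGE_SIZE):
    """Number of base pages covered by one THP."""
    return thp_size // page_size


def per_unit_ops(n_thp, split):
    """Idealized count of operations that happen once per swapped unit.

    If the THP is split, every base page is its own unit; if it is
    kept in one piece, the whole THP is a single unit.
    """
    units_per_thp = subpages_per_thp() if split else 1
    return n_thp * units_per_thp


if __name__ == "__main__":
    n = 1024  # 2 GiB worth of THPs, an arbitrary example size
    print("units per THP when split:", subpages_per_thp())
    print("ops with split THPs:", per_unit_ops(n, split=True))
    print("ops with whole THPs:", per_unit_ops(n, split=False))
```

With these sizes a split THP turns one operation into 512, which is
where the batching advantage comes from.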

There are some concerns regarding THP swapin, mainly because the
possibly enlarged read/write IO size (for swapout/swapin) may put more
overhead on the storage device.  To deal with that, THP swapin is
turned on only when necessary.  A new sysfs interface,
/sys/kernel/mm/transparent_hugepage/swapin_enabled, is added to
configure it.  It uses "always/never/madvise" logic: it can be turned
on globally, turned off globally, or turned on only for VMAs with
MADV_HUGEPAGE, etc.
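
The "always/never/madvise" decision can be sketched in Python as
below.  This is only an illustration of the policy described above,
not the patchset's code.  The function names are hypothetical, and
the bracketed "always [madvise] never" text format is an assumption
borrowed from the existing transparent_hugepage/enabled file; the
patches themselves define what swapin_enabled actually prints.

```python
def parse_sysfs_choice(text):
    """Return the selected word from e.g. 'always [madvise] never'.

    Assumes the selected value is the one wrapped in brackets, as in
    the existing transparent_hugepage/enabled file.
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no bracketed selection in %r" % text)


def thp_swapin_allowed(mode, vma_has_madv_hugepage):
    """Decide whether a THP may be swapped in as one piece."""
    if mode == "always":
        return True                      # turned on globally
    if mode == "never":
        return False                     # turned off globally
    if mode == "madvise":
        return vma_has_madv_hugepage     # only for MADV_HUGEPAGE VMAs
    raise ValueError("unknown mode %r" % mode)


if __name__ == "__main__":
    mode = parse_sysfs_choice("always [madvise] never")
    print(mode, thp_swapin_allowed(mode, vma_has_madv_hugepage=True))
    print(mode, thp_swapin_allowed(mode, vma_has_madv_hugepage=False))
```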

Changelog
---------

v5:

- Rebased on 9/4 HEAD of mmotm/master

- Merged the swap operations implementation for the huge and the
  normal swap entries when possible

- Added more code comments to improve code readability

- Changed function parameter style to avoid Boolean parameters as
  much as possible

- Fixed a deadlock issue in do_huge_pmd_swap_page(), thanks 0-Day and sparse

v4:

- Rebased on 6/14 HEAD of mmotm/master

- Fixed one build bug and several coding style issues, Thanks Daniel Jordan

v3:

- Rebased on 5/18 HEAD of mmotm/master

- Fixed a build bug, Thanks 0-Day!

v2:

- Fixed several build bugs, Thanks 0-Day!

- Improved documentation as suggested by Randy Dunlap.

- Fixed several bugs in reading huge swap cluster