From patchwork Tue Dec 11 17:37:58 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Josef Bacik <josef@toxicpanda.com>
X-Patchwork-Id: 10724299
From: Josef Bacik <josef@toxicpanda.com>
To: kernel-team@fb.com, hannes@cmpxchg.org,
        linux-kernel@vger.kernel.org, tj@kernel.org, david@fromorbit.com,
        akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org,
        linux-mm@kvack.org, riel@redhat.com, jack@suse.cz
Subject: [PATCH 0/3][V5] drop the mmap_sem when doing IO in the fault path
Date: Tue, 11 Dec 2018 12:37:58 -0500
Message-Id: <20181211173801.29535-1-josef@toxicpanda.com>
X-Mailer: git-send-email 2.14.3
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Here's the latest version, slimmed down a bit from my last submission with more
details in the changelogs as requested.

v4->v5:
- dropped the cached_page infrastructure and the addition of the
  handle_mm_fault_cacheable helper as it had no discernible effect on
  performance in my testing.
- reworked the page lock dropping logic to be its own helper, with
  comments describing how to use it.
- added more details to the changelog for the fpin patch.
- added a patch to clean up the arguments for the readahead functions for mmap as
  per Jan's suggestion.

v3->v4:
- dropped the ->page_mkwrite portion of these patches.  We don't actually see
  issues with mkwrite in production, and I kept running into corner cases where
  I missed something important.  I want to wait on that part until I have a real
  reason to do the work so I can have a solid test in place.
- completely reworked how we drop the mmap_sem in filemap_fault and cleaned it
  up a bit.  Once I started actually testing this with our horrifying reproducer
  I saw a bunch of places where we still ended up doing IO under the mmap_sem
  because I had missed a few corner cases.  Fixed this by reworking
  filemap_fault to only return RETRY once it has a completely uptodate page
  ready to be used.
- lots more testing, including production testing.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY down
  to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

-- Original message --

Now that we have proper isolation in place with cgroups2 we have started going
through and fixing the various priority inversions.  Most are gone now, but
this one is sort of weird since it's not necessarily a priority inversion that
happens within the kernel, but rather because of something userspace does.

We have giant applications that we want to protect, and parts of these giant
applications do things like watch the system state to determine how healthy the
box is for load balancing and such.  This involves running 'ps' or other such
utilities.  These utilities will often walk /proc/<pid>/whatever, and these
files can sometimes need to down_read(&task->mmap_sem).  This is not usually a
big deal, but we noticed while stress testing that our protected application
sometimes has latency spikes trying to get the mmap_sem for tasks that are in
lower priority cgroups.

This is because a pending down_write() on an rwsem essentially turns it into a
mutex: even if it is currently held for reading, no new readers are let in, to
keep from starving the writer.  This is fine, except that a lower priority task
could be holding the mmap_sem for read while stuck doing IO, because it has
been throttled to the point that its IO is taking much longer than normal.
Because a higher priority group depends on this completing, it is now stuck
behind lower priority work.
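
To make the blocking chain concrete, here is a toy, single-threaded model of a
fair reader/writer lock.  It is only a sketch and not the kernel's rwsem
implementation; every name in it (toy_rwsem, toy_try_read, and so on) is
hypothetical and exists purely for illustration.  It shows that once a writer
is queued behind a slow reader, a new reader is refused as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, no real blocking or threads. */
struct toy_rwsem {
	int  readers;          /* readers currently holding the lock */
	bool writer_active;    /* a writer currently holds the lock */
	int  writers_waiting;  /* writers queued behind the current holders */
};

/* A new reader gets in only if no writer holds or is waiting for the lock. */
bool toy_try_read(struct toy_rwsem *sem)
{
	if (sem->writer_active || sem->writers_waiting > 0)
		return false;
	sem->readers++;
	return true;
}

/* Drop one read hold. */
void toy_read_unlock(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer takes the lock if it is free, otherwise it queues.
 * Returns true if the lock was acquired immediately. */
bool toy_write_or_queue(struct toy_rwsem *sem)
{
	if (sem->readers == 0 && !sem->writer_active) {
		sem->writer_active = true;
		return true;
	}
	sem->writers_waiting++;
	return false;
}
```

In this model, a throttled task that took the lock with toy_try_read() and is
slow to call toy_read_unlock() keeps the queued writer waiting, and the queued
writer in turn makes toy_try_read() fail for everyone else, including a high
priority reader.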

In order to avoid this particular priority inversion we want to use the existing
retry mechanism to avoid holding the mmap_sem at all while we do IO.  This
already exists, sort of, in the read case, but it needed to be extended to cover
more than just grabbing the page lock.  With io.latency we throttle at
submit_bio() time, so the readahead stuff can block and even page_cache_read can
block, so all these paths need to have the mmap_sem dropped.
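
The drop-and-retry idea can be sketched with a toy userspace model.  This is
not the code from the patches: the types and functions below (toy_mm,
toy_filemap_fault, toy_handle_fault, and so on) are hypothetical stand-ins, and
the lock is just a flag.  The point it illustrates is that the IO runs only
after the lock has been dropped, and that the fault is retried once the page is
up to date.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, the "lock" is a plain flag. */
enum toy_fault_result { TOY_FAULT_DONE, TOY_FAULT_RETRY };

struct toy_mm   { bool mmap_sem_held; };
struct toy_page { bool uptodate; bool io_ran_under_lock; };

/* Stand-in for blocking IO; records whether the lock was held meanwhile. */
void toy_read_page(struct toy_mm *mm, struct toy_page *page)
{
	if (mm->mmap_sem_held)
		page->io_ran_under_lock = true;
	page->uptodate = true;
}

/* Called with the lock held.  Returns DONE with the lock still held, or
 * RETRY with the lock dropped and the page brought up to date. */
enum toy_fault_result toy_filemap_fault(struct toy_mm *mm,
					struct toy_page *page)
{
	assert(mm->mmap_sem_held);
	if (page->uptodate)
		return TOY_FAULT_DONE;

	mm->mmap_sem_held = false;	/* drop the lock before any IO */
	toy_read_page(mm, page);
	return TOY_FAULT_RETRY;
}

/* Caller side: retake the lock and retry.  Returns the number of passes. */
int toy_handle_fault(struct toy_mm *mm, struct toy_page *page)
{
	int passes = 0;

	for (;;) {
		mm->mmap_sem_held = true;	/* down_read() stand-in */
		passes++;
		if (toy_filemap_fault(mm, page) == TOY_FAULT_DONE)
			break;
	}
	mm->mmap_sem_held = false;		/* up_read() stand-in */
	return passes;
}
```

For a page that is not yet up to date, this model takes two passes and never
performs the read with the lock flag set; for a page that is already up to
date, it finishes in one pass.  The real fault path has more constraints than
this loop, for example limits on how often a retry is allowed.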

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and simply
cache the page and verify the page is still set up properly on the next pass
through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.  Let me
know what you think.  Thanks,

Josef