From patchwork Thu Apr 23 14:52:15 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 11505837
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, Minchan Kim, Jens Axboe, Jann Horn, David Rientjes,
 Arjun Roy, Tim Murray, Daniel Colascione, Sonny Rao, Brian Geffon,
 Shakeel Butt, John Dias, Joel Fernandes, SeongJae Park,
 Oleksandr Natalenko, Suren Baghdasaryan, Sandeep Patil, Michal Hocko,
 Johannes Weiner, Vlastimil Babka, linux-man@vger.kernel.org
Subject: [PATCH 2/2] mm: support vector address ranges for process_madvise
Date: Thu, 23 Apr 2020 07:52:15 -0700
Message-Id: <20200423145215.72666-2-minchan@kernel.org>
X-Mailer: git-send-email 2.26.1.301.g55bc3eb7cb9-goog
In-Reply-To: <20200423145215.72666-1-minchan@kernel.org>
References: <20200423145215.72666-1-minchan@kernel.org>

This patch extends process_madvise(2) to a) support vector address
ranges in a single system call and b) support those vector address ranges
for the local process as well as for an external process.

An Android app has thousands of vmas because of zygote, so calling the
syscall once per vma is a waste of CPU and power. (In a test comparing
2000 per-vma syscalls with a single vector syscall, the vector call showed
a 15% performance improvement. I think the gain would be bigger in real
use because the test ran in a very cache-friendly environment.)

Another potential use case for the vector range is to amortize the cost of
TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In the
future, we may find more use cases for other advice values, so let's make
it part of the API now, while we are introducing a new syscall. With that,
an existing madvise(2) user can switch to process_madvise(2) with its own
pid if it wants batched address-range support.

So finally, the API is as follows:

      ssize_t process_madvise(idtype_t idtype, id_t id,
                              const struct iovec *iovec, unsigned long vlen,
                              int advice, unsigned long flags);

    DESCRIPTION
      The process_madvise() system call is used to give advice or directions
      to the kernel about address ranges of an external process as well as
      of the local process. It applies the advice to the address ranges of
      the process described by iovec and vlen. The goal of such advice is to
      improve system or application performance.

      The idtype and id arguments select the target process to be advised
      as follows:

      idtype == P_PID
        select the process whose process ID matches id

      idtype == P_PIDFD
        select the process referred to by the PID file descriptor
        specified in id.
        (See pidfd_open(2) for further information.)

      The pointer iovec points to an array of iovec structures, defined in
      <sys/uio.h> as:

        struct iovec {
            void *iov_base;    /* starting address */
            size_t iov_len;    /* number of bytes to be advised */
        };

      Each iovec describes an address range beginning at iov_base and
      spanning iov_len bytes.

      The vlen argument is the number of elements in iovec.

      The advice is indicated in the advice argument. If the target process
      specified by idtype and id is external, it is currently one of the
      following:

        MADV_COLD
        MADV_PAGEOUT
        MADV_MERGEABLE
        MADV_UNMERGEABLE

      Permission to provide a hint to an external process is governed by a
      ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

      process_madvise() supports every advice value that madvise(2) has if
      the target process is in the same thread group as the calling process,
      so a user can use process_madvise(2) to extend existing madvise(2)
      usage to vector address ranges.

    RETURN VALUE
      On success, process_madvise() returns the number of bytes advised.
      This return value may be less than the total number of requested
      bytes if an error occurred. The caller should check the return value
      to determine whether a partial advice occurred.
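As a rough illustration of the RETURN VALUE rule above, a caller could map a
short byte count back onto its iovec array to find where advising stopped.
This is only a sketch: iovec_total_len() and first_unadvised() are
hypothetical helper names, not part of this patch or of any library, and the
code does not invoke the syscall itself.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: the byte count a fully successful call would
 * report, i.e. the sum of iov_len over all entries. */
static size_t iovec_total_len(const struct iovec *iov, size_t vlen)
{
	size_t total = 0;

	for (size_t i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}

/* Hypothetical helper: given the number of bytes reported as advised,
 * return the index of the first range that was not fully advised, or
 * vlen if every range was covered. *done receives the number of bytes
 * already advised inside that range (0 if none). */
static size_t first_unadvised(const struct iovec *iov, size_t vlen,
			      size_t advised, size_t *done)
{
	size_t i;

	for (i = 0; i < vlen; i++) {
		if (advised < iov[i].iov_len)
			break;
		advised -= iov[i].iov_len;
	}
	*done = (i < vlen) ? advised : 0;
	return i;
}
```

A caller would compare the syscall's return value with iovec_total_len() and,
if it is smaller, retry or report starting from the index that
first_unadvised() returns.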

Cc: David Rientjes
Cc: Arjun Roy
Cc: Tim Murray
Cc: Daniel Colascione
Cc: Sonny Rao
Cc: Brian Geffon
Cc: Shakeel Butt
Cc: John Dias
Cc: Joel Fernandes
Cc: SeongJae Park
Cc: Oleksandr Natalenko
Cc: Suren Baghdasaryan
Cc: Sandeep Patil
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc:
Signed-off-by: Minchan Kim
Signed-off-by: Minchan Kim
---
 mm/madvise.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 097506466fdc..3082d7fa64ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1195,20 +1195,39 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current, current->mm, start, len_in, behavior);
 }
 
-SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
-		size_t, len_in, int, behavior, unsigned long, flags)
+static int do_process_madvise(struct task_struct *target_task,
+		struct mm_struct *mm, struct iov_iter *iter, int behavior)
 {
-	int ret;
+	struct iovec iovec;
+	int ret = 0;
+
+	while (iov_iter_count(iter)) {
+		iovec = iov_iter_iovec(iter);
+		ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+					iovec.iov_len, behavior);
+		if (ret < 0)
+			break;
+		iov_iter_advance(iter, iovec.iov_len);
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
+		const struct iovec __user *, vec, unsigned long, vlen,
+		int, behavior, unsigned long, flags)
+{
+	ssize_t ret;
 	struct pid *pid;
 	struct task_struct *task;
 	struct mm_struct *mm;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
 
 	if (flags != 0)
 		return -EINVAL;
 
-	if (!process_madvise_behavior_valid(behavior))
-		return -EINVAL;
-
 	switch (which) {
 	case P_PID:
 		if (upid <= 0)
@@ -1236,13 +1255,27 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
 		goto put_pid;
 	}
 
+	if (task->mm != current->mm &&
+			!process_madvise_behavior_valid(behavior)) {
+		ret = -EINVAL;
+		goto release_task;
+	}
+
 	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
 	if (IS_ERR_OR_NULL(mm)) {
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		goto release_task;
 	}
 
-	ret = do_madvise(task, mm, start, len_in, behavior);
+	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+	if (ret >= 0) {
+		size_t total_len = iov_iter_count(&iter);
+
+		ret = do_process_madvise(task, mm, &iter, behavior);
+		if (ret >= 0)
+			ret = total_len - iov_iter_count(&iter);
+		kfree(iov);
+	}
 	mmput(mm);
 release_task:
 	put_task_struct(task);
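
For readers who want to experiment with the control flow outside the kernel,
the loop in do_process_madvise() and the byte-count computation in the
syscall body can be modeled in userspace roughly as follows. This is a sketch
under assumptions: advise_fn, model_advise() and advise_ranges() are
hypothetical names, the 4096-byte alignment check is only a stand-in for
do_madvise(), and a plain array index replaces the iov_iter. In this model,
as in the loop above, a failure in any range is returned as the negative
error, and the byte count is reported only when every range succeeds.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): 0 on success, negative errno
 * on failure. */
typedef int (*advise_fn)(void *start, size_t len, int behavior);

/* Hypothetical callback: rejects a start address that is not aligned to
 * an assumed 4096-byte page size, accepts everything else. */
static int model_advise(void *start, size_t len, int behavior)
{
	(void)len;
	(void)behavior;
	return ((uintptr_t)start % 4096) ? -EINVAL : 0;
}

/* Userspace model of the patch's control flow: advise each range in
 * order, stop at the first failure, and report bytes consumed only if
 * no range failed. */
static ssize_t advise_ranges(const struct iovec *iov, size_t vlen,
			     int behavior, advise_fn advise)
{
	size_t total_len = 0, remaining;
	int ret = 0;

	for (size_t i = 0; i < vlen; i++)
		total_len += iov[i].iov_len;
	remaining = total_len;

	/* Mirrors while (iov_iter_count(iter)) { ... } */
	for (size_t i = 0; i < vlen && remaining > 0; i++) {
		ret = advise(iov[i].iov_base, iov[i].iov_len, behavior);
		if (ret < 0)
			break;
		remaining -= iov[i].iov_len;	/* iov_iter_advance() */
	}

	if (ret >= 0)
		return (ssize_t)(total_len - remaining);
	return ret;
}
```

Swapping model_advise() for a callback that fails on a chosen range makes it
easy to see which return value a caller would observe for a given failure.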