From patchwork Thu Jan 13 23:39:36 2022
X-Patchwork-Submitter: Peter Oskolkov
X-Patchwork-Id: 12713190
Date: Thu, 13 Jan 2022 15:39:36 -0800
In-Reply-To: <20220113233940.3608440-1-posk@google.com>
Message-Id: <20220113233940.3608440-2-posk@google.com>
References: <20220113233940.3608440-1-posk@google.com>
Subject: [RFC PATCH v2 1/5] sched/umcg: add WF_CURRENT_CPU and externise ttwu
From: Peter Oskolkov

Add a WF_CURRENT_CPU wake flag that advises the scheduler to move the
wakee to the current CPU. This is useful for fast on-CPU context-switching
use cases such as UMCG.

In addition, make ttwu() external rather than static so that the flag can
be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20211122211327.5931-2-posk@google.com
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 15 +++++++++------
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83872f95a1ea..04525933de94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3980,8 +3980,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  * %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..4b70cf8f1ec3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6838,6 +6838,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);

+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..77f67d09b946 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }

 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC		0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK		0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU		0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+#define WF_EXEC			0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK			0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU			0x08 /* Wakeup; maps to SD_BALANCE_WAKE */

-#define WF_SYNC		0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED	0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU	0x40 /* Wakee is on_cpu */
+#define WF_SYNC			0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED		0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU		0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU		0x80 /* Prefer to move the wakee to the current CPU.
+					 */

 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3112,6 +3113,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);

From patchwork Thu Jan 13 23:39:37 2022
X-Patchwork-Submitter: Peter Oskolkov
X-Patchwork-Id: 12713191
Date: Thu, 13 Jan 2022 15:39:37 -0800
In-Reply-To: <20220113233940.3608440-1-posk@google.com>
Message-Id: <20220113233940.3608440-3-posk@google.com>
References: <20220113233940.3608440-1-posk@google.com>
Subject: [RFC PATCH v2 2/5] x86/uaccess: Implement unsafe_try_cmpxchg_user()
From: Peter Oskolkov

From: Peter Zijlstra

Do try_cmpxchg() loops on userspace addresses.
Signed-off-by: Peter Zijlstra (Intel)
---
 arch/x86/include/asm/uaccess.h | 57 ++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ac96f9b2d64b..8277ec05be02 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -342,6 +342,24 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)

+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT

 #ifdef CONFIG_X86_32
@@ -407,6 +425,30 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       "0" (err))

+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG,	\
+					   %[errout])			\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT

 /* FIXME: this hack is definitely wrong -AK */
@@ -501,6 +543,21 @@ do {									\
 } while (0)
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT

+extern void __try_cmpxchg_user_wrong_size(void);
+
+#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label)	({	\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
 /*
  * We want the unsafe accessors to always be inlined and use
  * the error labels - thus the macro games.

From patchwork Thu Jan 13 23:39:38 2022
X-Patchwork-Submitter: Peter Oskolkov
X-Patchwork-Id: 12713192
Date: Thu, 13 Jan 2022 15:39:38 -0800
In-Reply-To: <20220113233940.3608440-1-posk@google.com>
Message-Id: <20220113233940.3608440-4-posk@google.com>
References: <20220113233940.3608440-1-posk@google.com>
Subject: [RFC PATCH v2 3/5] sched: User Mode Concurrency Groups
From: Peter Oskolkov

From: Peter Zijlstra

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU utilization
(95%+).

XXX moar changelog explaining how this is moar awesome than traditional
user-space threading.

The big thing that's missing is the SMP wake-to-remote-idle.

The big assumption this whole thing is built on is that pin_user_pages()
preserves user mappings insofar as pagefault_disable() will never generate
EFAULT (unless the user does munmap(), in which case it can keep the
pieces).

shrink_page_list() does page_maybe_dma_pinned() before try_to_unmap() and
as such seems to respect this constraint.

unmap_and_move() however seems willing to unmap otherwise pinned (and
hence unmigratable) pages.
This might need fixing.

Originally-by: Peter Oskolkov
Signed-off-by: Peter Zijlstra (Intel)
---
 arch/x86/Kconfig                       |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   3 +
 arch/x86/include/asm/thread_info.h     |   2 +
 fs/exec.c                              |   1 +
 include/linux/entry-common.h           |   6 +
 include/linux/sched.h                  |  86 +++
 include/linux/syscalls.h               |   4 +
 include/linux/thread_info.h            |   2 +
 include/uapi/asm-generic/unistd.h      |   9 +-
 include/uapi/linux/umcg.h              | 143 ++++
 init/Kconfig                           |  15 +
 kernel/entry/common.c                  |  18 +-
 kernel/exit.c                          |   5 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |   9 +-
 kernel/sched/umcg.c                    | 868 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   5 +
 17 files changed, 1171 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 976dd6b532bf..34a398f5a57b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -248,6 +248,7 @@ config X86
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UMCG if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT if SMP

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..3d96af7e67cc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,9 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait
+452	common	umcg_kick		sys_umcg_kick

 #
 # Due to a historical design error, certain syscalls are numbered differently

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ebec69c35e95..f480e43c8bdf 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_UMCG		6	/* UMCG return to user hook */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_UMCG		(1 << TIF_UMCG)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)

diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..1749f0f74fed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..6318b0461cd2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -22,6 +22,10 @@
 # define _TIF_UPROBE			(0)
 #endif

+#ifndef _TIF_UMCG
+# define _TIF_UMCG			(0)
+#endif
+
 /*
  * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
  */
@@ -42,11 +46,13 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP |	\
 				 ARCH_SYSCALL_WORK_EXIT)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0cd3d9c2e864..6172594282ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;

 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,23 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif

+#ifdef CONFIG_UMCG
+	/* setup by sys_umcg_ctrl() */
+	clockid_t		umcg_clock;
+	struct umcg_task __user	*umcg_task;
+
+	/* setup by umcg_pin_enter() */
+	struct page		*umcg_worker_page;
+
+	struct task_struct	*umcg_server;
+	struct umcg_task __user	*umcg_server_task;
+	struct page		*umcg_server_page;
+
+	struct task_struct	*umcg_next;
+	struct umcg_task __user	*umcg_next_task;
+	struct page		*umcg_next_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	union {
@@ -1687,6 +1705,13 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2296,6 +2321,67 @@ static inline void rseq_execve(struct task_struct *t)

 #endif

+#ifdef CONFIG_UMCG
+
+extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
+extern void umcg_sys_exit(struct pt_regs *regs);
+extern void umcg_notify_resume(struct pt_regs *regs);
+extern void umcg_worker_exit(void);
+extern void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by do_exit() in kernel/exit.c.
+ */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_worker_exit();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
+extern void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else	/* CONFIG_UMCG */
+
+static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+}
+
+static inline void umcg_sys_exit(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_notify_resume(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
+#endif
+
 #ifdef CONFIG_DEBUG_RSEQ

 void rseq_syscall(struct pt_regs *regs);

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..3ba0af15e223 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;

 #include
 #include
@@ -1057,6 +1058,9 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
+asmlinkage long sys_umcg_kick(u32 flags, pid_t tid);

 /*
  * Architecture-specific system calls

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 73a6f34b3847..8fdc4a1fa9a5 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_UMCG,
 };

 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +56,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_UMCG	BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
 #endif

 #include

diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..495949af981e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,15 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)

+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+#define __NR_umcg_kick 452
+__SYSCALL(__NR_umcg_kick, sys_umcg_kick)
+
 #undef __NR_syscalls
-#define __NR_syscalls 450
+#define __NR_syscalls 453

 /*
  * 32 bit systems traditionally used different

diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..a994bbb062d5
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch;
+ *      sys_umcg_kick() - prod a UMCG task.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state.
+ * The states represent the user space point of view.
+ *
+ *   ,--------(TF_PREEMPT + notify_resume)-------.   ,------------.
+ *   |                                           v   |            |
+ * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
+ *   ^                                           |   ^            |
+ *   `--------------(sys_umcg_wait)--------------'   `------------'
+ *
+ */
+#define UMCG_TASK_NONE			0x0000U
+#define UMCG_TASK_RUNNING		0x0001U
+#define UMCG_TASK_RUNNABLE		0x0002U
+#define UMCG_TASK_BLOCKED		0x0003U
+
+#define UMCG_TASK_MASK			0x00ffU
+
+/*
+ * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
+ *
+ * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
+ * return-to-user (eg sys_umcg_kick()) will perform the equivalent of
+ * sys_umcg_wait() on it. That is, it will wake next_tid/server_tid, transfer
+ * to RUNNABLE and enqueue on the server's runnable list.
+ */
+#define UMCG_TF_PREEMPT			0x0100U
+/*
+ * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait()
+ *
+ * Enables server loops like (vs umcg_sys_exit()):
+ *
+ *   for (;;) {
+ *      self->status = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
+ *      // smp_mb() implied by xchg()
+ *
+ *      runnable_ptr = xchg(self->runnable_workers_ptr, NULL);
+ *      while (runnable_ptr) {
+ *              next = runnable_ptr->runnable_workers_ptr;
+ *
+ *              umcg_server_add_runnable(self, runnable_ptr);
+ *
+ *              runnable_ptr = next;
+ *      }
+ *
+ *      self->next = umcg_server_pick_next(self);
+ *      sys_umcg_wait(0, 0);
+ *   }
+ *
+ * without a signal or interrupt in between setting umcg_task::state and
+ * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume().
+ */
+#define UMCG_TF_COND_WAIT		0x0200U
+
+#define UMCG_TF_MASK			0xff00U
+
+#define UMCG_TASK_ALIGN			64
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state: the current state of the UMCG task described by
+	 *         this struct.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 31: for userspace use;
+	 */
+	__u32	state;				/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero, in which case
+	 *            it'll switch to server_tid.
+	 *
+	 * @server_tid: the TID of the UMCG server that hosts this task;
+	 *              when RUNNABLE this task will get added to its
+	 *              runnable_workers_ptr list.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;			/* r   */
+	__u32	server_tid;			/* r   */
+
+	__u32	__hole[1];
+
+	/*
+	 * Timestamps of when we last became BLOCKED and RUNNABLE, in the
+	 * clock selected at registration time (see sys_umcg_ctl()).
+	 */
+	__u64	blocked_ts;			/* w   */
+	__u64	runnable_ts;			/* w   */
+
+	/**
+	 * @runnable_workers_ptr: a single-linked list of runnable workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	runnable_workers_ptr;		/* r/w */
+
+	__u64	__zero[3];
+
+} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 41a728debdbd..15d1e330fdb9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1686,6 +1686,21 @@ config MEMBARRIER
 	  If unsure, say Y.
+config HAVE_UMCG + bool + +config UMCG + bool "Enable User Managed Concurrency Groups API" + depends on 64BIT + depends on GENERIC_ENTRY + depends on HAVE_UMCG + default n + help + Enable User Managed Concurrency Groups API, which form the basis + for an in-process M:N userspace scheduling framework. + At the moment this is an experimental/RFC feature that is not + guaranteed to be backward-compatible. + config KALLSYMS bool "Load all symbols for debugging/ksymoops" if EXPERT default y diff --git a/kernel/entry/common.c b/kernel/entry/common.c index bad713684c2e..7d7bd5c300b1 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "common.h" @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall, if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) trace_sys_enter(regs, syscall); + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_enter(regs, syscall); + syscall_enter_audit(regs, syscall); return ret ? : syscall; @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, * Before returning to user space ensure that all pending work * items have been completed. 
*/ - while (ti_work & EXIT_TO_USER_MODE_WORK) { - + do { local_irq_enable_exit_to_user(ti_work); if (ti_work & _TIF_NEED_RESCHED) @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, if (ti_work & _TIF_PATCH_PENDING) klp_update_patch_state(current); + /* must be before handle_signal_work(); terminates on sigpending */ + if (ti_work & _TIF_UMCG) + umcg_notify_resume(regs); + if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) handle_signal_work(regs, ti_work); @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, tick_nohz_user_enter_prepare(); ti_work = read_thread_flags(); - } + } while (ti_work & EXIT_TO_USER_MODE_WORK); /* Return the latest work state for arch_exit_to_user_mode() */ return ti_work; @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs) /* Flush pending rcuog wakeup before the last need_resched() check */ tick_nohz_user_enter_prepare(); - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) + if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG))) ti_work = exit_to_user_mode_loop(regs, ti_work); arch_exit_to_user_mode_prepare(regs, ti_work); @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work) step = report_single_step(work); if (step || work & SYSCALL_WORK_SYSCALL_TRACE) arch_syscall_exit_tracehook(regs, step); + + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_exit(regs); } /* diff --git a/kernel/exit.c b/kernel/exit.c index f702a6a63686..4bdd51c75aee 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -749,6 +749,10 @@ void __noreturn do_exit(long code) if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); + /* Turn off UMCG sched hooks. 
*/ + if (unlikely(tsk->flags & PF_UMCG_WORKER)) + tsk->flags &= ~PF_UMCG_WORKER; + /* * If do_exit is called because this processes oopsed, it's possible * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before @@ -786,6 +790,7 @@ void __noreturn do_exit(long code) io_uring_files_cancel(); exit_signals(tsk); /* sets PF_EXITING */ + umcg_handle_exit(); /* sync mm's RSS info before statistics gathering */ if (tsk->mm) diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index c7421f2d05e1..c03eea9bc738 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_CPU_ISOLATION) += isolation.o obj-$(CONFIG_PSI) += psi.o obj-$(CONFIG_SCHED_CORE) += core_sched.o +obj-$(CONFIG_UMCG) += umcg.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 04525933de94..6e10080dc25a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4272,6 +4272,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; #endif + umcg_clear_child(p); } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -6330,9 +6331,11 @@ static inline void sched_submit_work(struct task_struct *tsk) * If a worker goes to sleep, notify and ask workqueue whether it * wants to wake up a task to maintain concurrency. 
*/ - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (task_flags & PF_WQ_WORKER) wq_worker_sleeping(tsk); + else if (task_flags & PF_UMCG_WORKER) + umcg_wq_worker_sleeping(tsk); else io_wq_worker_sleeping(tsk); } @@ -6350,9 +6353,11 @@ static inline void sched_submit_work(struct task_struct *tsk) static void sched_update_worker(struct task_struct *tsk) { - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (tsk->flags & PF_WQ_WORKER) wq_worker_running(tsk); + else if (tsk->flags & PF_UMCG_WORKER) + umcg_wq_worker_running(tsk); else io_wq_worker_running(tsk); } diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c new file mode 100644 index 000000000000..9a8755045285 --- /dev/null +++ b/kernel/sched/umcg.c @@ -0,0 +1,868 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * User Managed Concurrency Groups (UMCG). + * + */ + +#include +#include +#include +#include + +#include + +#include "sched.h" + +static struct task_struct *umcg_get_task(u32 tid) +{ + struct task_struct *tsk = NULL; + + if (tid) { + rcu_read_lock(); + tsk = find_task_by_vpid(tid); + if (tsk && current->mm == tsk->mm && tsk->umcg_task) + get_task_struct(tsk); + else + tsk = NULL; + rcu_read_unlock(); + } + + return tsk; +} + +/** + * umcg_pin_pages: pin pages containing struct umcg_task of + * this task, its server (possibly this task again) + * and the next (possibly NULL). 
+ */ +static int umcg_pin_pages(void) +{ + struct task_struct *server = NULL, *next = NULL, *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + int server_tid, next_tid; + int ret; + + /* must not have stale state */ + if (WARN_ON_ONCE(tsk->umcg_worker_page || + tsk->umcg_server_page || + tsk->umcg_next_page || + tsk->umcg_server_task || + tsk->umcg_next_task || + tsk->umcg_server || + tsk->umcg_next)) + return -EBUSY; + + ret = -EFAULT; + if (pin_user_pages_fast((unsigned long)self, 1, 0, + &tsk->umcg_worker_page) != 1) + goto clear_self; + + if (get_user(server_tid, &self->server_tid)) + goto unpin_self; + + ret = -ESRCH; + server = umcg_get_task(server_tid); + if (!server) + goto unpin_self; + + ret = -EFAULT; + /* must cache due to possible concurrent change vs access_ok() */ + tsk->umcg_server_task = READ_ONCE(server->umcg_task); + if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0, + &tsk->umcg_server_page) != 1) + goto clear_server; + + tsk->umcg_server = server; + + if (get_user(next_tid, &self->next_tid)) + goto unpin_server; + + if (!next_tid) + goto done; + + ret = -ESRCH; + next = umcg_get_task(next_tid); + if (!next) + goto unpin_server; + + ret = -EFAULT; + tsk->umcg_next_task = READ_ONCE(next->umcg_task); + if (pin_user_pages_fast((unsigned long)tsk->umcg_next_task, 1, 0, + &tsk->umcg_next_page) != 1) + goto clear_next; + + tsk->umcg_next = next; + +done: + return 0; + +clear_next: + tsk->umcg_next_task = NULL; + tsk->umcg_next_page = NULL; + +unpin_server: + unpin_user_page(tsk->umcg_server_page); + +clear_server: + tsk->umcg_server_task = NULL; + tsk->umcg_server_page = NULL; + +unpin_self: + unpin_user_page(tsk->umcg_worker_page); +clear_self: + tsk->umcg_worker_page = NULL; + + return ret; +} + +static void umcg_unpin_pages(void) +{ + struct task_struct *tsk = current; + + if (tsk->umcg_server) { + unpin_user_page(tsk->umcg_worker_page); + tsk->umcg_worker_page = NULL; + + 
unpin_user_page(tsk->umcg_server_page); + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + put_task_struct(tsk->umcg_server); + tsk->umcg_server = NULL; + + if (tsk->umcg_next) { + unpin_user_page(tsk->umcg_next_page); + tsk->umcg_next_page = NULL; + tsk->umcg_next_task = NULL; + + put_task_struct(tsk->umcg_next); + tsk->umcg_next = NULL; + } + } +} + +static void umcg_clear_task(struct task_struct *tsk) +{ + /* + * This is either called for the current task, or for a newly forked + * task that is not yet running, so we don't need strict atomicity + * below. + */ + if (tsk->umcg_task) { + WRITE_ONCE(tsk->umcg_task, NULL); + tsk->umcg_worker_page = NULL; + + tsk->umcg_server = NULL; + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + tsk->umcg_next = NULL; + tsk->umcg_next_page = NULL; + tsk->umcg_next_task = NULL; + + tsk->flags &= ~PF_UMCG_WORKER; + clear_task_syscall_work(tsk, SYSCALL_UMCG); + clear_tsk_thread_flag(tsk, TIF_UMCG); + } +} + +/* Called for a forked or execve-ed child. */ +void umcg_clear_child(struct task_struct *tsk) +{ + umcg_clear_task(tsk); +} + +/* Called both by normally (unregister) and abnormally exiting workers. */ +void umcg_worker_exit(void) +{ + umcg_unpin_pages(); + umcg_clear_task(current); +} + +/* + * Do a state transition: @from -> @to. + * + * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT. + * + * When @to == {BLOCKED,RUNNABLE}, update timestamps. 
+ * + * Returns: + * 0: success + * -EAGAIN: when self->state != @from + * -EFAULT + */ +static int umcg_update_state(struct task_struct *tsk, + struct umcg_task __user *self, + u32 from, u32 to) +{ + u32 old, new; + u64 now; + + if (to >= UMCG_TASK_RUNNABLE) { + switch (tsk->umcg_clock) { + case CLOCK_REALTIME: now = ktime_get_real_ns(); break; + case CLOCK_MONOTONIC: now = ktime_get_ns(); break; + case CLOCK_BOOTTIME: now = ktime_get_boottime_ns(); break; + case CLOCK_TAI: now = ktime_get_clocktai_ns(); break; + } + } + + if (!user_access_begin(self, sizeof(*self))) + return -EFAULT; + + unsafe_get_user(old, &self->state, Efault); + do { + if ((old & UMCG_TASK_MASK) != from) + goto fail; + + new = old & ~(UMCG_TASK_MASK | + UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT); + new |= to & UMCG_TASK_MASK; + + } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault)); + + if (to == UMCG_TASK_BLOCKED) + unsafe_put_user(now, &self->blocked_ts, Efault); + if (to == UMCG_TASK_RUNNABLE) + unsafe_put_user(now, &self->runnable_ts, Efault); + + user_access_end(); + return 0; + +fail: + user_access_end(); + return -EAGAIN; + +Efault: + user_access_end(); + return -EFAULT; +} + +#define __UMCG_DIE(stmt, reason) do { \ + stmt; \ + pr_warn_ratelimited("%s: killing task %s/%d because: " reason "\n",\ + __func__, current->comm, current->pid); \ + force_sig(SIGKILL); \ + return; \ +} while (0) + +#define UMCG_DIE(reason) __UMCG_DIE(,reason) +#define UMCG_DIE_PF(reason) __UMCG_DIE(pagefault_enable(), reason) +#define UMCG_DIE_UNPIN(reason) __UMCG_DIE(umcg_unpin_pages(), reason) + +/* Called from syscall enter path */ +void umcg_sys_enter(struct pt_regs *regs, long syscall) +{ + /* avoid recursion vs our own syscalls */ + if (syscall == __NR_umcg_wait || + syscall == __NR_umcg_ctl) + return; + + /* avoid recursion vs schedule() */ + current->flags &= ~PF_UMCG_WORKER; + + /* + * Pin all the state on sys_enter() such that we can rely on it + * from dodgy contexts. 
It is either unpinned from pre-schedule() + * or sys_exit(), whichever comes first, thereby ensuring the pin + * is temporary. + */ + if (umcg_pin_pages()) + UMCG_DIE("pin"); + + current->flags |= PF_UMCG_WORKER; +} + +static int umcg_wake_task(struct task_struct *tsk, struct umcg_task __user *self) +{ + int ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); + if (ret) + return ret; + + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU); + return 0; +} + +static int umcg_wake_next(struct task_struct *tsk) +{ + int ret = umcg_wake_task(tsk->umcg_next, tsk->umcg_next_task); + if (ret) + return ret; + + /* + * If userspace sets umcg_task::next_tid, it needs to remove + * that task from the ready-queue to avoid another server + * selecting it. However, that also means it needs to put it + * back in case it went unused. + * + * By clearing the field on use, userspace can detect this case + * and DTRT. + */ + if (put_user(0u, &tsk->umcg_task->next_tid)) + return -EFAULT; + + return 0; +} + +static int umcg_wake_server(struct task_struct *tsk) +{ + int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); + switch (ret) { + case 0: + case -EAGAIN: + /* + * Server could have timed-out or already be running + * due to a runnable enqueue. See umcg_sys_exit(). + */ + break; + + default: + return ret; + } + + return 0; +} + +/* + * Wake ::next_tid or ::server_tid. + * + * Must be called in umcg_pin_pages() context, relies on + * tsk->umcg_{server,next}. + * + * Returns: + * 0: success + * -EAGAIN + * -EFAULT + */ +static int umcg_wake(struct task_struct *tsk) +{ + if (tsk->umcg_next) + return umcg_wake_next(tsk); + + return umcg_wake_server(tsk); +} + +/* pre-schedule() */ +void umcg_wq_worker_sleeping(struct task_struct *tsk) +{ + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + + /* Must not fault, mmap_sem might be held. 
 */
+	pagefault_disable();
+
+	if (WARN_ON_ONCE(!tsk->umcg_server))
+		UMCG_DIE_PF("no server");
+
+	if (umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED))
+		UMCG_DIE_PF("state");
+
+	if (umcg_wake(tsk))
+		UMCG_DIE_PF("wake");
+
+	pagefault_enable();
+
+	/*
+	 * We're going to sleep, make sure to unpin the pages, this ensures
+	 * the pins are temporary. Also see umcg_sys_exit().
+	 */
+	umcg_unpin_pages();
+}
+
+/* post-schedule() */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	/* nothing here, see umcg_sys_exit() */
+}
+
+/*
+ * Enqueue @tsk on its server's runnable list.
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg-based single-linked list add such that list integrity is never
+ * violated. Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to RUNNABLE before enqueue.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_runnable(struct task_struct *tsk)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+ */ + if (!user_access_begin(self, sizeof(*self))) + return -EFAULT; + + unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault); + do { + unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault); + } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault)); + + user_access_end(); + return 0; + +Efault: + user_access_end(); + return -EFAULT; +} + +/* + * umcg_wait: Wait for ->state to become RUNNING + * + * Returns: + * 0 - success + * -EINTR - pending signal + * -EINVAL - ::state is not {RUNNABLE,RUNNING} + * -ETIMEDOUT + * -EFAULT + */ +int umcg_wait(u64 timo) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = tsk->umcg_task; + struct page *page = NULL; + u32 state; + int ret; + + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + + ret = -EINTR; + if (signal_pending(current)) + break; + + /* + * Faults can block and scribble our wait state. + */ + pagefault_disable(); + if (get_user(state, &self->state)) { + pagefault_enable(); + + ret = -EFAULT; + if (page) { + unpin_user_page(page); + page = NULL; + break; + } + + if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) { + page = NULL; + break; + } + + continue; + } + + if (page) { + unpin_user_page(page); + page = NULL; + } + pagefault_enable(); + + state &= UMCG_TASK_MASK; + if (state != UMCG_TASK_RUNNABLE) { + ret = 0; + if (state == UMCG_TASK_RUNNING) + break; + + ret = -EINVAL; + break; + } + + if (!schedule_hrtimeout_range_clock(timo ? 
&timo : NULL,
+						    tsk->timer_slack_ns,
+						    HRTIMER_MODE_ABS,
+						    tsk->umcg_clock)) {
+			ret = -ETIMEDOUT;
+			break;
+		}
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return ret;
+}
+
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = READ_ONCE(tsk->umcg_task);
+	long syscall = syscall_get_nr(tsk, regs);
+
+	if (syscall == __NR_umcg_wait)
+		return;
+
+	/*
+	 * sys_umcg_ctl() will get here without having called umcg_sys_enter();
+	 * as such it will look like a syscall that blocked.
+	 */
+
+	if (tsk->umcg_server) {
+		/*
+		 * Didn't block, we're done.
+		 */
+		umcg_unpin_pages();
+		return;
+	}
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		UMCG_DIE("pin");
+
+	if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE))
+		UMCG_DIE_UNPIN("state");
+
+	if (umcg_enqueue_runnable(tsk))
+		UMCG_DIE_UNPIN("enqueue");
+
+	/* Server might not be RUNNABLE, which means it's already running */
+	if (umcg_wake_server(tsk))
+		UMCG_DIE_UNPIN("wake-server");
+
+	umcg_unpin_pages();
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+	default:
+		UMCG_DIE("wait");
+
+	case 0:
+	case -EINTR:
+		/* notify_resume will continue the wait after the signal */
+		break;
+	}
+
+	current->flags |= PF_UMCG_WORKER;
+}
+
+void umcg_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	u32 state;
+
+	/* avoid recursion vs schedule() */
+	if (worker)
+		current->flags &= ~PF_UMCG_WORKER;
+
+	if (get_user(state, &self->state))
+		UMCG_DIE("get-state");
+
+	state &= UMCG_TASK_MASK | UMCG_TF_MASK;
+	if (state == UMCG_TASK_RUNNING)
+		goto done;
+
+	/*
+	 * See comment at UMCG_TF_COND_WAIT; TL;DR: the user *will* call
+	 * sys_umcg_wait() and signals/interrupts shouldn't block
+	 * return-to-user.
+ */
+	if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT))
+		goto done;
+
+	if (state & UMCG_TF_PREEMPT) {
+		if (umcg_pin_pages())
+			UMCG_DIE("pin");
+
+		if (umcg_update_state(tsk, self,
+				      UMCG_TASK_RUNNING,
+				      UMCG_TASK_RUNNABLE))
+			UMCG_DIE_UNPIN("state");
+
+		if (umcg_enqueue_runnable(tsk))
+			UMCG_DIE_UNPIN("enqueue");
+
+		/*
+		 * XXX do we want a preemption consuming ::next_tid ?
+		 * I'm currently leaning towards no.
+		 */
+		if (umcg_wake_server(tsk))
+			UMCG_DIE_UNPIN("wake-server");
+
+		umcg_unpin_pages();
+	}
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+	default:
+		UMCG_DIE("wait");
+
+	case 0:
+	case -EINTR:
+		/* we'll resume the wait after the signal */
+		break;
+	}
+
+done:
+	if (worker)
+		current->flags |= PF_UMCG_WORKER;
+}
+
+/**
+ * sys_umcg_kick: makes a UMCG task cycle through umcg_notify_resume()
+ *
+ * Returns:
+ * 0		- OK;
+ * -ESRCH	- not a related UMCG task;
+ * -EINVAL	- another error happened (unknown flags, etc.)
+ */
+SYSCALL_DEFINE2(umcg_kick, u32, flags, pid_t, tid)
+{
+	struct task_struct *task = umcg_get_task(tid);
+	int ret = 0;
+
+	if (!task)
+		return -ESRCH;
+
+	if (flags) {
+		ret = -EINVAL;
+		goto put;
+	}
+
+#ifdef CONFIG_SMP
+	smp_send_reschedule(task_cpu(task));
+#endif
+
+put:
+	/* drop the reference taken by umcg_get_task() */
+	put_task_struct(task);
+	return ret;
+}
+
+/**
+ * sys_umcg_wait: transfer running context
+ *
+ * Block until RUNNING. Userspace must already set RUNNABLE to deal with the
+ * sleep condition races (see TF_COND_WAIT).
+ *
+ * Will wake either ::next_tid or ::server_tid to take our place. If this is a
+ * server then not setting ::next_tid will wake self.
+ *
+ * Returns:
+ * 0		- OK;
+ * -ETIMEDOUT	- the timeout expired;
+ * -ERANGE	- the timeout is out of range (worker);
+ * -EAGAIN	- ::state wasn't RUNNABLE, concurrent wakeup;
+ * -EFAULT	- failed accessing struct umcg_task __user of the current
+ *		  task, the server or next;
+ * -ESRCH	- the task to wake not found or not a UMCG task;
+ * -EINVAL	- another error happened (e.g.
the current task is not a + * UMCG task, etc.) + */ +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + bool worker = tsk->flags & PF_UMCG_WORKER; + int ret; + + if (!self || flags) + return -EINVAL; + + if (worker) { + tsk->flags &= ~PF_UMCG_WORKER; + if (timo) + return -ERANGE; + } + + /* see umcg_sys_{enter,exit}() syscall exceptions */ + ret = umcg_pin_pages(); + if (ret) + goto unblock; + + /* + * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE. + */ + ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE); + if (ret) + goto unpin; + + if (worker) { + ret = umcg_enqueue_runnable(tsk); + if (ret) + goto unpin; + } + + if (worker) + ret = umcg_wake(tsk); + else if (tsk->umcg_next) + ret = umcg_wake_next(tsk); + + if (ret) { + /* + * XXX already enqueued ourself on ::server_tid; failing now + * leaves the lot in an inconsistent state since it'll also + * unblock self in order to return the error. !?!? + */ + goto unpin; + } + + umcg_unpin_pages(); + + ret = umcg_wait(timo); + switch (ret) { + case 0: /* all done */ + case -EINTR: /* umcg_notify_resume() will continue the wait */ + ret = 0; + break; + + default: + goto unblock; + } +out: + if (worker) + tsk->flags |= PF_UMCG_WORKER; + return ret; + +unpin: + umcg_unpin_pages(); +unblock: + /* + * Workers will still block in umcg_notify_resume() before they can + * consume their error, servers however need to get the error asap. + * + * Still, things might be unrecoverably screwy after this. Not our + * problem. + */ + if (!worker) + umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); + goto out; +} + +/** + * sys_umcg_ctl: (un)register the current task as a UMCG task. 
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->server_tid must be a valid server
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server. If the registered task is a
+ *         worker, it will be added to its server's runnable_workers_ptr list
+ *         and the server will be woken.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ * + * Return: + * 0 - success + * -EFAULT - failed to read @self + * -EINVAL - some other error occurred + * -ESRCH - no such server_tid + */ +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock) +{ + struct task_struct *server; + struct umcg_task ut; + + if ((unsigned long)self % UMCG_TASK_ALIGN) + return -EINVAL; + + if (flags & ~(UMCG_CTL_REGISTER | + UMCG_CTL_UNREGISTER | + UMCG_CTL_WORKER)) + return -EINVAL; + + if (flags == UMCG_CTL_UNREGISTER) { + if (self || !current->umcg_task) + return -EINVAL; + + if (current->flags & PF_UMCG_WORKER) + umcg_worker_exit(); + else + umcg_clear_task(current); + + return 0; + } + + if (!(flags & UMCG_CTL_REGISTER)) + return -EINVAL; + + flags &= ~UMCG_CTL_REGISTER; + + switch (which_clock) { + case CLOCK_REALTIME: + case CLOCK_MONOTONIC: + case CLOCK_BOOTTIME: + case CLOCK_TAI: + current->umcg_clock = which_clock; + break; + + default: + return -EINVAL; + } + + if (current->umcg_task || !self) + return -EINVAL; + + if (copy_from_user(&ut, self, sizeof(ut))) + return -EFAULT; + + if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2]) + return -EINVAL; + + rcu_read_lock(); + server = find_task_by_vpid(ut.server_tid); + if (server && server->mm == current->mm) { + if (flags == UMCG_CTL_WORKER) { + if (!server->umcg_task || + (server->flags & PF_UMCG_WORKER)) + server = NULL; + } else { + if (server != current) + server = NULL; + } + } else { + server = NULL; + } + rcu_read_unlock(); + + if (!server) + return -ESRCH; + + if (flags == UMCG_CTL_WORKER) { + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED) + return -EINVAL; + + WRITE_ONCE(current->umcg_task, self); + current->flags |= PF_UMCG_WORKER; /* hook schedule() */ + set_syscall_work(SYSCALL_UMCG); /* hook syscall */ + set_thread_flag(TIF_UMCG); /* hook return-to-user */ + + /* umcg_sys_exit() will transition to RUNNABLE and wait */ + + } else { + if ((ut.state & (UMCG_TASK_MASK | 
UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		set_thread_flag(TIF_UMCG);	/* hook return-to-user */
+
+		/* umcg_notify_resume() would block if not RUNNING */
+	}
+
+	return 0;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..a4029e05129b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,11 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+COND_SYSCALL(umcg_kick);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */

From patchwork Thu Jan 13 23:39:39 2022
Date: Thu, 13 Jan 2022 15:39:39 -0800
In-Reply-To: <20220113233940.3608440-1-posk@google.com>
Message-Id:
<20220113233940.3608440-5-posk@google.com>
References: <20220113233940.3608440-1-posk@google.com>
Subject: [RFC PATCH v2 4/5] sched: UMCG: add a blocked worker list
From: Peter Oskolkov
To: Peter Zijlstra , mingo@redhat.com, tglx@linutronix.de,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
 rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-api@vger.kernel.org, x86@kernel.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca, posk@posk.io

The original idea of a UMCG server was that it was used as a proxy for
a CPU, so if a worker associated with the server is RUNNING, the server
itself was never allowed to be RUNNING as well; when umcg_wait()
returned for a server, it meant that its worker became BLOCKED.

In the new (old?) "per server runqueues" model implemented in the
previous patch in this patchset, servers are woken when a previously
blocked worker on their runqueue finishes its blocking operation,
even if the currently RUNNING worker continues running.
As a server may now run while a worker assigned to it is running, the
original idea of having at most a single RUNNING worker per server, as
a means to control the number of running workers, is no longer really
enforced: the server, woken by a worker doing a BLOCKED=>RUNNABLE
transition, may then call sys_umcg_wait() with a second/third/etc.
worker to run.

Support this scenario by adding a blocked worker list: when a worker
transitions RUNNING=>BLOCKED, not only is its server woken, but the
worker is also added to the blocked worker list of its server.

This change introduces the following benefits:
- block detection now behaves similarly to wake detection; without this
  patch, worker wakeups added wakees to the list and woke the server,
  while worker blocks only woke the server without adding blocked
  workers to a list, forcing servers to explicitly check the worker's
  state;
- if the blocked worker woke sufficiently quickly, the server woken on
  the block event would observe its worker as already RUNNABLE, so the
  block event had to be inferred rather than explicitly signalled by
  the worker being added to the blocked worker list;
- it is now possible for a single server to control several RUNNING
  workers, which makes writing userspace schedulers simpler for smaller
  processes that do not need to scale beyond one "server";
- if the userspace wants to keep at most a single RUNNING worker per
  server, and to have multiple servers with their own runqueues, this
  model is also naturally supported here.

So this change basically decouples block/wake detection from M:N
threading, in the sense that the number of servers no longer has to be
M or N, but is instead driven by the scalability needs of the userspace
application.

Why keep this server/worker model at all then, and not use something
like io_uring to deliver block/wake events to the userspace?
The main benefit of this model is that servers are woken synchronously
on-cpu when an event happens, while io_uring is more of an asynchronous
event framework, so latencies in this model are potentially better. In
addition, the "multiple runqueues" type of scheduling is much easier to
implement with this method than with io_uring.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/umcg.h | 10 ++++-
 kernel/sched/umcg.c       | 90 ++++++++++++++++++++++++++++-----------
 2 files changed, 75 insertions(+), 25 deletions(-)

diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
index a994bbb062d5..93fccb44283b 100644
--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -116,6 +116,14 @@ struct umcg_task {
 	__u64	blocked_ts;		/* w */
 	__u64	runnable_ts;		/* w */
 
+	/**
+	 * @blocked_workers_ptr: a single-linked list of blocked workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	blocked_workers_ptr;	/* r/w */
+
 	/**
 	 * @runnable_workers_ptr: a single-linked list of runnable workers.
 	 *
@@ -124,7 +132,7 @@ struct umcg_task {
 	 */
 	__u64	runnable_workers_ptr;	/* r/w */
 
-	__u64	__zero[3];
+	__u64	__zero[2];
 
 } __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
 
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
index 9a8755045285..b85dec6b82e4 100644
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -343,6 +343,67 @@ static int umcg_wake(struct task_struct *tsk)
 	return umcg_wake_server(tsk);
 }
 
+/*
+ * Enqueue @tsk on its server's blocked or runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based single linked list add such that list integrity is never
+ * violated. Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to BLOCKED or RUNNABLE before enqueue.
+ *
+ * Returns:
+ * 0: success
+ * -EFAULT
+ */
+static int umcg_enqueue_worker(struct task_struct *tsk, bool blocked)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, blocked ? &server->blocked_workers_ptr :
+			&server->runnable_workers_ptr, Efault);
+	do {
+		unsafe_put_user(first_ptr, blocked ? &self->blocked_workers_ptr :
+				&self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(blocked ? &server->blocked_workers_ptr :
+			&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/*
+ * Enqueue @tsk on its server's blocked list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based single linked list add such that list integrity is never
+ * violated. Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to BLOCKED before enqueue.
+ *
+ * Returns:
+ * 0: success
+ * -EFAULT
+ */
+static int umcg_enqueue_blocked(struct task_struct *tsk)
+{
+	return umcg_enqueue_worker(tsk, true /* blocked */);
+}
+
 /* pre-schedule() */
 void umcg_wq_worker_sleeping(struct task_struct *tsk)
 {
@@ -357,6 +418,9 @@ void umcg_wq_worker_sleeping(struct task_struct *tsk)
 	if (umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED))
 		UMCG_DIE_PF("state");
 
+	if (umcg_enqueue_blocked(tsk))
+		UMCG_DIE_PF("enqueue");
+
 	if (umcg_wake(tsk))
 		UMCG_DIE_PF("wake");
 
@@ -390,29 +454,7 @@ void umcg_wq_worker_running(struct task_struct *tsk)
  */
 static int umcg_enqueue_runnable(struct task_struct *tsk)
 {
-	struct umcg_task __user *server = tsk->umcg_server_task;
-	struct umcg_task __user *self = tsk->umcg_task;
-	u64 self_ptr = (unsigned long)self;
-	u64 first_ptr;
-
-	/*
-	 * umcg_pin_pages() did access_ok() on both pointers, use self here
-	 * only because __user_access_begin() isn't available in generic code.
-	 */
-	if (!user_access_begin(self, sizeof(*self)))
-		return -EFAULT;
-
-	unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
-	do {
-		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
-	} while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
-
-	user_access_end();
-	return 0;
-
-Efault:
-	user_access_end();
-	return -EFAULT;
+	return umcg_enqueue_worker(tsk, false /* !blocked */);
 }
 
 /*
@@ -821,7 +863,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t
 	if (copy_from_user(&ut, self, sizeof(ut)))
 		return -EFAULT;
 
-	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1])
 		return -EINVAL;
 
 	rcu_read_lock();
Date: Thu, 13 Jan 2022 15:39:40 -0800
Message-Id: <20220113233940.3608440-6-posk@google.com>
In-Reply-To: <20220113233940.3608440-1-posk@google.com>
References: <20220113233940.3608440-1-posk@google.com>
Subject: [RFC PATCH v2 5/5] sched: UMCG: allow to sys_umcg_kick UMCG servers
From: Peter Oskolkov <posk@google.com>
Add enum umcg_kick_flag:

@UMCG_KICK_RESCHED: reschedule the task; used for worker preemption
@UMCG_KICK_TTWU:    wake the task; used to wake servers

It is sometimes useful to wake UMCG servers from the userspace, for
example when a server detects a worker wakeup and wakes an idle server
to run the newly woken worker.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/umcg.h | 10 ++++++++++
 kernel/sched/umcg.c       |  7 +++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
index 93fccb44283b..a29e5e91a251 100644
--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -148,4 +148,14 @@ enum umcg_ctl_flag {
 	UMCG_CTL_WORKER		= 0x10000,
 };
 
+/**
+ * enum umcg_kick_flag - flags to pass to sys_umcg_kick
+ * @UMCG_KICK_RESCHED:	reschedule the task; used for worker preemption
+ * @UMCG_KICK_TTWU:	wake the task; used to wake servers
+ */
+enum umcg_kick_flag {
+	UMCG_KICK_RESCHED	= 0x001,
+	UMCG_KICK_TTWU		= 0x002,
+};
+
 #endif /* _UAPI_LINUX_UMCG_H */
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
index b85dec6b82e4..e33ec9eddc3e 100644
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -669,12 +669,15 @@ SYSCALL_DEFINE2(umcg_kick, u32, flags, pid_t, tid)
 	if (!task)
 		return -ESRCH;
 
-	if (flags)
+	if (flags != UMCG_KICK_RESCHED && flags != UMCG_KICK_TTWU)
 		return -EINVAL;
 
 #ifdef CONFIG_SMP
-	smp_send_reschedule(task_cpu(task));
+	if (flags == UMCG_KICK_RESCHED)
+		smp_send_reschedule(task_cpu(task));
 #endif
+	if (flags == UMCG_KICK_TTWU)
+		try_to_wake_up(task, TASK_NORMAL, 0);
 
 	return 0;
 }