From patchwork Wed Jan 29 17:20:10 2020
X-Patchwork-Submitter: Stefan Hajnoczi
X-Patchwork-Id: 11356623
From: Stefan Hajnoczi <stefanha@redhat.com>
To: linux-fsdevel@vger.kernel.org
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Avi Kivity,
    Davide Libenzi, Alexander Viro, Stefan Hajnoczi
Subject: [RFC] eventfd: add EFD_AUTORESET flag
Date: Wed, 29 Jan 2020 17:20:10 +0000
Message-Id: <20200129172010.162215-1-stefanha@redhat.com>

Some applications use eventfd simply for inter-thread notifications and
do not need its counter or semaphore semantics. They wait for the
eventfd to become readable with poll(2)/select(2) and then call read(2)
to reset the counter.

This patch adds the EFD_AUTORESET flag, which resets the counter when
f_ops->poll() finds the eventfd readable. This eliminates the need to
call read(2) to reset the counter.

The result is a small but measurable 1% performance improvement with
QEMU virtio-blk emulation: according to perf, each read(2) takes 1
microsecond of execution time in the event loop.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
Does this look like a reasonable thing to do? I'm not very familiar
with f_ops->poll() or the eventfd internals, so maybe I'm overlooking a
design flaw.
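For illustration, the intended userspace pattern looks roughly like
this (a sketch, not part of the patch; error handling is omitted, and
EFD_AUTORESET is defined by hand because it is not in any released
uapi header):

#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

#ifndef EFD_AUTORESET
#define EFD_AUTORESET (1 << 6)          /* value chosen by this patch */
#endif

int main(void)
{
        int efd = eventfd(0, EFD_AUTORESET | EFD_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        struct epoll_event out;
        uint64_t n = 1;

        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

        write(efd, &n, sizeof(n));      /* notify; normally from another thread */

        epoll_wait(epfd, &out, 1, -1);  /* poll resets the counter here... */
        /* ...so no read(2) is needed before waiting for the next event */

        close(epfd);
        close(efd);
        return 0;
}

The point is the read(2) call that disappears after epoll_wait()
returns.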
I've tested this with QEMU and it works fine:
https://github.com/stefanha/qemu/commits/eventfd-autoreset
---
 fs/eventfd.c            | 99 +++++++++++++++++++++++++----------------
 include/linux/eventfd.h |  3 +-
 2 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..208f6b9e2234 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -116,45 +116,62 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, &ctx->wqh, wait);
 
-	/*
-	 * All writes to ctx->count occur within ctx->wqh.lock. This read
-	 * can be done outside ctx->wqh.lock because we know that poll_wait
-	 * takes that lock (through add_wait_queue) if our caller will sleep.
-	 *
-	 * The read _can_ therefore seep into add_wait_queue's critical
-	 * section, but cannot move above it! add_wait_queue's spin_lock acts
-	 * as an acquire barrier and ensures that the read be ordered properly
-	 * against the writes. The following CAN happen and is safe:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     count = ctx->count
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        if (waitqueue_active)
-	 *                                          wake_up_locked_poll
-	 *                                        unlock ctx->qwh.lock
-	 *                                        eventfd_poll returns 0
-	 *
-	 * but the following, which would miss a wakeup, cannot happen:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     count = ctx->count (INVALID!)
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        **waitqueue_active is false**
-	 *                                        **no wake_up_locked_poll!**
-	 *                                        unlock ctx->qwh.lock
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *     eventfd_poll returns 0
-	 */
-	count = READ_ONCE(ctx->count);
+	if (ctx->flags & EFD_AUTORESET) {
+		unsigned long flags;
+		__poll_t requested = poll_requested_events(wait);
+
+		spin_lock_irqsave(&ctx->wqh.lock, flags);
+		count = ctx->count;
+
+		/* Reset counter if caller is polling for read */
+		if (count != 0 && (requested & EPOLLIN)) {
+			ctx->count = 0;
+			events |= EPOLLOUT;
+			/* TODO is an EPOLLOUT wakeup necessary here? */
+		}
+
+		spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+	} else {
+		/*
+		 * All writes to ctx->count occur within ctx->wqh.lock. This read
+		 * can be done outside ctx->wqh.lock because we know that poll_wait
+		 * takes that lock (through add_wait_queue) if our caller will sleep.
+		 *
+		 * The read _can_ therefore seep into add_wait_queue's critical
+		 * section, but cannot move above it! add_wait_queue's spin_lock acts
+		 * as an acquire barrier and ensures that the read be ordered properly
+		 * against the writes. The following CAN happen and is safe:
+		 *
+		 *     poll                               write
+		 *     -----------------                  ------------
+		 *     lock ctx->wqh.lock (in poll_wait)
+		 *     count = ctx->count
+		 *     __add_wait_queue
+		 *     unlock ctx->wqh.lock
+		 *                                        lock ctx->qwh.lock
+		 *                                        ctx->count += n
+		 *                                        if (waitqueue_active)
+		 *                                          wake_up_locked_poll
+		 *                                        unlock ctx->qwh.lock
+		 *                                        eventfd_poll returns 0
+		 *
+		 * but the following, which would miss a wakeup, cannot happen:
+		 *
+		 *     poll                               write
+		 *     -----------------                  ------------
+		 *     count = ctx->count (INVALID!)
+		 *                                        lock ctx->qwh.lock
+		 *                                        ctx->count += n
+		 *                                        **waitqueue_active is false**
+		 *                                        **no wake_up_locked_poll!**
+		 *                                        unlock ctx->qwh.lock
+		 *     lock ctx->wqh.lock (in poll_wait)
+		 *     __add_wait_queue
+		 *     unlock ctx->wqh.lock
+		 *     eventfd_poll returns 0
+		 */
+		count = READ_ONCE(ctx->count);
+	}
 
 	if (count > 0)
 		events |= EPOLLIN;
@@ -400,6 +417,10 @@ static int do_eventfd(unsigned int count, int flags)
 	if (flags & ~EFD_FLAGS_SET)
 		return -EINVAL;
 
+	/* Semaphore semantics don't make sense when autoreset is enabled */
+	if ((flags & EFD_SEMAPHORE) && (flags & EFD_AUTORESET))
+		return -EINVAL;
+
 	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ffcc7724ca21..27577fafc553 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -21,11 +21,12 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1 << 0)
+#define EFD_AUTORESET (1 << 6) /* aliases O_CREAT */
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_AUTORESET)
 
 struct eventfd_ctx;
 struct file;
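
As a quick userspace check of the intended semantics (again a sketch,
not part of the patch, assuming the EFD_AUTORESET value defined above):
the first poll() should report the eventfd readable and reset the
counter as a side effect, so a second poll() should find nothing even
though no read(2) happened in between.

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

#ifndef EFD_AUTORESET
#define EFD_AUTORESET (1 << 6)          /* from this patch */
#endif

int main(void)
{
        int efd = eventfd(0, EFD_AUTORESET);
        struct pollfd pfd = { .fd = efd, .events = POLLIN };
        uint64_t n = 1;

        write(efd, &n, sizeof(n));

        /* First poll sees a nonzero counter and resets it... */
        printf("first poll:  %d\n", poll(&pfd, 1, 0));  /* expect 1 */

        /* ...so the second poll finds the eventfd no longer readable */
        printf("second poll: %d\n", poll(&pfd, 1, 0));  /* expect 0 */

        close(efd);
        return 0;
}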