Message ID: 20090619185138.31118.14916.stgit@dev.haskins.net
State: New, archived
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> notifier->release() callback. This lets notification clients know if
> the eventfd is about to go away and is very useful particularly for
> in-kernel clients. However, as it stands today it is not possible to
> use the notification API in a race-free way. This patch adds some
> additional logic to the notification subsystem to rectify this problem.
>
> Background:
> -----------------------
> Eventfd currently only has one reference count mechanism: fget/fput. This
> in and of itself is normally fine. However, if a client expects to be
> notified if the eventfd is closed, it cannot hold an fget() reference
> itself or the underlying f_ops->release() callback will never be invoked
> by VFS. Therefore we have this somewhat unusual situation where we may
> hold a pointer to an eventfd object (by virtue of having a waiter registered
> in its wait-queue), but no reference. This makes it nearly impossible to
> design a mutual decoupling algorithm: you cannot unhook one side from the
> other (or vice versa) without racing.

And why is that?

	struct xxx {
		struct mutex mtx;
		struct file *file;
		...
	};

	struct file *xxx_get_file(struct xxx *x) {
		struct file *file;

		mutex_lock(&x->mtx);
		file = x->file;
		if (!file)
			mutex_unlock(&x->mtx);
		return file;
	}

	void xxx_release_file(struct xxx *x) {
		mutex_unlock(&x->mtx);
	}

	void handle_POLLHUP(struct xxx *x) {
		struct file *file;

		file = xxx_get_file(x);
		if (file) {
			unhook_waitqueue(file, ...);
			x->file = NULL;
			xxx_release_file(x);
		}
	}

Every time you need to "use" file, you call xxx_get_file(), and if you get
NULL, it means it's gone and you handle it according to your IRQ fd
policies. As soon as you are done with the file, you call
xxx_release_file(). Replace "mtx" with the lock that fits your needs.

- Davide
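For illustration, this is how a signal path might sit on top of Davide's
pattern. A sketch only: it assumes the 2009-era eventfd_signal(struct file *,
int) signature that appears in the patch further down, and xxx_signal() is a
made-up name:

	/* Sketch: signal the eventfd only if it is still around. */
	static void xxx_signal(struct xxx *x)
	{
		struct file *file = xxx_get_file(x);

		if (!file)
			return;			/* eventfd already gone */

		eventfd_signal(file, 1);	/* safe: x->mtx is held */
		xxx_release_file(x);		/* drops x->mtx */
	}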
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>> notifier->release() callback. [...] This makes it nearly impossible to
>> design a mutual decoupling algorithm: you cannot unhook one side from the
>> other (or vice versa) without racing.
>
> And why is that?
>
> [... xxx_get_file()/xxx_release_file() sketch snipped; see above ...]
>
> Every time you need to "use" file, you call xxx_get_file(), and if you get
> NULL, it means it's gone and you handle it according to your IRQ fd
> policies. As soon as you are done with the file, you call
> xxx_release_file(). Replace "mtx" with the lock that fits your needs.

Consider what would happen if the f_ops->release() was preempted inside
the wake_up_locked_poll() after it dereferenced the xxx from the list,
but before it calls the callback(POLLHUP). The xxx object, and/or the
.text for the xxx object, may be long gone by the time it comes back
around. Afaict, there is no way to guard against that scenario unless
you do something like 2/3+3/3. Or am I missing something?

-Greg
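The window Gregory describes, spelled out as an interleaving (an illustrative
timeline, not verbatim kernel code):

	/*
	 * CPU0: eventfd f_ops->release()     CPU1: client teardown / rmmod
	 * ---------------------------------  -----------------------------
	 * wake_up_locked_poll(&ctx->wqh, ..)
	 *   curr = list_entry(...)           -- waiter (xxx) dereferenced
	 *   <preempted>
	 *                                     handle_POLLHUP() unhooks xxx
	 *                                     kfree(xxx)
	 *                                     module unloaded (.text gone)
	 * curr->func(curr, ..., POLLHUP)     -- use-after-free: the object
	 *                                       and the callback's .text may
	 *                                       no longer exist
	 */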
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Fri, 19 Jun 2009, Gregory Haskins wrote:
> >
> > [... patch description and xxx_get_file() sketch snipped ...]
> >
> > Every time you need to "use" file, you call xxx_get_file(), and if you
> > get NULL, it means it's gone and you handle it according to your IRQ fd
> > policies. As soon as you are done with the file, you call
> > xxx_release_file(). Replace "mtx" with the lock that fits your needs.
>
> Consider what would happen if the f_ops->release() was preempted inside
> the wake_up_locked_poll() after it dereferenced the xxx from the list,
> but before it calls the callback(POLLHUP). The xxx object, and/or the
> .text for the xxx object, may be long gone by the time it comes back
> around. Afaict, there is no way to guard against that scenario unless
> you do something like 2/3+3/3. Or am I missing something?

Right. Don't you see an easier answer to that, instead of adding 300 lines
of code to eventfd?
For example, turning wake_up_locked() into a normal wake_up().

- Davide
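Concretely, the one-liner Davide has in mind appears to be the same
substitution that shows up in the patch at the end of this thread:

	-	wake_up_locked_poll(&ctx->wqh, POLLHUP);
	+	wake_up_poll(&ctx->wqh, POLLHUP);

i.e., have the POLLHUP wakeup take wqh->lock rather than running unlocked.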
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>> [... race scenario snipped; see above ...]
>
> Right. Don't you see an easier answer to that, instead of adding 300 lines
> of code to eventfd?

I tried, but this problem made my head hurt, and this was what I came up
with that I felt closes the holes all the way. Also keep in mind that
while I added X lines to eventfd, I took Y lines *out* of irqfd in the
process, too. I just excluded the KVM portions in this thread per your
request, so it's not apparent. But technically, any other clients that
may come along can reuse that notification code instead of coding it
again. One way or the other, *someone* has to do that ptable_proc stuff ;)

FYI: it's more like 133 lines, fwiw:

 fs/eventfd.c            |  104 ++++++++++++++++++++++++++++++++++++++++++----
 include/linux/eventfd.h |   36 ++++++++++++++++
 2 files changed, 133 insertions(+), 7 deletions(-)

In case you care, here's what the complete solution currently looks like
when I include KVM:

 fs/eventfd.c            |  104 +++++++++++++++++++++++++--
 include/linux/eventfd.h |   36 +++++++++
 virt/kvm/eventfd.c      |  181 +++++++++++++++++++++++++-----------------------
 3 files changed, 228 insertions(+), 93 deletions(-)

> For example, turning wake_up_locked() into a normal wake_up().

I am fairly confident it is not that simple after having thought about
this issue over the last few days. But I've been wrong in the past.
Propose a patch and I will review it for races/correctness, if you
like. Perhaps a combination of that plus your asymmetrical locking
scheme would work. One of the challenges you will hit is avoiding ABBA
between your "get" lock and the wqh, but good luck!

-Greg
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> I am fairly confident it is not that simple after having thought about
> this issue over the last few days. But I've been wrong in the past.
> Propose a patch and I will review it for races/correctness, if you
> like. Perhaps a combination of that plus your asymmetrical locking
> scheme would work. One of the challenges you will hit is avoiding ABBA
> between your "get" lock and the wqh, but good luck!

A patch for what? The eventfd patch is a one-liner.
It seems hard to believe that the thing cannot be handled on your side.
Once the wake_up_locked() is turned into a wake_up(), what other races
are there?
Let's not try to find the problem that fits and justifies the "cool"
solution, but let's see if a problem is there at all.

- Davide
On Fri, 19 Jun 2009, Davide Libenzi wrote:

> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
> > [... snipped; see above ...]
>
> A patch for what? The eventfd patch is a one-liner.
> It seems hard to believe that the thing cannot be handled on your side.
> Once the wake_up_locked() is turned into a wake_up(), what other races
> are there?

AFAICS, the IRQfd code simply registers the callback to ->poll() and
waits for two events.
In the POLLIN event, you schedule_work(&irqfd->inject) and there are no
races there AFAICS (you basically do not care about anything eventfd
memory related at all).
For POLLHUP, you do:

	spin_lock(irqfd->slock);
	if (irqfd->wqh)
		schedule_work(&irqfd->inject);
	irqfd->wqh = NULL;
	spin_unlock(irqfd->slock);

In your work function you notice the POLLHUP condition and take proper
action (dunno what it is in your case).
In your kvm_irqfd_release() function:

	spin_lock(irqfd->slock);
	if (irqfd->wqh)
		remove_wait_queue(irqfd->wqh, &irqfd->wait);
	irqfd->wqh = NULL;
	spin_unlock(irqfd->slock);

Any races in there?

- Davide
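Pulling Davide's fragments together in one place, as a sketch (the real
struct irqfd lives in virt/kvm/eventfd.c and differs in detail; here slock
is shown as an embedded spinlock rather than the pointer his fragments
imply):

	struct irqfd {
		spinlock_t slock;
		wait_queue_head_t *wqh;		/* NULL once decoupled */
		wait_queue_t wait;
		struct work_struct inject;
		/* ... */
	};

	/* Called from the eventfd wakeup on POLLHUP. */
	static void irqfd_handle_pollhup(struct irqfd *irqfd)
	{
		spin_lock(&irqfd->slock);
		if (irqfd->wqh)
			schedule_work(&irqfd->inject);
		irqfd->wqh = NULL;
		spin_unlock(&irqfd->slock);
	}

	/* Called from kvm_irqfd_release(). */
	static void irqfd_release(struct irqfd *irqfd)
	{
		spin_lock(&irqfd->slock);
		if (irqfd->wqh)
			remove_wait_queue(irqfd->wqh, &irqfd->wait);
		irqfd->wqh = NULL;
		spin_unlock(&irqfd->slock);
	}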
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Davide Libenzi wrote:
>
>> A patch for what? The eventfd patch is a one-liner.

Yes, understood. What I was trying to gently say is that the one-liner
proposal alone is still broken, afaict. However, if there is another
solution that works that you like better than the 133-liner I posted, I
am more than willing to help analyze it. In the end, I only care that
this is fixed.

>> It seems hard to believe that the thing cannot be handled on your side.
>> Once the wake_up_locked() is turned into a wake_up(), what other races
>> are there?
>
> AFAICS, the IRQfd code simply registers the callback to ->poll() and
> waits for two events.
>
> [... POLLIN/POLLHUP handling sketch snipped; see above ...]
>
> Any races in there?

Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
(recall that wqh has to be locked to fix that other race I mentioned).

(As a hint, I think I fixed 4-5 races with these patches, so there are a
few others still lurking as well.)

-Greg
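The ABBA Gregory points to, spelled out against the sketch above
(illustrative only; lock names follow Davide's fragments):

	/*
	 * eventfd release/wakeup path:        kvm_irqfd_release() path:
	 *
	 * wake_up_poll()
	 *   spin_lock(&wqh->lock)      [A]    spin_lock(&irqfd->slock) [B]
	 *   -> irqfd wakeup callback          remove_wait_queue(wqh, ...)
	 *      spin_lock(&irqfd->slock)[B]      -> spin_lock(&wqh->lock) [A]
	 *
	 * One side acquires A then B, the other B then A: run concurrently,
	 * each can block forever on the lock the other already holds.
	 */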
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3d7fb16..934efee 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -16,8 +16,10 @@
 #include <linux/anon_inodes.h>
 #include <linux/eventfd.h>
 #include <linux/syscalls.h>
+#include <linux/kref.h>
 
 struct eventfd_ctx {
+	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
@@ -57,17 +59,24 @@ int eventfd_signal(struct file *file, int n)
 	return n;
 }
 
+static void _eventfd_release(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+	kref_put(kref, &_eventfd_release);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 
-	/*
-	 * No need to hold the lock here, since we are on the file cleanup
-	 * path and the ones still attached to the wait queue will be
-	 * serialized by wake_up_locked_poll().
-	 */
-	wake_up_locked_poll(&ctx->wqh, POLLHUP);
-	kfree(ctx);
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	_eventfd_put(&ctx->kref);
+
 	return 0;
 }
 
@@ -222,6 +231,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 	if (!ctx)
 		return -ENOMEM;
 
+	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
 	ctx->count = count;
 	ctx->flags = flags;
@@ -242,6 +252,15 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count)
 	return sys_eventfd2(count, 0);
 }
 
+enum {
+	eventfd_notifier_flag_active,
+};
+
+static int test_and_clear_active(struct eventfd_notifier *en)
+{
+	return test_and_clear_bit(eventfd_notifier_flag_active, &en->flags);
+}
+
 static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
 				   int sync, void *key)
 {
@@ -251,19 +270,15 @@ static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
 	en = container_of(wait, struct eventfd_notifier, wait);
 
 	if (flags & POLLIN)
-		/*
-		 * The POLLIN wake_up is called with interrupts disabled.
-		 */
 		en->ops->signal(en);
 
 	if (flags & POLLHUP) {
-		/*
-		 * The POLLHUP is called unlocked, so it theoretically should
-		 * be safe to remove ourselves from the wqh using the locked
-		 * variant of remove_wait_queue()
-		 */
-		remove_wait_queue(en->wqh, &en->wait);
-		en->ops->release(en);
+
+		if (test_and_clear_active(en)) {
+			__remove_wait_queue(en->wqh, &en->wait);
+			_eventfd_put(en->eventfd);
+			en->ops->release(en);
+		}
 	}
 
 	return 0;
@@ -283,11 +298,14 @@ static void eventfd_notifier_ptable_enqueue(struct file *file,
 
 int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
 {
+	struct eventfd_ctx *ctx;
 	unsigned int events;
 
 	if (file->f_op != &eventfd_fops)
 		return -EINVAL;
 
+	ctx = file->private_data;
+
 	/*
 	 * Install our own custom wake-up handling so we are notified via
 	 * a callback whenever someone signals the underlying eventfd
@@ -297,12 +315,20 @@ int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
 
 	events = file->f_op->poll(file, &en->pt);
 
+	kref_get(&ctx->kref);
+	en->eventfd = &ctx->kref;
+
+	set_bit(eventfd_notifier_flag_active, &en->flags);
+
 	return (events & POLLIN) ? 1 : 0;
 }
 EXPORT_SYMBOL_GPL(eventfd_notifier_register);
 
 void eventfd_notifier_unregister(struct eventfd_notifier *en)
 {
-	remove_wait_queue(en->wqh, &en->wait);
+	if (test_and_clear_active(en)) {
+		remove_wait_queue(en->wqh, &en->wait);
+		_eventfd_put(en->eventfd);
+	}
 }
 EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
 
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index cb23969..2b6e239 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -12,6 +12,7 @@
 #include <linux/poll.h>
 #include <linux/file.h>
 #include <linux/list.h>
+#include <linux/kref.h>
 
 struct eventfd_notifier;
 
@@ -24,6 +25,8 @@ struct eventfd_notifier {
 	poll_table pt;
 	wait_queue_head_t *wqh;
 	wait_queue_t wait;
+	struct kref *eventfd;
+	unsigned long flags;
 	const struct eventfd_notifier_ops *ops;
 };
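For orientation, a minimal sketch of how an in-kernel client might sit on
top of the notifier API above. The my_* names are hypothetical; wiring up
en->wait and en->pt happens inside eventfd_notifier_register() via the
custom wake-up handling the patch installs there:

	/* Hypothetical client (names are illustrative, not from the patch). */
	static void my_signal(struct eventfd_notifier *en)
	{
		/* eventfd was written to (POLLIN): kick the pending work */
	}

	static void my_release(struct eventfd_notifier *en)
	{
		/* eventfd went away (POLLHUP): tear down this side */
	}

	static const struct eventfd_notifier_ops my_ops = {
		.signal  = my_signal,
		.release = my_release,
	};

	static int my_attach(struct file *file, struct eventfd_notifier *en)
	{
		en->ops = &my_ops;

		/*
		 * Returns 1 if the eventfd is already signaled, 0 if not,
		 * or -EINVAL if "file" is not an eventfd.
		 */
		return eventfd_notifier_register(file, en);
	}

	static void my_detach(struct eventfd_notifier *en)
	{
		/*
		 * Safe against a concurrent POLLHUP: only one side wins the
		 * test_and_clear_active() race and drops the ctx kref.
		 */
		eventfd_notifier_unregister(en);
	}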
eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate
a notifier->release() callback. This lets notification clients know if
the eventfd is about to go away, and is very useful particularly for
in-kernel clients. However, as it stands today it is not possible to
use the notification API in a race-free way. This patch adds some
additional logic to the notification subsystem to rectify this problem.

Background:
-----------------------
Eventfd currently only has one reference count mechanism: fget/fput.
This in and of itself is normally fine. However, if a client expects to
be notified if the eventfd is closed, it cannot hold an fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS. Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter
registered in its wait-queue), but no reference. This makes it nearly
impossible to design a mutual decoupling algorithm: you cannot unhook
one side from the other (or vice versa) without racing.

The first problem was dealt with by essentially "donating" a surrogate
object to eventfd. In other words, when a client attached to eventfd
and then later detached, it would decouple internally in a race-free way
and leave part of the object still attached to the eventfd. This
decoupled object would correctly detect the end of life of the eventfd
object at some point in the future and be deallocated. However, we
cannot guarantee that this operation would not race with a potential
rmmod of the client, and it is therefore broken.

Solution Details:
-----------------------
1) We add a private kref to the internal eventfd_ctx object. This
   reference can be (transparently) held by notification clients
   without affecting the ability of VFS to deliver the ->release()
   notification.

2) We convert the current lockless POLLHUP to a more traditional locked
   variant (*) so that we can ensure a race-free mutual-decouple
   algorithm without requiring a surrogate object.

3) We guard the decouple algorithm with an atomic bit-clear to ensure
   mutual exclusion between the decoupling and the reference drop.

4) We hold a reference to the underlying eventfd_ctx until all paths
   have satisfactorily completed, to ensure we do not race with the
   eventfd going away.

Between these points, we believe we now have a race-free release
mechanism.

[*] Clients that previously assumed the ->release() could sleep will
need to be refactored.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Michael S. Tsirkin <mst@redhat.com>
---

 fs/eventfd.c            |   62 +++++++++++++++++++++++++++++++++--------------
 include/linux/eventfd.h |    3 ++
 2 files changed, 47 insertions(+), 18 deletions(-)