diff mbox

[RFC,1/2] fs,eventpoll: Add ability to install target file by its number

Message ID 20170217083324.627615532@openvz.org (mailing list archive)
State New, archived
Headers show

Commit Message

Cyrill Gorcunov Feb. 17, 2017, 8:30 a.m. UTC
When we checkpoint a process we look into /proc/<pid>/fdinfo/<fd> of eventpoll
file and parse target files list from there. In most situations this is fine
because target file is present in the /proc/<pid>/fd/ list. But in case if file
descriptor was dup'ed or transferred via unix socket and closed after,
it might not be in the list and we can't figure out which file descriptor
to pass into epoll_ctl call.

To resolve this tie lets add EPOLL_CTL_DUP operation which simply takes
target file descriptor number and installs it into a caller's file table,
thus we can use kcmp() syscall and figure out which exactly file to be
added into eventpoll on restore procedure.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Andrey Vagin <avagin@openvz.org>
CC: Pavel Emelyanov <xemul@virtuozzo.com>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Andrew Morton <akpm@linuxfoundation.org>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Kir Kolyshkin <kir@openvz.org>
---
 fs/eventpoll.c                 |   74 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/eventpoll.h |    1 
 2 files changed, 65 insertions(+), 10 deletions(-)

Comments

Andy Lutomirski Feb. 17, 2017, 4:52 p.m. UTC | #1
On Fri, Feb 17, 2017 at 12:30 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> When we checkpoint a process we look into /proc/<pid>/fdinfo/<fd> of eventpoll
> file and parse target files list from there. In most situations this is fine
> because target file is present in the /proc/<pid>/fd/ list. But in case if file
> descriptor was dup'ed or transferred via unix socket and closed after,
> it might not be in the list and we can't figure out which file descriptor
> to pass into epoll_ctl call.
>
> To resolve this tie lets add EPOLL_CTL_DUP operation which simply takes
> target file descriptor number and installs it into a caller's file table,
> thus we can use kcmp() syscall and figure out which exactly file to be
> added into eventpoll on restore procedure.

This is a scary thing to let an unprivileged process do.

I'm wondering if there might be a nicer way to address this using a
better interface in /proc.
Cyrill Gorcunov Feb. 17, 2017, 5:11 p.m. UTC | #2
On Fri, Feb 17, 2017 at 08:52:59AM -0800, Andy Lutomirski wrote:
> >
> > To resolve this tie lets add EPOLL_CTL_DUP operation which simply takes
> > target file descriptor number and installs it into a caller's file table,
> > thus we can use kcmp() syscall and figure out which exactly file to be
> > added into eventpoll on restore procedure.
> 
> This is a scary thing to let an unprivileged process do.
> 
> I'm wondering if there might be a nicer way to address this using a
> better interface in /proc.

Well, I tend to agree. Need to add security checking if the target
file is accessable by a caller. As to better interface to procfs
nothing comes to mind immediately. Another potential problem is that
since it is never guaranteed that target file number listed in fdinfo
matching existing /proc/pid/fd/N, so that I think we will have to
use this dup functionality for every target file, which of course
not that fast. Probably need to think more if I manage to invent
some better and faster interface to find where exactly target file
belong in the whole process tree of a container.

Thanks for pointing about this security problem, Andy!
Andy Lutomirski Feb. 17, 2017, 5:20 p.m. UTC | #3
On Fri, Feb 17, 2017 at 9:11 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Fri, Feb 17, 2017 at 08:52:59AM -0800, Andy Lutomirski wrote:
>> >
>> > To resolve this tie lets add EPOLL_CTL_DUP operation which simply takes
>> > target file descriptor number and installs it into a caller's file table,
>> > thus we can use kcmp() syscall and figure out which exactly file to be
>> > added into eventpoll on restore procedure.
>>
>> This is a scary thing to let an unprivileged process do.
>>
>> I'm wondering if there might be a nicer way to address this using a
>> better interface in /proc.
>
> Well, I tend to agree. Need to add security checking if the target
> file is accessable by a caller. As to better interface to procfs
> nothing comes to mind immediately. Another potential problem is that
> since it is never guaranteed that target file number listed in fdinfo
> matching existing /proc/pid/fd/N, so that I think we will have to
> use this dup functionality for every target file, which of course
> not that fast. Probably need to think more if I manage to invent
> some better and faster interface to find where exactly target file
> belong in the whole process tree of a container.

i was imagining some proc or proc-like interface that lets you inspect
an open file without needing to know what process has the fd.

What if you introduced a new type of fd that's an "fd reference".  You
could add a kcmp mode that tells you whether an fd reference refers to
the same thing as a real fd, but you'd arrange for fd references to be
otherwise useless.

Alternatively, you could simply have an interface like kcmp (maybe a
new kcmp mode) that lets you compare an epoll set entry to an actual
fd.  Then you could figure out what it is but only if you already have
the fd by some other means.  Of course, if there are no references
left, you still have a problem.  Hmm.

>
> Thanks for pointing about this security problem, Andy!
Cyrill Gorcunov Feb. 17, 2017, 6:01 p.m. UTC | #4
On Fri, Feb 17, 2017 at 09:20:34AM -0800, Andy Lutomirski wrote:
> >>
> >> This is a scary thing to let an unprivileged process do.
> >>
> >> I'm wondering if there might be a nicer way to address this using a
> >> better interface in /proc.
> >
> > Well, I tend to agree. Need to add security checking if the target
> > file is accessable by a caller. As to better interface to procfs
> > nothing comes to mind immediately. Another potential problem is that
> > since it is never guaranteed that target file number listed in fdinfo
> > matching existing /proc/pid/fd/N, so that I think we will have to
> > use this dup functionality for every target file, which of course
> > not that fast. Probably need to think more if I manage to invent
> > some better and faster interface to find where exactly target file
> > belong in the whole process tree of a container.
> 
> i was imagining some proc or proc-like interface that lets you inspect
> an open file without needing to know what process has the fd.

We will have to find out which process opened the target fd (currently
in criu we simply assume that target file is always in process which
created epoll and other scenarios are not supported. in most situations
that's enough but unfortunately we find a testcase where it's not
true and have to find a way to support migrated targets too).

> What if you introduced a new type of fd that's an "fd reference".  You
> could add a kcmp mode that tells you whether an fd reference refers to
> the same thing as a real fd, but you'd arrange for fd references to be
> otherwise useless.
> 
> Alternatively, you could simply have an interface like kcmp (maybe a
> new kcmp mode) that lets you compare an epoll set entry to an actual
> fd.  Then you could figure out what it is but only if you already have
> the fd by some other means.  Of course, if there are no references
> left, you still have a problem.  Hmm.

kcmp over target set seems to be preferred since it gonna be a way
faster. But I think about fd-ref idea too. Thanks a huge!
Jason Baron Feb. 17, 2017, 6:34 p.m. UTC | #5
On 02/17/2017 01:01 PM, Cyrill Gorcunov wrote:
> On Fri, Feb 17, 2017 at 09:20:34AM -0800, Andy Lutomirski wrote:
>>>>
>>>> This is a scary thing to let an unprivileged process do.
>>>>
>>>> I'm wondering if there might be a nicer way to address this using a
>>>> better interface in /proc.
>>>
>>> Well, I tend to agree. Need to add security checking if the target
>>> file is accessable by a caller. As to better interface to procfs
>>> nothing comes to mind immediately. Another potential problem is that
>>> since it is never guaranteed that target file number listed in fdinfo
>>> matching existing /proc/pid/fd/N, so that I think we will have to
>>> use this dup functionality for every target file, which of course
>>> not that fast. Probably need to think more if I manage to invent
>>> some better and faster interface to find where exactly target file
>>> belong in the whole process tree of a container.
>>
>> i was imagining some proc or proc-like interface that lets you inspect
>> an open file without needing to know what process has the fd.
>
> We will have to find out which process opened the target fd (currently
> in criu we simply assume that target file is always in process which
> created epoll and other scenarios are not supported. in most situations
> that's enough but unfortunately we find a testcase where it's not
> true and have to find a way to support migrated targets too).
>
>> What if you introduced a new type of fd that's an "fd reference".  You
>> could add a kcmp mode that tells you whether an fd reference refers to
>> the same thing as a real fd, but you'd arrange for fd references to be
>> otherwise useless.
>>
>> Alternatively, you could simply have an interface like kcmp (maybe a
>> new kcmp mode) that lets you compare an epoll set entry to an actual
>> fd.  Then you could figure out what it is but only if you already have
>> the fd by some other means.  Of course, if there are no references
>> left, you still have a problem.  Hmm.
>
> kcmp over target set seems to be preferred since it gonna be a way
> faster. But I think about fd-ref idea too. Thanks a huge!
>

I sort of like kcmp() for this as well maybe the epoll fd argument can 
be a pointer to something like:

struct epoll_slot {
	int epfd;
	int slot;
};

Where 'epfd' is the epoll fd, and slot is just an index into the target 
fds associated with the epoll fd. That is, you would start from 0 and go 
up to n. When you reach the last one you get in -EBADSLT or something.

Regarding Andy's point of of not having a reference by some other means, 
I think you have to have it basically otherwise it would be removed from 
the epfd. I guess it could be being passed via a unix socket? But there 
must be a way to get at those anyways, in order to restore them.

Also, in the initial patch this function can not restore all target 
epoll files:

'ep_find_tfd(struct eventpoll *ep, int tfd)'

Because the 'target' file for an epfd is the pair: {fd, file}. Thus, you 
could have say two different 'target' files with the same fd number.

Thanks,

-Jason
Cyrill Gorcunov Feb. 17, 2017, 7:45 p.m. UTC | #6
On Fri, Feb 17, 2017 at 01:34:04PM -0500, Jason Baron wrote:
> > 
> > kcmp over target set seems to be preferred since it gonna be a way
> > faster. But I think about fd-ref idea too. Thanks a huge!
> > 
> 
> I sort of like kcmp() for this as well maybe the epoll fd argument can be a
> pointer to something like:
> 
> struct epoll_slot {
> 	int epfd;
> 	int slot;
> };
> 
> Where 'epfd' is the epoll fd, and slot is just an index into the target fds
> associated with the epoll fd. That is, you would start from 0 and go up to
> n. When you reach the last one you get in -EBADSLT or something.

Yes, I think we can even extend the slot and pass it as a pointer to
target fd set which are to be tested.

> Regarding Andy's point of of not having a reference by some other means, I
> think you have to have it basically otherwise it would be removed from the
> epfd. I guess it could be being passed via a unix socket? But there must be
> a way to get at those anyways, in order to restore them.

Yes. But the file may be obtained from unix sock, then dup'ed and installed
and then this dup'ed file get closed but unix instance is still alive. And
then we need to figure out that it's unix socket from where original target
file came from.

> Also, in the initial patch this function can not restore all target epoll
> files:
> 
> 'ep_find_tfd(struct eventpoll *ep, int tfd)'
> 
> Because the 'target' file for an epfd is the pair: {fd, file}. Thus, you
> could have say two different 'target' files with the same fd number.

Yes, but if we would have _DUP'ed them (ie installed into caller context)
we can find out their order and restore properly I think.

Thanks for comments, I'll think about all them at the weekend or early
next week.

	Cyrill
diff mbox

Patch

Index: linux-ml.git/fs/eventpoll.c
===================================================================
--- linux-ml.git.orig/fs/eventpoll.c
+++ linux-ml.git/fs/eventpoll.c
@@ -361,7 +361,7 @@  static inline struct epitem *ep_item_fro
 /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
 static inline int ep_op_has_event(int op)
 {
-	return op != EPOLL_CTL_DEL;
+	return op != EPOLL_CTL_DEL && op != EPOLL_CTL_DUP;
 }
 
 /* Initialize the poll safe wake up structure */
@@ -967,6 +967,20 @@  free_uid:
 	return error;
 }
 
+static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd)
+{
+	struct rb_node *rbp;
+	struct epitem *epi;
+
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		epi = rb_entry(rbp, struct epitem, rbn);
+		if (epi->ffd.fd == tfd)
+			return epi;
+	}
+
+	return NULL;
+}
+
 /*
  * Search the file inside the eventpoll tree. The RB tree operations
  * are protected by the "mtx" mutex, and ep_find() must be called with
@@ -979,6 +993,9 @@  static struct epitem *ep_find(struct eve
 	struct epitem *epi, *epir = NULL;
 	struct epoll_filefd ffd;
 
+	if (unlikely(!file))
+		return ep_find_tfd(ep, fd);
+
 	ep_set_ffd(&ffd, file, fd);
 	for (rbp = ep->rbr.rb_node; rbp; ) {
 		epi = rb_entry(rbp, struct epitem, rbn);
@@ -1787,6 +1804,28 @@  static void clear_tfile_check_list(void)
 	INIT_LIST_HEAD(&tfile_check_list);
 }
 
+static int ep_install_tfd(struct eventpoll *ep, struct epitem *epi)
+{
+	struct file *file;
+	int ret = -ENOENT;
+
+	rcu_read_lock();
+	if (get_file_rcu(epi->ffd.file))
+		file = epi->ffd.file;
+	else
+		file = NULL;
+	rcu_read_unlock();
+
+	if (file) {
+		ret = get_unused_fd_flags(0);
+		if (ret >= 0)
+			fd_install(ret, file);
+		else
+			fput(file);
+	}
+	return ret;
+}
+
 /*
  * Open an eventpoll file descriptor.
  */
@@ -1867,15 +1906,24 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, in
 	if (!f.file)
 		goto error_return;
 
-	/* Get the "struct file *" for the target file */
-	tf = fdget(fd);
-	if (!tf.file)
-		goto error_fput;
-
-	/* The target file descriptor must support poll */
-	error = -EPERM;
-	if (!tf.file->f_op->poll)
-		goto error_tgt_fput;
+	if (likely(op != EPOLL_CTL_DUP)) {
+		/* Get the "struct file *" for the target file */
+		tf = fdget(fd);
+		if (!tf.file)
+			goto error_fput;
+
+		/* The target file descriptor must support poll */
+		error = -EPERM;
+		if (!tf.file->f_op->poll)
+			goto error_tgt_fput;
+	} else {
+		/*
+		 * A special case where target file
+		 * is to be looked up and installed
+		 * into a caller.
+		 */
+		memset(&tf, 0, sizeof(tf));
+	}
 
 	/* Check if EPOLLWAKEUP is allowed */
 	if (ep_op_has_event(op))
@@ -1972,6 +2020,12 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, in
 		else
 			error = -ENOENT;
 		break;
+	case EPOLL_CTL_DUP:
+		if (epi)
+			error = ep_install_tfd(ep, epi);
+		else
+			error = -ENOENT;
+		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
 			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
Index: linux-ml.git/include/uapi/linux/eventpoll.h
===================================================================
--- linux-ml.git.orig/include/uapi/linux/eventpoll.h
+++ linux-ml.git/include/uapi/linux/eventpoll.h
@@ -25,6 +25,7 @@ 
 #define EPOLL_CTL_ADD 1
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
+#define EPOLL_CTL_DUP 4
 
 /* Set exclusive wakeup mode for the target file descriptor */
 #define EPOLLEXCLUSIVE (1 << 28)