Message ID | 1243893048-17031-18-git-send-email-ebiederm@xmission.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
On Mon, 1 Jun 2009, Eric W. Biederman wrote:

> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
>  fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 files changed, 32 insertions(+), 7 deletions(-)

This patchset gives me the willies for the amount of changes and possible impact on many subsystems.

Without having looked at the details, are you aware that epoll does not act like poll/select, and fds are not automatically removed (as in, dequeued from the poll wait queue) in any foreseeable amount of time after a POLLERR is received? As far as the userspace API goes, they have the right to remain there.

- Davide
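For readers who do not have the epoll semantics in mind, here is a minimal userspace sketch of the behaviour Davide describes. This is ordinary epoll(7) usage, not code from the patchset: once a file is registered, an EPOLLERR condition is reported on every epoll_wait() until the program itself deletes or closes the descriptor; nothing is dequeued automatically.

/*
 * Demonstrates that epoll keeps reporting EPOLLERR for a registered fd;
 * it is never removed from the interest set behind the program's back.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
	int pfd[2], epfd, i;
	struct epoll_event ev = { .events = EPOLLOUT }, out;

	if (pipe(pfd) < 0)
		return 1;
	epfd = epoll_create(1);
	ev.data.fd = pfd[1];
	epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[1], &ev);

	close(pfd[0]);	/* reader gone: the write end now reports EPOLLERR */

	for (i = 0; i < 3; i++) {
		if (epoll_wait(epfd, &out, 1, 0) == 1)
			printf("events 0x%x\n", out.events);	/* EPOLLERR, every time */
	}
	/* only EPOLL_CTL_DEL or close(pfd[1]) removes the registration */
	return 0;
}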
Davide Libenzi <davidel@xmailserver.org> writes:

> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>
>> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> ---
>>  fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>>  1 files changed, 32 insertions(+), 7 deletions(-)
>
> This patchset gives me the willies for the amount of changes and possible impact on many subsystems.

It both is and is not that bad. It is the cost of adding a lock.

For the VFS, except for nfsd, I have touched everything that needs to be touched. Other subsystems that open, read/write, and close files should be able to use the existing VFS helpers, so they don't need to know about the new locking explicitly.

Actually taking advantage of this infrastructure in a subsystem is comparatively easy. It took me about an hour to get uio using it. That part is not deep by any means and is opt in.

> Without having looked at the details, are you aware that epoll does not act like poll/select, and fds are not automatically removed (as in, dequeued from the poll wait queue) in any foreseeable amount of time after a POLLERR is received?

Yes, I am aware of how epoll acts differently.

> As far as the userspace API goes, they have the right to remain there.

I absolutely agree.

Currently I have the code acting like close() with respect to epoll, and just having the file descriptor vanish at the end of the revoke. While the revoke is in progress you get EIO. The file descriptor is not freed by a revoke operation, so you can happily hang onto it as long as you want.

I thought of doing something more uniform to user space. But once I observed that the existing epoll punts on the case of a file descriptor being closed, and that the locking needed to go from a file to the other epoll data structures is pretty horrid, I said forget it and used the existing close behaviour.

Eric
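To make the semantics Eric describes concrete, here is a sketch of what a program might observe under the proposed revoke behaviour. This is illustrative only; the device node name is made up, and how a revoke actually gets triggered is outside this thread.

/*
 * Illustration of the proposed semantics, not code from the series:
 * operations on a revoked file return EIO, the descriptor itself stays
 * valid, and epoll treats the file as if it had been closed once the
 * revoke completes.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	char buf[16];
	int fd = open("/dev/some_hotplug_device", O_RDONLY);	/* hypothetical node */

	/* ... the underlying device is unplugged / revoked here ... */

	if (read(fd, buf, sizeof(buf)) < 0 && errno == EIO)
		printf("revoke in progress: EIO\n");

	/*
	 * The fd is not freed by the revoke; userspace can hang onto it and
	 * must still close() it itself, exactly as with a normal file.
	 */
	close(fd);
	return 0;
}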
On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> Davide Libenzi <davidel@xmailserver.org> writes:
>
> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> >
> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
> >>
> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> >> ---
> >>  fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
> >>  1 files changed, 32 insertions(+), 7 deletions(-)
> >
> > This patchset gives me the willies for the amount of changes and possible impact on many subsystems.
>
> It both is and is not that bad. It is the cost of adding a lock.

We both know that it is not only the cost of a lock, but also the sprinkling, over a pretty vast number of subsystems, of another layer of code.

> I thought of doing something more uniform to user space. But once I observed that the existing epoll punts on the case of a file descriptor being closed, and that the locking needed to go from a file to the other epoll data structures is pretty horrid, I said forget it and used the existing close behaviour.

Well, you cannot rely on the caller to tidy up the epoll fd by issuing an epoll_ctl(DEL), so you do *need* to "punt" on close in order not to leave lingering crap around. You cannot even hold a reference to the file, since otherwise the epoll hooking would have to trigger not only at ->release() time but at every close, where you would have to figure out whether this is the last real userspace reference or not. Plus all the issues related to holding permanent extra references to userspace files.

And since a file can be added to many epoll devices, you need to unregister it from all of them (hence the lookup of the other data structures).

Better this, on the slow path, with locks acquired only in the epoll usage case, than something else on the fast path, for every file.

- Davide
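The "->release() time, not every close" point is standard, documented epoll behaviour (see the epoll(7) man page): a file is removed from an epoll set only when every descriptor referring to the same open file description is gone. A small userspace illustration, not code from the patchset:

/*
 * Closing one of several descriptors for the same file does not remove the
 * epoll registration; only closing the last reference does.
 */
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
	int pfd[2], epfd, dupfd;
	struct epoll_event ev = { .events = EPOLLIN };

	if (pipe(pfd) < 0)
		return 1;
	epfd = epoll_create(1);
	ev.data.fd = pfd[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);

	dupfd = dup(pfd[0]);

	close(pfd[0]);	/* registration survives: dupfd still references the file */
	close(dupfd);	/* last reference gone: epoll now forgets the file */

	close(epfd);
	close(pfd[1]);
	return 0;
}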
Davide Libenzi <davidel@xmailserver.org> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> Davide Libenzi <davidel@xmailserver.org> writes:
>>
>> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> >
>> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>> >>
>> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> >> ---
>> >>  fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>> >>  1 files changed, 32 insertions(+), 7 deletions(-)
>> >
>> > This patchset gives me the willies for the amount of changes and possible impact on many subsystems.
>>
>> It both is and is not that bad. It is the cost of adding a lock.
>
> We both know that it is not only the cost of a lock, but also the sprinkling, over a pretty vast number of subsystems, of another layer of code.

I am not clear what problem you have.

Is it the sprinkling of the code that takes and releases the lock? Just the VFS needs to be involved with that. It is a slightly larger surface area than doing the work inside the file operations, as we sometimes call the same method from 3-4 different places, but it is definitely a bounded problem.

Is it putting in the handful of lines per subsystem to actually use this functionality? At that level, something generic that is maintained outside of the subsystem is better than the mess we have with 4-5 different implementations in the subsystems that need it, each with a different assortment of bugs.

>> I thought of doing something more uniform to user space. But once I observed that the existing epoll punts on the case of a file descriptor being closed, and that the locking needed to go from a file to the other epoll data structures is pretty horrid, I said forget it and used the existing close behaviour.
>
> Well, you cannot rely on the caller to tidy up the epoll fd by issuing an epoll_ctl(DEL), so you do *need* to "punt" on close in order not to leave lingering crap around. You cannot even hold a reference to the file, since otherwise the epoll hooking would have to trigger not only at ->release() time but at every close, where you would have to figure out whether this is the last real userspace reference or not. Plus all the issues related to holding permanent extra references to userspace files.
>
> And since a file can be added to many epoll devices, you need to unregister it from all of them (hence the lookup of the other data structures).
>
> Better this, on the slow path, with locks acquired only in the epoll usage case, than something else on the fast path, for every file.

Sure, and that is largely why I am preserving those semantics.

Eric
On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> I am not clear what problem you have.
>
> Is it the sprinkling of the code that takes and releases the lock? Just the VFS needs to be involved with that. It is a slightly larger surface area than doing the work inside the file operations, as we sometimes call the same method from 3-4 different places, but it is definitely a bounded problem.
>
> Is it putting in the handful of lines per subsystem to actually use this functionality? At that level, something generic that is maintained outside of the subsystem is better than the mess we have with 4-5 different implementations in the subsystems that need it, each with a different assortment of bugs.

Come on, in the open fast path alone there are at least two spin lock/unlock pairs and two atomic ops, without even starting to count all the extra branches and code added. Is this stuff *really* needed, or can we fairly happily live w/out?

- Davide
Davide Libenzi <davidel@xmailserver.org> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> I am not clear what problem you have.
>>
>> Is it the sprinkling of the code that takes and releases the lock? Just the VFS needs to be involved with that. It is a slightly larger surface area than doing the work inside the file operations, as we sometimes call the same method from 3-4 different places, but it is definitely a bounded problem.
>>
>> Is it putting in the handful of lines per subsystem to actually use this functionality? At that level, something generic that is maintained outside of the subsystem is better than the mess we have with 4-5 different implementations in the subsystems that need it, each with a different assortment of bugs.
>
> Come on, in the open fast path alone there are at least two spin lock/unlock pairs and two atomic ops, without even starting to count all the extra branches and code added. Is this stuff *really* needed, or can we fairly happily live w/out?

What code are you talking about?

To the open path I add a few memory writes and an smp_wmb. No atomics and no spin lock/unlocks.

Are you complaining because I retain the file_list?

Eric
On Wed, 3 Jun 2009, Eric W. Biederman wrote:

> What code are you talking about?
>
> To the open path I add a few memory writes and an smp_wmb. No atomics and no spin lock/unlocks.
>
> Are you complaining because I retain the file_list?

Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin lock/unlock pair present in __dentry_open() (and the same sort of thing in the release path)?

And that is only about 5% of the code touched by the new special handling of the file operations structure (basically, every f_op access ends up being wrapped by two atomic ops and other extra code).

The question that I'd like to reiterate is: is this stuff really needed? Anyway, my complaint ends here and I'll let others evaluate whether merging this patchset is worth the cost.

- Davide
Davide Libenzi <davidel@xmailserver.org> writes:

> On Wed, 3 Jun 2009, Eric W. Biederman wrote:
>
>> What code are you talking about?
>>
>> To the open path I add a few memory writes and an smp_wmb. No atomics and no spin lock/unlocks.
>>
>> Are you complaining because I retain the file_list?
>
> Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin lock/unlock pair present in __dentry_open() (and the same sort of thing in the release path)?

You might be remembering v1. In v2 I have operations like file_hotplug_read_trylock that implement a lock but use an RCU-like algorithm, so there are no atomic operations involved, with their associated pipeline stalls. Over my previous version this gave a reasonable performance benefit.

> And that is only about 5% of the code touched by the new special handling of the file operations structure (basically, every f_op access ends up being wrapped by two atomic ops and other extra code).

Yes, there is a single extra wrapping of every file in the syscall path, so we know that someone is using it.

> The question that I'd like to reiterate is: is this stuff really needed? Anyway, my complaint ends here and I'll let others evaluate whether merging this patchset is worth the cost.

Sure. My apologies for not answering that question earlier.

My perspective is that every subsystem that winds up supporting hotplug hardware winds up rolling its own version of something like this, and they each have a different set of bugs. So one generic version is definitely worth implementing.

Similarly, there is a case for a generic revoke facility in the kernel. Alan at least has made the case that there are certain security problems that cannot be solved in userspace without revoke.

From an implementation point of view, doing the generic implementation at the VFS level has significant benefits. The extra locking appears reasonable from a code maintenance and comprehensibility point of view. It is a real pain to find all of the entry points into the VFS and to get other code to use the right VFS helpers it should always have been using, but I am volunteering to do that work.

The practical question I see is whether the performance overheads of my primitives are low enough that I do not cause performance regressions on anyone's fast path. As far as I have been able to measure, the performance overhead is low enough, because I have been able to avoid the use of atomics and to use fairly small code with predictable branches. Which is why I pressed you to be certain I understood where you are coming from.

Eric
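The file_hotplug_read_trylock()/file_hotplug_read_unlock() primitives themselves are defined in an earlier patch of the series and are not quoted in this thread. Purely to make the "RCU-like, no atomics on the reader side" claim concrete, here is one way such a reader side could look. Everything below is invented for illustration (the struct, its fields, and the single-struct argument; the real primitive in the diff below takes a struct file *); it is not Eric's actual implementation.

/* Hypothetical sketch only, not the code from the series. */
#include <linux/compiler.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct file_hotplug {	/* imagined as state embedded in struct file */
	int	dead;	/* set once when the backing object is revoked */
	int	*count;	/* per-cpu reader nesting, from alloc_percpu(int) */
};

static inline int file_hotplug_read_trylock(struct file_hotplug *fh)
{
	int cpu = get_cpu();			/* only disables preemption */

	if (unlikely(fh->dead)) {
		put_cpu();
		return 0;			/* caller substitutes -EIO / DEAD_POLLMASK */
	}
	(*per_cpu_ptr(fh->count, cpu))++;	/* plain increment, no LOCK prefix */
	put_cpu();
	return 1;
}

static inline void file_hotplug_read_unlock(struct file_hotplug *fh)
{
	int cpu = get_cpu();

	(*per_cpu_ptr(fh->count, cpu))--;	/* may be another cpu; only the sum matters */
	put_cpu();
}

/*
 * The revoke side would set fh->dead, wait for the preempt-disabled
 * sections to drain (synchronize_sched() or similar), and then wait for
 * the per-cpu counts to sum to zero before tearing the methods down.
 */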
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a89f370..eabb167 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -627,8 +627,13 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
 	struct epitem *epi, *tmp;
 
 	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
-		    epi->event.events)
+		int events = DEAD_POLLMASK;
+
+		if (file_hotplug_read_trylock(epi->ffd.file)) {
+			events = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+			file_hotplug_read_unlock(epi->ffd.file);
+		}
+		if (events & epi->event.events)
 			return POLLIN | POLLRDNORM;
 		else {
 			/*
@@ -1060,8 +1065,12 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 
 		list_del_init(&epi->rdllink);
 
-		revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
-			epi->event.events;
+		revents = DEAD_POLLMASK;
+		if (file_hotplug_read_trylock(epi->ffd.file)) {
+			revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+			file_hotplug_read_unlock(epi->ffd.file);
+		}
+		revents &= epi->event.events;
 
 		/*
 		 * If the event mask intersect the caller-requested one,
@@ -1248,10 +1257,17 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (!tfile)
 		goto error_fput;
 
+	error = -EIO;
+	if (!file_hotplug_read_trylock(file))
+		goto error_tgt_fput;
+
+	if (!file_hotplug_read_trylock(tfile))
+		goto error_file_unlock;
+
 	/* The target file descriptor must support poll */
 	error = -EPERM;
 	if (!tfile->f_op || !tfile->f_op->poll)
-		goto error_tgt_fput;
+		goto error_tgt_unlock;
 
 	/*
 	 * We have to check that the file structure underneath the file descriptor
@@ -1260,7 +1276,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	error = -EINVAL;
 	if (file == tfile || !is_file_epoll(file))
-		goto error_tgt_fput;
+		goto error_tgt_unlock;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1302,6 +1318,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	}
 	mutex_unlock(&ep->mtx);
 
+error_tgt_unlock:
+	file_hotplug_read_unlock(tfile);
+error_file_unlock:
+	file_hotplug_read_unlock(file);
 error_tgt_fput:
 	fput(tfile);
 error_fput:
@@ -1338,13 +1358,16 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 	if (!file)
 		goto error_return;
 
+	error = -EIO;
+	if (!file_hotplug_read_trylock(file))
+		goto error_fput;
 	/*
 	 * We have to check that the file structure underneath the fd
 	 * the user passed to us _is_ an eventpoll file.
 	 */
 	error = -EINVAL;
 	if (!is_file_epoll(file))
-		goto error_fput;
+		goto error_unlock;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1355,6 +1378,8 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
 
+error_unlock:
+	file_hotplug_read_unlock(file);
 error_fput:
 	fput(file);
 error_return:
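Taken together, the hunks above apply one pattern to every f_op dereference epoll performs: try to take the hotplug read lock, call the method if that succeeds, and otherwise substitute a failure value. Schematically, as a hypothetical helper that is not part of the patch (DEAD_POLLMASK and the file_hotplug_* primitives come from earlier patches in the series and are not defined in this mail):

/* Schematic restatement of the wrapping applied around ->poll(). */
static unsigned int ep_poll_file(struct file *file)
{
	unsigned int events = DEAD_POLLMASK;	/* result used for a revoked file */

	if (file_hotplug_read_trylock(file)) {	/* fails once the file is "dead" */
		events = file->f_op->poll(file, NULL);
		file_hotplug_read_unlock(file);
	}
	return events;	/* callers mask with epi->event.events as before */
}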