Message ID | 1828884A29C6694DAF28B7E6B8A82373AB04FB7F@ORSMSX109.amr.corp.intel.com (mailing list archive) |
---|---|
State | RFC |
Headers | show |
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- > owner@vger.kernel.org] On Behalf Of Hefty, Sean ... > > The following is a sketch, or outline, for an ioctl framework. > The purpose of the patch is to drive discussion and feedback on various design > decisions rather than low-level comments on the code, so that the changes can > be incorporated earlier in the development cycle. The code is entirely untested > and some functionality is not coded. > This approach focuses on facilities for arbitrary device-specific objects (which could basically fit GPUs, NVMe, whatever...), while our primary goals are to replace write() with ioctl() and strive for a common *rdma* object model. In contrast, the patchset [1], focuses on the task at hand. The emphasis is on a common framework for parsing and parameter validation. It follows the agreed-upon concepts such as a single ioctl request code for common objects, natural mapping of the existing object mode (and hence easier backporting), typed attributes for extending the objects to accommodate multiple vendor features, and a unique ioctl request code for vendor-specific commands. --Liran [1] http://permalink.gmane.org/gmane.linux.drivers.rdma/36984 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/24/2016 12:02 PM, Liran Liss wrote: >> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- >> owner@vger.kernel.org] On Behalf Of Hefty, Sean > ... >> >> The following is a sketch, or outline, for an ioctl framework. >> The purpose of the patch is to drive discussion and feedback on various design >> decisions rather than low-level comments on the code, so that the changes can >> be incorporated earlier in the development cycle. The code is entirely untested >> and some functionality is not coded. >> > > This approach focuses on facilities for arbitrary device-specific objects (which could basically fit GPUs, NVMe, whatever...), while our primary goals are to replace write() with ioctl() and strive for a common *rdma* object model. Please be precise when you use the word "our". In particular, I do not count myself among your "our". > In contrast, the patchset [1], focuses on the task at hand. The emphasis is on a common framework for parsing and parameter validation. It follows the agreed-upon Again, precision is needed here. Agreed upon by whom? Not I, and evidently not Sean either. > concepts such as a single ioctl request code for common objects, I'm not sure I agree with this as a priority. > natural mapping of the existing object mode (and hence easier backporting), Nor this. > typed attributes for extending the objects to accommodate multiple vendor features, and a unique ioctl request code for vendor-specific commands. > --Liran > > [1] http://permalink.gmane.org/gmane.linux.drivers.rdma/36984 While concalls to discuss these topics are handy, it is easy to miss objections or alternative priorities. So let me sum up how I see things. First, we had a security issue that needed addressed. We did so with a security check on the current ABI. During the discussions about that, I specifically told Linus we wanted to address it that way for now so that we could work on getting the verbs 2.0 API that addresses some of the shortcomings of the current API out the door, but do it right and not rush things. That verbs 2.0 API would fix the security issue up completely where the security checks are no longer needed. What we have today is a proposal to basically take the verbs 1.0 API and convert it from write() to ioctl() with as little change as possible to the API. This is what the patchset you referenced is all about, and what you and Jason have been talking about. Allow me to be clear: this is not verbs 2.0 API. This is verbs 1.0 over ioctl. Nothing more. And it does nothing to fix up the shortcomings of the verbs 1.0 API that are playing out right now in things like the timestamp pacthset. Sean's proposal does away with the rigid nature of the current verbs 1.0 API and makes it much more flexible on a per-driver basis. This doesn't address the end user API issues, but it at least cleans up the user driver <-> kernel driver API so that one vendor's driver is not forced to carry around all the baggage needed for every other vendor's drivers. Under that model, each vendor only carries what they need. It would then be libibverbs responsibility to take that driver specific data and present it to the user in a generic way. The mechanism of that is not yet known, but thinking about it is what prompted my email in the "Further thoughts on uAPI" discussion, the one where I made the possible wc processing example. This is why I don't think it's possible to think about the verbs 2.0 API and not also the user visible libibverbs API at the same time. If all we want to talk about is verbs 1.0 over ioctl, then yes, we can do that. But not if we truly want to discuss a verbs 2.0 API. And I have yet to gather from the discussions I hear from people that we are in fact decided on pursuing a verbs 1.0 over ioctl API instead of considering a verbs 2.0 API. So please don't present things as though they are settled. As far as I can tell, they very much are not.
On Tue, May 24, 2016 at 01:57:54PM -0400, Doug Ledford wrote: > Sean's proposal does away with the rigid nature of the current verbs 1.0 > API and makes it much more flexible on a per-driver basis. This doesn't > address the end user API issues, but it at least cleans up the user > driver <-> kernel driver API so that one vendor's driver is not forced > to carry around all the baggage needed for every other vendor's drivers. I'm not sure what you are reading, but to me both proposals look very similar. They are both based on the generic object/action/method sort of model I talked about in an earlier thread. The main differences seem to boil down to the data marshalling and the dispatching style for the kernel side.. Sean hasn't explored how to encode the actual method arguments, while Mellanox's has a fairly well developed scheme with the netlink encoding and sgl result list thingy. > Under that model, each vendor only carries what they need. It would > then be libibverbs responsibility to take that driver specific data > and Either patchset could go in this direction. This is a basic question we need to decide on. I'm starting to think the basic thrust of the Mellanox series (provide an easy compatability with our userspace) is a sane approach. The other options look like far too much work to use as a starting point. That doesn't mean we can't decide to move in a direct-only direction - the uAPI needs to have enough extension points to allow for that. Such work should happen incrementally, and mainly target new uAPIs. > and not also the user visible libibverbs API at the same time. If all > we want to talk about is verbs 1.0 over ioctl, then yes, we can do that. > But not if we truly want to discuss a verbs 2.0 API. And I have yet to > gather from the discussions I hear from people that we are in fact > decided on pursuing a verbs 1.0 over ioctl API instead of considering a > verbs 2.0 API. You are the only person I've heard who wants to restructure the libibverbs interface at the same time.. When I kicked off this process at the OFA meeting the discussions revolved around some very clear goals: 1) Comprehensively close the security issue 2) Provide 100% support for existing libibverbs users on the current libibverbs API 3) Restructure obvious known weaknesses (eg IB specific areas) 4) Add additional extension points I feel like both patches are still going down the above path.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Under that model, each vendor only carries what they need. It would > > then be libibverbs responsibility to take that driver specific data > > and > > Either patchset could go in this direction. This is a basic question > we need to decide on. > > I'm starting to think the basic thrust of the Mellanox series (provide > an easy compatability with our userspace) is a sane approach. The > other options look like far too much work to use as a starting point. What matters is whether the work needs to be done anyway, and how can we get to the end design. I don’t see a time crunch to force switching to ioctls. > You are the only person I've heard who wants to restructure the > libibverbs interface at the same time.. > > When I kicked off this process at the OFA meeting the discussions > revolved around some very clear goals: > 1) Comprehensively close the security issue > 2) Provide 100% support for existing libibverbs users on the current > libibverbs API > 3) Restructure obvious known weaknesses (eg IB specific areas) > 4) Add additional extension points I still don't think we even have agreement on what this interface should be. There is not a single user space library that's being targeted here. If the only goal is to support libibverbs, then we need another option for devices that are not verbs based, but expose hardware specific structures to user space. This is why you see objections to putting this under uverbs. Non-verbs use cases keep getting completely overlooked in the conversation. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 24, 2016 at 10:38:58PM +0000, Hefty, Sean wrote: > > I'm starting to think the basic thrust of the Mellanox series (provide > > an easy compatability with our userspace) is a sane approach. The > > other options look like far too much work to use as a starting point. > > What matters is whether the work needs to be done anyway, and how > can we get to the end design. I don?t see a time crunch to force > switching to ioctls. Well, it doesn't need to be done, the verbs calls we've already coded do not *need* to be recoded - except due to the security problem. > > You are the only person I've heard who wants to restructure the > > libibverbs interface at the same time.. > > > > When I kicked off this process at the OFA meeting the discussions > > revolved around some very clear goals: > > 1) Comprehensively close the security issue > > 2) Provide 100% support for existing libibverbs users on the current > > libibverbs API > > 3) Restructure obvious known weaknesses (eg IB specific areas) > > 4) Add additional extension points > > I still don't think we even have agreement on what this interface > should be. There is not a single user space library that's being > targeted here. As I outlined, the genesis to do this was to migrate libibverbs to an ioctl interface and in the process set a better stage for all the other things people want to do. That includes better supporting non-libibverbs libraries (dpdk, libfabric, whatever) The purpose of the object/method style framework was to better set the stage for the core code to provide that kind of wider interface for the drivers. A major family of objects/methods is going to be the libibverbs 1.0 support layer which is mandatory to port libibverbs to this interface. The driver specific and any future 'non-verbs' common objects/methods have little concrete to bring to the table right now. The cdev stuff in hfi1/qib is very small and easilly handled by the bypass-to-driver path that both patches include. > hardware specific structures to user space. This is why you see > objections to putting this under uverbs. Non-verbs use cases keep > getting completely overlooked in the conversation. I disagree, non-verbs is not a central topic of conversation because nothing beyond needing a bypass-to-driver channel has been brought up by that community. That need has clearly been met in both patch series. Further, the basic things presented by both series so far are still general enough to allow for new common uAPIs that are unrelated to the verbs family. I'm not sure what else can be done until the non-verbs universe makes a consensus on what common kernel support it needs??? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > Sent: Wednesday, May 25, 2016 12:42 AM > > Sean's proposal does away with the rigid nature of the current verbs > > 1.0 API and makes it much more flexible on a per-driver basis. This > > doesn't address the end user API issues, but it at least cleans up the > > user driver <-> kernel driver API so that one vendor's driver is not > > forced to carry around all the baggage needed for every other vendor's > drivers. > > I'm not sure what you are reading, but to me both proposals look very similar. > They are both based on the generic object/action/method sort of model I talked > about in an earlier thread. > > The main differences seem to boil down to the data marshalling and the > dispatching style for the kernel side.. > > Sean hasn't explored how to encode the actual method arguments, while > Mellanox's has a fairly well developed scheme with the netlink encoding and sgl > result list thingy. > BTW, we can easily incorporate Sean's facility for mapping user-->kernel objects as helper function for method handlers in the proposed patches. > > Under that model, each vendor only carries what they need. It would > > then be libibverbs responsibility to take that driver specific data > > and > > Either patchset could go in this direction. This is a basic question we need to > decide on. > > I'm starting to think the basic thrust of the Mellanox series (provide an easy > compatability with our userspace) is a sane approach. The other options look > like far too much work to use as a starting point. > > That doesn't mean we can't decide to move in a direct-only direction - the uAPI > needs to have enough extension points to allow for that. Such work should > happen incrementally, and mainly target new uAPIs. > Right. The proposed patches go a long way in enabling extensibility. Typed parameters allow you to extend/replace/omit object properties as the interface evolves. New object types allow you to introduce new concepts. And the vendor channel can be used for arbitrary communication between a provider's user/kernel drivers. We are not rushing anything. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> As I outlined, the genesis to do this was to migrate libibverbs to an > ioctl interface and in the process set a better stage for all the > other things people want to do. That includes better supporting > non-libibverbs libraries (dpdk, libfabric, whatever) Then create ioctl 1 - direct mapping of the write interface Ioctl 2-255 - extensions This requires the least amount of work, supports what's needed, and is extensible. > The purpose of the object/method style framework was to better set the > stage for the core code to provide that kind of wider interface for > the drivers. I tried applying this to the rdma cm. It's a very poor fit. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 25, 2016 at 02:59:41PM +0000, Hefty, Sean wrote: > > As I outlined, the genesis to do this was to migrate libibverbs to an > > ioctl interface and in the process set a better stage for all the > > other things people want to do. That includes better supporting > > non-libibverbs libraries (dpdk, libfabric, whatever) > > Then create ioctl 1 - direct mapping of the write interface > Ioctl 2-255 - extensions That is an option.. > > The purpose of the object/method style framework was to better set the > > stage for the core code to provide that kind of wider interface for > > the drivers. > > I tried applying this to the rdma cm. It's a very poor fit. I'm very surprised by this, what do we have to do for rdma cm then? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/24/2016 5:41 PM, Jason Gunthorpe wrote: > On Tue, May 24, 2016 at 01:57:54PM -0400, Doug Ledford wrote: > >> Sean's proposal does away with the rigid nature of the current verbs 1.0 >> API and makes it much more flexible on a per-driver basis. This doesn't >> address the end user API issues, but it at least cleans up the user >> driver <-> kernel driver API so that one vendor's driver is not forced >> to carry around all the baggage needed for every other vendor's drivers. > > I'm not sure what you are reading, but to me both proposals look very > similar. They are both based on the generic object/action/method sort > of model I talked about in an earlier thread. They are similar in initial expression, but not in intent. Sean's is not concerned about preserving struct ib_qp (just as an example) as it stands, while Mellanox's patchset is all about passing around the same objects via a different interface. Even though they encode objects in netlink attributes, they are still expected to be the same basic objects we use today (and this is how they minimize the driver impact). > The main differences seem to boil down to the data marshalling and the > dispatching style for the kernel side.. Data marshalling in Sean's case also entails data content changes with a modest reorganization of what it entails for an item to be a core item (Sean can correct me if I'm wrong here). > Sean hasn't explored how to encode the actual method arguments, while > Mellanox's has a fairly well developed scheme with the netlink > encoding and sgl result list thingy. You are correct that Sean's patch has very little in the way of argument validation. However, I'm not entirely sure that Sean intended the core to do that sort of validation, he may have intended the drivers to do their own on the passed through data. The Mellanox patches do so, but at the expense of netlink which many people on this list find painful to read. >> Under that model, each vendor only carries what they need. It would >> then be libibverbs responsibility to take that driver specific data >> and > > Either patchset could go in this direction. This is a basic question > we need to decide on. And this is my central point, that I tried to make in my previous email: There are multiple trains of thought on where this will end up, and simply switching from write to ioctl is only part of the overall big picture. There should only be one API break in this entire process, so we need to make sure that any other possible API breakers are included in the initial change to the ioctl interface. > I'm starting to think the basic thrust of the Mellanox series (provide > an easy compatability with our userspace) is a sane approach. The > other options look like far too much work to use as a starting point. I could not care less about this argument. When you have to break an API, you do what you have to do to do the job right. Doing things the right way may turn out to be the easy way, but the argument would be because it's the right thing to do, not the easy thing to do. > That doesn't mean we can't decide to move in a direct-only direction - > the uAPI needs to have enough extension points to allow for that. Such > work should happen incrementally, and mainly target new uAPIs. This is arguable. If we know we want to go basically direct only in the future, then preserving the existing layer in the ioctl API eventually becomes a burden. It would be better to go direct only from the beginning. This needs to be settled. >> and not also the user visible libibverbs API at the same time. If all >> we want to talk about is verbs 1.0 over ioctl, then yes, we can do that. >> But not if we truly want to discuss a verbs 2.0 API. And I have yet to >> gather from the discussions I hear from people that we are in fact >> decided on pursuing a verbs 1.0 over ioctl API instead of considering a >> verbs 2.0 API. > > You are the only person I've heard who wants to restructure the > libibverbs interface at the same time.. That's not entirely true. My vision in my head for how we might start altering the libibverbs interface is already being done (although with a slightly different implementation than I had in mind) in the timestamp patches. What I want to see us do in libibverbs is to make extensions start following a new pattern instead of the one they have traditionally followed. But the reason I bring this up is because we need to be thinking of the end to end data transfers when we are thinking about the API break and any changes we might make. I'm thinking about possible changes in libibverbs, Sean is thinking about libfabrics/psm2/hfi1. If we end up just doing a behind the scenes switch from write to ioctl with no changing of data structures or command flow or anything else, then we can ignore the end to end picture because it won't change significantly. But if we do other things too, then I want other people to keep things like this in mind, it's fundamental to good design in a case like this.
On Wed, May 25, 2016 at 02:06:38PM -0400, Doug Ledford wrote: > >> Sean's proposal does away with the rigid nature of the current verbs 1.0 > >> API and makes it much more flexible on a per-driver basis. This doesn't > >> address the end user API issues, but it at least cleans up the user > >> driver <-> kernel driver API so that one vendor's driver is not forced > >> to carry around all the baggage needed for every other vendor's drivers. > > > > I'm not sure what you are reading, but to me both proposals look very > > similar. They are both based on the generic object/action/method sort > > of model I talked about in an earlier thread. > They are similar in initial expression, but not in intent. Sean's is > not concerned about preserving struct ib_qp (just as an example) as it > stands, while Mellanox's patchset is all about passing around the same > objects via a different interface. Even though they encode objects in > netlink attributes, they are still expected to be the same basic objects > we use today (and this is how they minimize the driver impact). Again, I think you are reading way too much into both patches. Until we go down the object by object patches it is too early to say how exactly things will be translated in either case. Eg Sean's comments about dis-aggregating the event_fd from the context in the Mellanox series is a good example of the kind of discussion and improvements I'm expecting to see on an object-by-object basis. But today I view both series as primarily exploring the dispatch and marshalling of the function calls - which is one of the first questions to settle. > their own on the passed through data. The Mellanox patches do so, but > at the expense of netlink which many people on this list find painful to > read. Well, I'm not sold on the netlink idea either, and I posted an alternative already. If you see another alternative you should describe it. > There are multiple trains of thought on where this will end up, and > simply switching from write to ioctl is only part of the overall big > picture. There should only be one API break in this entire process, so > we need to make sure that any other possible API breakers are included > in the initial change to the ioctl interface. To be very clear, an *API* break is not tolerable. The libibverbs 1.0 API must still be 100% functional on the new interface. We are talking about changing the *ABI* that transports that API. There is a lot of latitude there, but at the end of the day all the same objects must still exist in some form. If people have ideas for different future APIs then the best approach is to make room for them in the ABI. Sean's discussion on the event stuff is a prime example of that process. eg we can make room in the ABI for a different async event model by dis-aggregating things at the ABI level. The old API still remains preserved by that kind of dis-aggregation. If someone wants to suggest a different starting point for this than libibverbs 1.0 then they had better do the work and concretely describe that soon. Otherwise I see no better option than to go object-by-object through the libibverbs 1.0 API and collect comments. Obvious existing avenues to source comments from are better support for OPA, iWarp, rocee, dpdk and libfabric. > > I'm starting to think the basic thrust of the Mellanox series (provide > > an easy compatability with our userspace) is a sane approach. The > > other options look like far too much work to use as a starting point. > > I could not care less about this argument. When you have to break an > API, you do what you have to do to do the job right. Doing things the > right way may turn out to be the easy way, but the argument would be > because it's the right thing to do, not the easy thing to do. Usually giant monster changes result in failure, it is a bad software engineering practice. Substantially recoding every driver and every provider seems like it is too big a task to complete to me. Just preserving the existing write interface structure-by-structure also seems like a bad idea because we already know it has an inadaquate ABI for current drivers. > This is arguable. If we know we want to go basically direct only in the > future, then preserving the existing layer in the ioctl API eventually > becomes a burden. It would be better to go direct only from the > beginning. This needs to be settled. It would be like any other migration, if we eventually reach a point where the kernel fully supports something like direct-only for all kernel drivers then we can talk about an obsolescence path for the other interface. But, it also isn't really clear if direct-only is even a good idea, I would be really interested to see some patches exploring what that would look like. I'm still thinking of a hybrid common & direct approach as the leading option ... > If we end up just doing a behind the scenes switch from write to ioctl > with no changing of data structures or command flow or anything else, > then we can ignore the end to end picture because it won't change > significantly. You should describe what specific things you want to see. It is time for the people who are hand waving about non-verbs, non-qp, 'new data transfers' to sit down and specify what they want to see out of the kernel uABI. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/25/2016 03:00 PM, Jason Gunthorpe wrote: > On Wed, May 25, 2016 at 02:06:38PM -0400, Doug Ledford wrote: > > To be very clear, an *API* break is not tolerable. The libibverbs 1.0 > API must still be 100% functional on the new interface. You're drawing a distinction that doesn't exist. Libibverbs presents both an API and an ABI to user space applications, and the kernel presents both an API and ABI to libibverbs (or libfabric or whatever chooses to use it). Changing from write to ioctl changes the kernel to verbs API, and creates a new ABI, and possibly removes an old ABI if we dropped the write interface. The bare term API is not deterministic. >> If we end up just doing a behind the scenes switch from write to ioctl >> with no changing of data structures or command flow or anything else, >> then we can ignore the end to end picture because it won't change >> significantly. > > You should describe what specific things you want to see. > > It is time for the people who are hand waving about non-verbs, non-qp, > 'new data transfers' to sit down and specify what they want to see out > of the kernel uABI. This is a fair point, but one I don't have time to respond to at the moment. I've already spent more time on this topic this week than I should have given the other tasks I have at hand with the merge window closing in a few days.
> It is time for the people who are hand waving about non-verbs, non-qp, > 'new data transfers' to sit down and specify what they want to see out > of the kernel uABI. I want the core ioctl framework to do this: 1. Map user objects to/from the kernel to validate pointers. 2. Provide reference counting on such objects, in order to protect against the user attempting to destroy objects that are in use. 3. Provide protection that the state of an object is not driven in two different directions. 4. Safely process the removal of the underlying driver or device. 5. Support driver specific operations to the kernel. 6. Report events from multiple sources to a single file descriptor. 7. Minimize the kernel footprint for allocated kernel objects. 8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability. 9. Copy ioctl input/output data to/from the kernel. If designed correctly, I believe that the ioctl core will be suitably generic to support uverbs, ucm (if kept), urdma cm, umad (if there's a reason to add new ioctls), and any drivers. Honestly, there's very little that would stand in the way of future subsystems (e.g. accelerators like Xeon Phi, maybe nvm) adopting it, if they so choose, though I'm not suggesting that be a goal. Even limiting the discussion to short term needs, the rdma cm requires the same update. The same ioctl core should work there as well. Otherwise, it's a very good indication that we will not meet the needs of whatever future drivers are thrown to linux-rdma, except to say use a driver specific channel and re-implement the above functionality. I would go as far as mixing rdma cm, verbs, and driver ioctls on the same fd (driven by shared event reporting), and that's a different file structure than what we have today. IMO, the format of the ioctl data is a much lower concern and not something the framework should be concerned about. From the above requirements, the framework needs to understand specific types of operations: creation, close, and ones that require exclusive access. Event handling requires more thought. I started re-working my patch to address the rdma cm. The original proposal wasn't generic enough and provided a poor fit, because the rdma cm has a single object and 25 versions of 'modify'. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 25, 2016 at 07:59:29PM +0000, Hefty, Sean wrote: > > It is time for the people who are hand waving about non-verbs, non-qp, > > 'new data transfers' to sit down and specify what they want to see out > > of the kernel uABI. > > I want the core ioctl framework to do this: > > 1. Map user objects to/from the kernel to validate pointers. > 2. Provide reference counting on such objects, in order to protect against the user attempting to destroy objects that are in use. > 3. Provide protection that the state of an object is not driven in two different directions. > 4. Safely process the removal of the underlying driver or device. > 5. Support driver specific operations to the kernel. > 6. Report events from multiple sources to a single file descriptor. > 7. Minimize the kernel footprint for allocated kernel objects. > 8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability. > 9. Copy ioctl input/output data to/from the kernel. I agree with all of this as a suitable goal for the common dispatch/marshal framework. I like the idea of the generic uobject stuff in your patches. #6 seems 'not like the others' - seems like a reasonable concrete thing to consider as we build the objects on the framework. [understanding that compat operation is still required] > I started re-working my patch to address the rdma cm. The original > proposal wasn't generic enough and provided a poor fit, because the > rdma cm has a single object and 25 versions of 'modify'. I was imagining that all objects would be able to have a set of object specific functions beyond the common set. IIRC some verbs objects will need this as well, and I expect drivers to want it. This is why I wasn't thinking of using the ioctl # to dispatch the method. It is best if the driver specific calls have globally unique identifiers of some sort to protect against calling a method on the wrong driver. I would encourage you to go as far as Mellanox's series does and demonstrate the verbs context object in this framework ... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > I want the core ioctl framework to do this: > > > > 1. Map user objects to/from the kernel to validate pointers. > > 2. Provide reference counting on such objects, in order to protect against the user attempting to > destroy objects that are in use. > > 3. Provide protection that the state of an object is not driven in two different directions. > > 4. Safely process the removal of the underlying driver or device. > > 5. Support driver specific operations to the kernel. > > 6. Report events from multiple sources to a single file descriptor. > > 7. Minimize the kernel footprint for allocated kernel objects. > > 8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability. > > 9. Copy ioctl input/output data to/from the kernel. > > I agree with all of this as a suitable goal for the common > dispatch/marshal framework. > > I like the idea of the generic uobject stuff in your patches. > > #6 seems 'not like the others' - seems like a reasonable concrete > thing to consider as we build the objects on the > framework. [understanding that compat operation is still required] I agree. Handling event reporting in general is slightly different from the other objections. But it ties in with object close processing, so I think the framework needs something here. #6 is based on examining the application APIs and driving requirements down. I.e. poll a single fd to get all events: CQ notifications, async events, CM disconnect events, etc. And while I'm thinking about it, expand events to support user specified. This would allow many apps to rid themselves of socketpairs which currently fill that role. > This is why I wasn't thinking of using the ioctl # to dispatch the > method. It is best if the driver specific calls have globally unique > identifiers of some sort to protect against calling a method on the > wrong driver. I thought of this as well. I'll see if I can fit it into my next revision. > I would encourage you to go as far as Mellanox's series does and > demonstrate the verbs context object in this framework ... I'm planning on going as far as issuing rdma cm commands, though through a 'quick' method that leaves the existing kernel code mostly intact. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 25, 2016 at 09:46:16PM +0000, Hefty, Sean wrote: > > #6 seems 'not like the others' - seems like a reasonable concrete > > thing to consider as we build the objects on the > > framework. [understanding that compat operation is still required] > > I agree. Handling event reporting in general is slightly different > from the other objections. But it ties in with object close > processing, so I think the framework needs something here. Yes, some kind of core framework seem essential here, and funneling all events seems reasonable if we can figure out how. > #6 is based on examining the application APIs and driving > requirements down. I.e. poll a single fd to get all events: CQ > notifications, async events, CM disconnect events, etc. And while > I'm thinking about it, expand events to support user specified. > This would allow many apps to rid themselves of socketpairs which > currently fill that role. socketpairs? I always use eventfds for that kind of stuff.. Are you hearing people object to number of fds? I've never thought of it as such a big deal, epoll makes it very painless to deal with.. Do people have scaling issues? What would be nice is having strong synchronization in the event stream, eg between cm (dis)connection and related cq events so those unexpected races go away... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Are you hearing people object to number of fds? I've never thought of > it as such a big deal, epoll makes it very painless to deal with.. Do > people have scaling issues? This was an item brought up in the ofiwg requirement gathering phase. Epoll is nice, and it's what libfabric uses internally to hide multiple fd's, but there are still more resources involved, plus additional overhead processing the events. Exascale is one big scaling issue, and large core count systems exacerbate the problems. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > I want the core ioctl framework to do this: > > > > > > 1. Map user objects to/from the kernel to validate pointers. > > > 2. Provide reference counting on such objects, in order to protect against the > user attempting to > > destroy objects that are in use. > > > 3. Provide protection that the state of an object is not driven in two different > directions. > > > 4. Safely process the removal of the underlying driver or device. > > > 5. Support driver specific operations to the kernel. > > > 6. Report events from multiple sources to a single file descriptor. > > > 7. Minimize the kernel footprint for allocated kernel objects. > > > 8. Minimize the code path needed to process any ioctl, without greatly affecting > maintainability. > > > 9. Copy ioctl input/output data to/from the kernel. > > > > I agree with all of this as a suitable goal for the common > > dispatch/marshal framework. > > > > I like the idea of the generic uobject stuff in your patches. > > > > #6 seems 'not like the others' - seems like a reasonable concrete > > thing to consider as we build the objects on the > > framework. [understanding that compat operation is still required] > > I agree. Handling event reporting in general is slightly different from the other > objections. But it ties in with object close processing, so I think the framework > needs something here. > > #6 is based on examining the application APIs and driving requirements down. I.e. > poll a single fd to get all events: CQ notifications, async events, CM disconnect > events, etc. And while I'm thinking about it, expand events to support user > specified. This would allow many apps to rid themselves of socketpairs which > currently fill that role. > Applications I have written that block typically put the CQ, async, rdmacm fds all into an poll/epoll group and block that way. But a common mechanism for events might end up cleaner and simpler... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > Sent: Wednesday, May 25, 2016 11:52 PM > > > > I want the core ioctl framework to do this: > > > > 1. Map user objects to/from the kernel to validate pointers. > > 2. Provide reference counting on such objects, in order to protect against the > user attempting to destroy objects that are in use. > > 3. Provide protection that the state of an object is not driven in two different > directions. > > 4. Safely process the removal of the underlying driver or device. > > 5. Support driver specific operations to the kernel. > > 6. Report events from multiple sources to a single file descriptor. > > 7. Minimize the kernel footprint for allocated kernel objects. > > 8. Minimize the code path needed to process any ioctl, without greatly > affecting maintainability. > > 9. Copy ioctl input/output data to/from the kernel. > > I agree with all of this as a suitable goal for the common dispatch/marshal > framework. > The goals are OK. However, I think that processing object locking before validation is not secure. A malicious user could easily cause a deadlock due to ordering. Rather, IDR processing should be offered as helper functions to method handlers *after* parameter validation. This also removes the need to pass arrays of objects separately from the command attributes, which looks cumbersome. I hope that we can show this in our next revision. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 26, 2016 at 06:07:02PM +0000, Liran Liss wrote: > However, I think that processing object locking before validation is not secure. > A malicious user could easily cause a deadlock due to ordering. Hum, interesting point. I wonder if we already have bugs there today? I'd say that is actually more likely to happen if it is open coded. Nested object locking like that requires consistent ordering to avoid deadlock.. It would be easier for the core code to enforce consistent nested lock acquire ordering than to try and do that scattered across all the code. You are right, the core code cannot just iterate over the provided object list and grab the locks. One very simple option is for the core to enforce a sort order on the object list as it grabs the locks and fail&unwind for unordered lists. Parameter validation doesn't really factor into this problem.... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I'd say that is actually more likely to happen if it is open > coded. Nested object locking like that requires consistent ordering to > avoid deadlock.. > > It would be easier for the core code to enforce consistent nested lock > acquire ordering than to try and do that scattered across all the > code. You are right, the core code cannot just iterate over the > provided object list and grab the locks. > > One very simple option is for the core to enforce a sort order on the > object list as it grabs the locks and fail&unwind for unordered lists. I did not have object locking in my proposal. I tried to maintain very simple locking, hoping it would make it possible for a mere human to validate that the locking was correct. Plus it kept the size of the object structure smaller. A lock associated with the idr is held while objects are acquired. A simple reference count is bumped for each object when used. If exclusive access is required, a flag is set. If the exclusive flag is set, then any attempt to acquire the object is rejected and the ioctl fails. Coordination with the removal of the kernel client (i.e. device) is done using a rw-lock. Ioctls acquire a read lock, while the removal processing acquires the write lock. Maybe there's a reason to create a more complex scheme than this, similar to what uverbs has, but AFAICT, this level of synchronization seems sufficient. Any lower-level locking is the responsibility of the kernel client. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 27, 2016 at 12:22:56AM +0000, Hefty, Sean wrote: > A lock associated with the idr is held while objects are acquired. > A simple reference count is bumped for each object when used. If > exclusive access is required, a flag is set. If the exclusive flag > is set, then any attempt to acquire the object is rejected and the > ioctl fails. How is that different from your basic idr+kref locking scheme common in the kernel? What is a non-blocking exclusive useful for? Removal? > Coordination with the removal of the kernel client (i.e. device) is > done using a rw-lock. Ioctls acquire a read lock, while the removal > processing acquires the write lock. uverbs ended up using a more complex srcu scheme for this citing performance reasons.. I argued for the simpler rwlock at the time.. > synchronization seems sufficient. Any lower-level locking is the > responsibility of the kernel client. So there are still object locks, but the core doesn't help with them at all. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > A lock associated with the idr is held while objects are acquired. > > A simple reference count is bumped for each object when used. If > > exclusive access is required, a flag is set. If the exclusive flag > > is set, then any attempt to acquire the object is rejected and the > > ioctl fails. > > How is that different from your basic idr+kref locking scheme common > in the kernel? I don't view it as significantly different. A kref is just an atomic anyway. > What is a non-blocking exclusive useful for? Removal? Removal and possibly modify calls. Note that I only apply the exclusive flag to the first object in the list, so some ordering of what the objects are is required. (I didn't expect clients to accept arbitrary ordering anyway.) We can debate how the ioctl framework should serialize ioctl processing, versus pushing that functionality either into user space or the ioctl handlers. My proposal serialized against device removal only, otherwise it failed the call if exclusive access was required. This supports pushing the serialization either up or down, but it's outside the framework. > > Coordination with the removal of the kernel client (i.e. device) is > > done using a rw-lock. Ioctls acquire a read lock, while the removal > > processing acquires the write lock. > > uverbs ended up using a more complex srcu scheme for this citing > performance reasons.. I argued for the simpler rwlock at the time.. I guess I'm re-arguing for a simpler rwlock then. :) In any case, the actual locking mechanism can change. > > synchronization seems sufficient. Any lower-level locking is the > > responsibility of the kernel client. > > So there are still object locks, but the core doesn't help with them > at all. Moving this into a framework which doesn't understand what the objects actually are would be difficult. I can see this being done by a uverbs core, which is one of the users of the ioctl framework. The ucma is another user of this code. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index f818538..d70ba32 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -34,4 +34,4 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o urdma.o diff --git a/include/uapi/rdma/rdma_ioctl.h b/include/uapi/rdma/rdma_ioctl.h new file mode 100644 index 0000000..3e34791 --- /dev/null +++ b/include/uapi/rdma/rdma_ioctl.h @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2016 Intel Corporation, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef RDMA_IOCTL_H +#define RDMA_IOCTL_H + +#include <linux/types.h> +#include <rdma/ib_user_verbs.h> +#include <linux/ioctl.h> + + +/* ioctls are grouped into 1 of 8 domains/namespaces +#define URDMA_DOMAIN(nr) (_IOC_NR(nr) >> 5) +enum { + URDMA_DOMAIN_OBJECT, + URDMA_DOMAIN_DRIVER, + URDMA_MAX_DOMAIN +}; +*/ + +#define URDMA_OP_MASK 0x7F +#define URDMA_OP(nr) (_IOC_NR(nr) & URDMA_OP_MASK) + +/* operations */ +enum { + URDMA_QUERY, + URDMA_OPEN, + URDMA_CLOSE, + URDMA_MODIFY, + URDMA_READ, + URDMA_WRITE, + URDMA_MAX_OP +}; + +/* driver specific object operations set the high-order op bit */ +#define URDMA_DRIVER_OP (0x80) + +/* operation domains, doubles as object types */ +enum { + URDMA_DRIVER, /* is this usable? */ + URDMA_DEVICE, + URDMA_PORT, + URDMA_CQ, + URDMA_PD, + URDMA_AH, + URDMA_MR, + URDMA_SHARED_RX, + URDMA_SHARED_TX, + URDMA_QP, + URDMA_CMD_CTX, + /* others... */ + URDMA_MAX_DOMAIN +}; + +/* driver specific domains set the high-order domain bit */ +#define URDMA_DRIVER_DOMAIN (1 << 16) + +struct urdma_obj_id { + u32 instance_id; + u16 obj_type; + u16 resv; +}; + +/* ensure that data beyond header starts at 64-byte alignment */ +struct urdma_ioctl { + u8 version; + u8 count; + u16 domain; + u16 length; + u16 resv; + u64 flags; + union { + struct urdma_obj_id obj_id[0]; + u64 data[0]; +#ifdef __KERNEL__ + void *obj[0]; +#endif + }; +}; + + +#define URDMA_TYPE 0xda +#define URDMA_IO(op) _IO(URDMA_TYPE, op) +#define URDMA_IOR(op, type) _IOR(URDMA_TYPE, op, type) +#define URDMA_IOW(op, type) _IOW(URDMA_TYPE, op, type) +#define URDMA_IOWR(op, type) _IOWR(URDMA_TYPE, op, type) + +#define URDMA_DRIVER_CMD(op) (op | URDMA_DRIVER_OP) +#define URDMA_DRIVER_IO(op) URDMA_IO(URDMA_DRIVER_CMD(op)) +#define URDMA_DRIVER_IOR(op, type) URDMA_IOR(URDMA_DRIVER_CMD(op), type) +#define URDMA_DRIVER_IOW(op, type) URDMA_IOW(URDMA_DRIVER_CMD(op), type) +#define URDMA_DRIVER_IOWR(op, type) URDMA_IOWR(URDMA_DRIVER_CMD(op), type) + +#define URDMA_IOCTL(op) URDMA_IOWR(URDMA_##op, struct urdma_ioctl) + +#define URDMA_IOCTL_QUERY URDMA_IOCTL(QUERY) +#define URDMA_IOCTL_OPEN URDMA_IOCTL(OPEN) +#define URDMA_IOCTL_CLOSE URDMA_IOCTL(CLOSE) +#define URDMA_IOCTL_MODIFY URDMA_IOCTL(MODIFY) +#define URDMA_IOCTL_READ URDMA_IOCTL(READ) +#define URDMA_IOCTL_WRITE URDMA_IOCTL(WRITE) + + +#endif /* RDMA_IOCTL_H */ + diff --git a/include/rdma/rdma_uapi.h b/include/rdma/rdma_uapi.h new file mode 100644 index 0000000..8b5105e --- /dev/null +++ b/include/rdma/rdma_uapi.h @@ -0,0 +1,140 @@ +/* + * Copyright (c) 2016 Intel Corporation, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef RDMA_UAPI_H +#define RDMA_UAPI_H + +#include <linux/types.h> +#include <rdma/ib_user_verbs.h> +#include <uapi/rdma/rdma_ioctl.h> + + +#define URDMA_OFFSET(dom, op) (dom * URDMA_MAX_OP + op) +#define URDMA_MAX_BASE (URDMA_MAX_DOMAIN * URDMA_MAX_OP - 1) +#define URDMA_DRIVER_OFFSET(op) (op) + +/* Object and control flags */ +/* Operation on object requires exclusive access - e.g. MODIFY */ +#define URDMA_EXCL (1 << 0) +/* Events may be generated for the given object - e.g. CQ, QP */ +#define URDMA_EVENT (1 << 1) +/* Device resources have been freed */ +#define URDMA_CLOSED (1 << 2) + +struct urdma_device; + +typedef long (*urdma_ioctl_handler_t)(struct urdma_device *dev, + void *data, void *file_data); + +/* The purpose of this structure is to guide the behavior of the + * common ioctl processing code. + */ +struct urdma_ioctl_desc { + u64 flags; + unsigned int cmd; + u16 length; /* max size needed */ + urdma_ioctl_handler_t func; + const char *name; +}; + +typedef long (*urdma_ioctl_hook_t)(struct urdma_device *dev, + struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, + void *file_data); + +#define URDMA_DESC(_dom, _op, _func, _flags) \ + [URDMA_OFFSET(URDMA_##_dom, URDMA_##_op)] = { \ + .flags = _flags, \ + .cmd = URDMA_IOCTL_##_op, \ + .func = _func, \ + .name = #_dom "_" #_op \ + } + +#define URDMA_DRIVER_DESC(_dom, _op, _func, _flags) \ + [URDMA_DRIVER_OFFSET(URDMA_##_dom, URDMA_##_op)] = { \ + .flags = _flags, \ + .cmd = URDMA_IOCTL_##_op, \ + .func = _func, \ + .name = #_dom "_" #_op \ + } + +extern const struct urdma_ioctl_desc verbs_ioctl[URDMA_MAX_BASE]; + +struct urdma_driver { + int num_ioctls; + struct urdma_ioctl_desc *ioctl; +}; + +/* will merge with ib_device, can be separated later to support + * non-verbs devices that do not plug into the kernel APIs + */ +struct urdma_device { + struct urdma_driver *drv; + struct rw_semaphore rw_lock; + int flags; + int num_ioctls; + struct urdma_ioctl_desc *ioctl; + + /* Order to cleanup obj_list. Objects are destroyed from + * obj_list[close_map[0]]..obj_list[close_map[n]] + */ + int num_objs; + int *close_map; + struct list_head *obj_lists; +}; + +/* use ib_uobject? */ +/* urdma will protect against destroying an object that is in use, + * but all locking is pushed down to the drivers. + * Keep this structure as small as possible + */ +struct urdma_obj { + u64 ucontext; + void *kcontext; + u32 instance_id; /* idr index */ + u16 obj_type; + u16 flags; + struct urdma_device *dev; + struct list_head entry; + atomic_t use_cnt; + //struct kref ref; + //struct rw_semaphore mutex; +}; + +struct urdma_map { + struct idr idr; + struct mutex lock; +}; + + +#endif /* RDMA_UAPI_H */ +diff --git a/drivers/infiniband/core/urdma.c b/drivers/infiniband/core/urdma.c new file mode 100644 index 0000000..bd84eb9 --- /dev/null +++ b/drivers/infiniband/core/urdma.c @@ -0,0 +1,373 @@ +/* + * Copyright (c) 2016 Intel Corporation, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include <linux/types.h> +#include <linux/bitops.h> + +#include <rdma/ib_user_verbs.h> +#include <rdma/ib_verbs.h> +#include <uapi/rdma/rdma_ioctl.h> +#include <rdma/rdma_uapi.h> + +#include "uverbs.h" + + +static long urdma_query_device(struct urdma_device *dev, void *data, + void *file_data) +{ + return -ENOSYS; +} + +/* shared ioctl function dispatch table, usable by all verbs devices */ +const struct urdma_ioctl_desc verbs_ioctl[URDMA_MAX_BASE] = { + URDMA_DESC(DEVICE, QUERY, urdma_query_device, 0), +// URDMA_DESC(DEVICE, OPEN, urdma_open_device, URDMA_EVENT), + /* we could also assume exclusive access for modify/close operations */ +// URDMA_DESC(DEVICE, CLOSE, urdma_close_device, URDMA_EXCL), +// URDMA_DESC(DEVICE, MODIFY, urdma_modify_device, URDMA_EXCL), +// URDMA_DESC(CQ, QUERY, urdma_query_cq, 0), +// URDMA_DESC(CQ, OPEN, urdma_open_cq, URDMA_EVENT), +// URDMA_DESC(CQ, CLOSE, urdma_close_cq, URDMA_EXCL), +// URDMA_DESC(CQ, MODIFY, urdma_modify_cq, URDMA_EXCL), + /* ... yadda yadda yadda */ +}; +EXPORT_SYMBOL(verbs_ioctl); + + +/* Map instance id's to object structures. + * We can define per object/device/driver maps if needed for better + * parallelism, but use one for now. + */ +struct urdma_map map = { IDR_INIT(map.idr), + __MUTEX_INITIALIZER(map.lock) }; + + +static struct urdma_obj * urdma_get_obj(struct idr *idr, struct urdma_device *dev, + struct urdma_obj_id *id, bool excl) +{ + struct urdma_obj *obj; + + if (id->resv) + return ERR_PTR(-EINVAL); + + obj = idr_find(idr, id->instance_id); + if (!obj || obj->dev != dev || obj->obj_type != id->obj_type) + return ERR_PTR(-ENOENT); + else if (obj->flags & URDMA_EXCL || (excl && atomic_read(&obj->use_cnt))) + return ERR_PTR(-EBUSY); + + if (excl) + obj->flags |= URDMA_EXCL; + atomic_inc(&obj->use_cnt); + return obj; +} + +static void urdma_put_obj(struct urdma_obj *obj) +{ + if (obj->flags & URDMA_EXCL) + obj->flags &= ~URDMA_EXCL; + atomic_dec(&obj->use_cnt); +} + +static void urdma_unmap_obj(struct urdma_ioctl *ioctl, int index) +{ + struct urdma_obj *obj; + + obj = ioctl->obj[index]; + ioctl->obj_id[index].instance_id = obj->instance_id; + ioctl->obj_id[index].obj_type = obj->obj_type; + ioctl->obj_id[index].resv = 0; + urdma_put_obj(obj); +} + +static void urdma_unmap_objs(struct urdma_device *dev, struct urdma_ioctl *ioctl) +{ + int i; + + for (i = 0; i < ioctl->count; i++) + urdma_unmap_obj(ioctl, i); +} + +static long urdma_map_objs(struct urdma_device *dev, + struct urdma_ioctl *ioctl, bool excl) +{ + struct urdma_obj *obj; + int i; + + mutex_lock(&map.lock); + for (i = 0; i < ioctl->count; i++) { + obj = urdma_get_obj(&map.idr, dev, &ioctl->obj_id[i], + excl && i == 0); + if (IS_ERR(obj)) + goto err; + + ioctl->obj[i] = obj; + } + mutex_unlock(&map.lock); + return 0; +err: + while (i--) + urdma_unmap_obj(ioctl, i); + return PTR_ERR(obj); +} + +/* process driver specific ioctl + * driver ioctl's follow more conventional ioctl format + */ +long urdma_driver_ioctl(struct ib_uverbs_file *file_data, unsigned int cmd, + unsigned long arg) +{ + struct urdma_device *dev /*= file_data->dev*/; + struct urdma_driver *drv = dev->drv; + struct urdma_ioctl_desc *desc; + char stack_data[128], *data; + u16 size; + int offset; + long ret; + + offset = URDMA_OP(cmd); + if (offset >= drv->num_ioctls || !drv->ioctl[offset].func) + return -EINVAL; + + desc = &drv->ioctl[offset]; + size = _IOC_SIZE(desc->cmd); + if (size > sizeof(stack_data)) { + data = kmalloc(size, GFP_KERNEL); + if (!data) + return -ENOMEM; + } else { + data = stack_data; + } + + if (desc->cmd & IOC_IN) { + if (copy_from_user(data, (void __user *) arg, size)) { + ret = -EFAULT; + goto out; + } + } else if (desc->cmd & IOC_OUT) { + memset(data, 0, size); + } + + /* data is in/out parameter */ + ret = desc->func(dev, data, file_data); + + if (desc->cmd & IOC_OUT) { + if (copy_to_user((void __user *) arg, data, size)) + ret = -EFAULT; + } +out: + if (data != stack_data) + kfree(data); + return ret; +} + +static long urdma_pre_common(struct urdma_device *dev, struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, void *file_data) +{ + down_read(&dev->rw_lock); + if (dev->flags & URDMA_CLOSED) { + up_read(&dev->rw_lock); + return -ENODEV; + } + + return urdma_map_objs(dev, ioctl, desc->flags & URDMA_EXCL); +} + +static long urdma_post_common(struct urdma_device *dev, struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, void *file_data) +{ + urdma_unmap_objs(dev, ioctl); + up_read(&dev->rw_lock); + return 0; +} + +static long urdma_pre_open(struct urdma_device *dev, struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, void *file_data) +{ + struct urdma_obj *obj; + + obj = kzalloc(sizeof *obj, GFP_KERNEL); + if (!obj) + return -ENOMEM; + + obj->flags = URDMA_EXCL; + obj->obj_type = ioctl->domain; + atomic_set(&obj->use_cnt, 1); + + mutex_lock(&map.lock); + obj->instance_id = idr_alloc(&map.idr, obj, 0, 0, GFP_KERNEL); + /* TODO: handle driver objects */ + if (obj->instance_id >= 0) + list_add_tail(&obj->entry, &dev->obj_lists[obj->obj_type]); + mutex_unlock(&map.lock); + + if (obj->instance_id < 0) { + kfree(obj); + return -ENOMEM; + } + + /* new object added after input object array */ + ioctl->obj[ioctl->count++] = obj; + return 0; +} + +static long urdma_pre_close(struct urdma_device *dev, struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, void *file_data) +{ + if (ioctl->count != 1) + return -EINVAL; + return urdma_map_objs(dev, ioctl, desc->flags & URDMA_EXCL); +} + +static long urdma_post_close(struct urdma_device *dev, struct urdma_ioctl *ioctl, + struct urdma_ioctl_desc *desc, void *file_data) +{ + struct urdma_obj *obj; + + obj = ioctl->obj[0]; + ioctl->obj[0] = NULL; + + mutex_lock(&map.lock); + idr_remove(&map.idr, obj->instance_id); + list_del(&obj->entry); + mutex_unlock(&map.lock); + kfree(obj); + return 0; +} + +const static urdma_ioctl_hook_t urdma_pre_op[URDMA_MAX_OP] = { + [URDMA_QUERY] = urdma_pre_common, + [URDMA_OPEN] = urdma_pre_open, + [URDMA_CLOSE] = urdma_pre_close, + [URDMA_MODIFY] = urdma_pre_common, + [URDMA_READ] = urdma_pre_common, + [URDMA_WRITE]= urdma_pre_common, +}; + +const static urdma_ioctl_hook_t urdma_post_op[URDMA_MAX_OP] = { + [URDMA_QUERY] = urdma_post_common, + [URDMA_OPEN] = urdma_post_common, + [URDMA_CLOSE] = urdma_post_close, + [URDMA_MODIFY] = urdma_post_common, + [URDMA_READ] = urdma_post_common, + [URDMA_WRITE]= urdma_post_common, +}; + +long urdma_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) +{ + struct urdma_device *dev; + struct ib_uverbs_file *file_data; + struct urdma_ioctl_desc *desc; + struct urdma_ioctl hdr, *data; + char stack_data[128]; + u8 op; + int offset; + long ret; + + file_data = filp->private_data; + /* dev = file_data->dev; */ + + if (_IOC_NR(cmd) & URDMA_DRIVER_OP) + return urdma_driver_ioctl(file_data, cmd, arg); + + op = URDMA_OP(cmd); + if (op > URDMA_MAX_OP || _IOC_SIZE(cmd) < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, (void __user *) arg, sizeof(hdr))) + return -EFAULT; + + offset = URDMA_OFFSET(hdr.domain, op); + if (offset >= dev->num_ioctls || !dev->ioctl[offset].func) + return -EINVAL; + + desc = &dev->ioctl[offset]; + if ((sizeof(hdr) + hdr.count * sizeof(hdr.obj_id) > hdr.length) || + (hdr.length > desc->length)) + return -EINVAL; + + if (desc->length > sizeof(stack_data)) { + data = kmalloc(desc->length, GFP_KERNEL); + if (!data) + return -ENOMEM; + } else { + data = (struct urdma_ioctl *) stack_data; + } + + if (copy_from_user(data, (void __user *) arg, hdr.length)) { + ret = -EFAULT; + goto out; + } + + if (urdma_pre_op[op]) { + ret = urdma_pre_op[op](dev, data, desc, file_data); + if (ret) + goto out; + } + + ret = desc->func(dev, data, file_data); + + if (urdma_post_op[op]) { + ret = urdma_post_op[op](dev, data, desc, file_data); + if (ret) + goto out; + } + + if (copy_to_user((void __user *) arg, data, data->length)) + ret = -EFAULT; +out: + if (data != (struct urdma_ioctl *) stack_data) + kfree(data); + return ret; +} + +static void urdma_close_obj(struct urdma_device *dev, struct urdma_obj *obj) +{ + /* kernel initiated close, releaes device resources */ +} + +static void urdma_close_dev(struct urdma_device *dev) +{ + struct urdma_obj *obj; + int i; + + down_write(&dev->rw_lock); + dev->flags |= URDMA_CLOSED; + + for (i = 0; i < dev->num_objs; i++) { + list_for_each_entry(obj, &dev->obj_lists[dev->close_map[i]], entry) { + urdma_close_obj(dev, obj); + } + } + up_write(&dev->rw_lock); +} +
The following is a sketch, or outline, for an ioctl framework. The purpose of the patch is to drive discussion and feedback on various design decisions rather than low-level comments on the code, so that the changes can be incorporated earlier in the development cycle. The code is entirely untested and some functionality is not coded. The general architecture behind the implementation is described below. Conceptually, ioctl's are split into two main groups: structured and unstructured. Unstructured ioctls: Unstructured ioctls are dispatched directly to a driver with minimal processing by the framework. Structured ioctls: Structured ioctls are designed to support generic framework processing. These ioctls are dispatched to per-device routines and are based on the object ioctl model which was previously posted to the linux-rdma mail list. It is anticipated that most ioctls will be structured, and the following descriptions assume structured ioctls. Ioctl commands: There are 6 generic ioctl commands: QUERY, OPEN, CLOSE, MODIFY, READ, and WRITE. Commands target a specific functional domain, which may map to specific objects. The implementation has a 1:1 mapping, but this is not required. Possible domains include: DRIVER, DEVICE, PORT, CQ, PD, MR, QP, etc. The full set of domains, and how they map to objects, needs to be determined. Ioctls are routed to dispatch routines based on the <domain, command> pair. Additionally, each command includes pre and post operation hooks. The hooks allow command specific (e.g. OPEN or CLOSE) functionality to be applied. Ioctl descriptors: Descriptors are kernel structures that describe how each ioctl should be processed by the framework, and include the dispatch routine and control flags. Each device maintains an array of ioctl descriptors. Discussion is needed on the format of the descriptors, though these are easily changeable. Ioctl format: All ioctls start with the same basic header. The header includes a version, the length of any input data, and the domain. Structured ioctls also include an array of objects that the ioctl will require access to for processing. This patch does not define the format of the ioctl data beyond the object array. That is ultimately determined based on the <domain, command>. However, as seen in the patch, the format of the object array will matter and needs discussion. Kernel objects: The framework maps object identifiers between user space and the kernel. Based on descriptor control flags, it will provide exclusive access to objects where needed. The framework also handles clean-up in case of application exits and supports device removal. Event handling will also be supported by the framework, though that work still needs to be done. Architecturally, the framework supports drivers plugging into it independent of any kernel interface that may be used. It is also relatively independent of any user space library or implementation which may require access to kernel resources. Signed-off-by: Sean Hefty <sean.hefty@intel.com> --- I restructured the email to show the headers first. The code is also available in the dev branch on github.com/shefty/linux. drivers/infiniband/core/Makefile | 2 +- drivers/infiniband/core/urdma.c | 373 ++++++++++++++++++++++++++++++++++++++ include/rdma/rdma_uapi.h | 140 ++++++++++++++ include/uapi/rdma/rdma_ioctl.h | 134 ++++++++++++++ 4 files changed, 648 insertions(+), 1 deletions(-) create mode 100644 drivers/infiniband/core/urdma.c create mode 100644 include/rdma/rdma_uapi.h create mode 100644 include/uapi/rdma/rdma_ioctl.h