Message ID | 1437548143-24893-29-git-send-email-sagig@mellanox.com (mailing list archive) |
---|---|
State | RFC |
> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr: memory region
> + * @sg: dma mapped scatterlist
> + * @sg_nents: number of entries in sg
> + * @access: access permissions

I know moving the access flags here was my idea originally, but I seem
convinced by your argument that it might fit in better with the posting
helper.  Or did someone else come up with a better argument than mine
for moving it here?

> +int ib_map_mr_sg(struct ib_mr *mr,
> +		 struct scatterlist *sg,
> +		 unsigned short sg_nents,
> +		 unsigned int access)
> +{
> +	int rc;
> +
> +	if (!mr->device->map_mr_sg)
> +		return -ENOSYS;
> +
> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);

Do we really need a driver callout here?  It seems like we should
just do the map here, and then either have a flag for the mlx5 indirect
mapping, or if you want to keep the abstraction add the method at that
point but make it optional, so that all the other drivers don't need the
boilerplate code.

Also, it seems like this returns 0/-error.  How do callers like SRP
see that it only did a partial mapping and it needs another MR?
On 7/22/2015 7:50 PM, Christoph Hellwig wrote:
>> +/**
>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>> + * @mr: memory region
>> + * @sg: dma mapped scatterlist
>> + * @sg_nents: number of entries in sg
>> + * @access: access permissions
>
> I know moving the access flags here was my idea originally, but I seem
> convinced by your argument that it might fit in better with the posting
> helper.  Or did someone else come up with a better argument than mine
> for moving it here?

Not really. I was, and still am, pretty indifferent about it...

>> +int ib_map_mr_sg(struct ib_mr *mr,
>> +		  struct scatterlist *sg,
>> +		  unsigned short sg_nents,
>> +		  unsigned int access)
>> +{
>> +	int rc;
>> +
>> +	if (!mr->device->map_mr_sg)
>> +		return -ENOSYS;
>> +
>> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
>
> Do we really need a driver callout here?  It seems like we should
> just do the map here, and then either have a flag for the mlx5 indirect
> mapping, or if you want to keep the abstraction add the method at that
> point but make it optional, so that all the other drivers don't need the
> boilerplate code.

I commented on this bit in another reply. I think that several drivers
will want to use their own mappings, but I can change that if it's not
the case...

> Also, it seems like this returns 0/-error.  How do callers like SRP
> see that it only did a partial mapping and it needs another MR?

Umm, I think SRP would need to iterate over its SG list and pass
partial SGs to the mapping (I can add a break; statement once we hit
sg_nents).

It's not perfect, but the idea was not to do backflips here.
On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
> > +/**
> > + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> > + * @mr: memory region
> > + * @sg: dma mapped scatterlist
> > + * @sg_nents: number of entries in sg
> > + * @access: access permissions
>
> I know moving the access flags here was my idea originally, but I seem
> convinced by your argument that it might fit in better with the posting
> helper.  Or did someone else come up with a better argument than mine
> for moving it here?

I was hoping we'd move the DMA flush and translate into here and make
it mandatory. Is there any reason not to do that?

> > +int ib_map_mr_sg(struct ib_mr *mr,
> > +		  struct scatterlist *sg,
> > +		  unsigned short sg_nents,
> > +		  unsigned int access)
> > +{
> > +	int rc;
> > +
> > +	if (!mr->device->map_mr_sg)
> > +		return -ENOSYS;
> > +
> > +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
>
> Do we really need a driver callout here?  It seems like we should

The callout makes sense to me.

The driver will convert the scatterlist directly into whatever HW
representation it needs and prepare everything for posting. Every
driver has a different HW format, so it must be a callout.

> Also, it seems like this returns 0/-error.  How do callers like SRP
> see that it only did a partial mapping and it needs another MR?

I would think it is an error to pass in more sg_nents than the MR was
created with, so SRP should never get a partial mapping, as it should
never ask for more than max_entries.

(? Sagi, did I get the intent of this right?)

Jason
On Wed, Jul 22, 2015 at 09:55:28AM +0300, Sagi Grimberg wrote:
> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr: memory region
> + * @sg: dma mapped scatterlist
> + * @sg_nents: number of entries in sg
> + * @access: access permissions

Again, related to my prior comments, please have two of these:

	ib_map_mr_sg_rkey()
	ib_map_mr_sg_lkey()

So we force ULPs to think about what they are doing properly, and we
get a chance to actually force lkey to be local use only for IB.

> +static inline void
> +ib_set_fastreg_wr(struct ib_mr *mr,
> +		  u32 key,

The key should come from the MR. Once the above is split, it is
obvious which key to use.

Jason
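To make the proposed split concrete, a rough sketch of what the two
entry points might look like (hypothetical: only the names come from
the mail above; the bodies are illustrative and assume the existing
IB_ACCESS_* flags from <rdma/ib_verbs.h> plus the ib_map_mr_sg() from
this patch):

	/* lkeys must never be remotely accessible on IB */
	static inline int ib_map_mr_sg_lkey(struct ib_mr *mr,
					    struct scatterlist *sg,
					    unsigned short sg_nents,
					    unsigned int access)
	{
		if (access & (IB_ACCESS_REMOTE_READ |
			      IB_ACCESS_REMOTE_WRITE |
			      IB_ACCESS_REMOTE_ATOMIC))
			return -EINVAL;
		return ib_map_mr_sg(mr, sg, sg_nents, access);
	}

	/* rkeys may carry remote access flags */
	static inline int ib_map_mr_sg_rkey(struct ib_mr *mr,
					    struct scatterlist *sg,
					    unsigned short sg_nents,
					    unsigned int access)
	{
		return ib_map_mr_sg(mr, sg, sg_nents, access);
	}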
On Wed, Jul 22, 2015 at 11:44:01AM -0600, Jason Gunthorpe wrote:
> I was hoping we'd move the DMA flush and translate into here and make
> it mandatory. Is there any reason not to do that?

That would be a reason for passing in a direction, but it would also
bring up the question of what form we pass that access flag in: the
old-school RDMA local/remote read/write flags, or an enum
dma_data_direction and either a bool or separate functions for
lkey/rkey.

Although I wonder if we really need to differentiate between rkey and
lkey in this ib_map_mr_sg function, or if we should do it when
allocating the MR, i.e. in ib_alloc_mr.
On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:
> On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
>>> +/**
>>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>>> + * @mr: memory region
>>> + * @sg: dma mapped scatterlist
>>> + * @sg_nents: number of entries in sg
>>> + * @access: access permissions
>>
>> I know moving the access flags here was my idea originally, but I seem
>> convinced by your argument that it might fit in better with the posting
>> helper.  Or did someone else come up with a better argument than mine
>> for moving it here?
>
> I was hoping we'd move the DMA flush and translate into here and make
> it mandatory. Is there any reason not to do that?

The reason I didn't add it was so the ULPs can make sure they meet
the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
list and set partials, and iSER to detect gaps (they need to dma map
for that).

>>> +int ib_map_mr_sg(struct ib_mr *mr,
>>> +		   struct scatterlist *sg,
>>> +		   unsigned short sg_nents,
>>> +		   unsigned int access)
>>> +{
>>> +	int rc;
>>> +
>>> +	if (!mr->device->map_mr_sg)
>>> +		return -ENOSYS;
>>> +
>>> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
>>
>> Do we really need a driver callout here?  It seems like we should
>
> The callout makes sense to me.
>
> The driver will convert the scatterlist directly into whatever HW
> representation it needs and prepare everything for posting. Every
> driver has a different HW format, so it must be a callout.
>
>> Also, it seems like this returns 0/-error.  How do callers like SRP
>> see that it only did a partial mapping and it needs another MR?
>
> I would think it is an error to pass in more sg_nents than the MR was
> created with, so SRP should never get a partial mapping, as it should
> never ask for more than max_entries.
>
> (? Sagi, did I get the intent of this right?)

An error is returned when:
- sg_nents > max_entries
- the sg has gaps
On 7/22/2015 9:02 PM, Jason Gunthorpe wrote:
> On Wed, Jul 22, 2015 at 09:55:28AM +0300, Sagi Grimberg wrote:
>> +/**
>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>> + * @mr: memory region
>> + * @sg: dma mapped scatterlist
>> + * @sg_nents: number of entries in sg
>> + * @access: access permissions
>
> Again, related to my prior comments, please have two of these:
>
> 	ib_map_mr_sg_rkey()
> 	ib_map_mr_sg_lkey()
>
> So we force ULPs to think about what they are doing properly, and we
> get a chance to actually force lkey to be local use only for IB.

The lkey/rkey decision is passed in the fastreg post_send().
ib_map_mr_sg() is just a mapping API, not the registration itself.

>> +static inline void
>> +ib_set_fastreg_wr(struct ib_mr *mr,
>> +		   u32 key,
>
> The key should come from the MR. Once the above is split, it is
> obvious which key to use.

IMO, it's obvious as it is. I don't see why anyone should get it
wrong.
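In other words, with the posted API the mapping is key-agnostic and
the caller picks the key when building the work request. A minimal
sketch of that flow, using only the helpers from this patch (wr,
wr_id, access and the mapped scatterlist stand for the caller's
state):

	ret = ib_map_mr_sg(mr, sg, sg_nents, access);
	if (ret)
		return ret;

	/* local use ... */
	ib_set_fastreg_wr(mr, mr->lkey, wr_id, true, &wr);
	/* ... or remote use */
	ib_set_fastreg_wr(mr, mr->rkey, wr_id, true, &wr);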
On Thu, Jul 23, 2015 at 02:19:55AM -0700, Christoph Hellwig wrote:
> Although I wonder if we really need to differentiate between rkey and
> lkey in this ib_map_mr_sg function, or if we should do it when
> allocating the MR, i.e. in ib_alloc_mr.

The allocation is agnostic to the usage; the map is what solidifies
things into a certain use, effectively based on the access flags.

Jason
On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:
> > Again, related to my prior comments, please have two of these:
> >
> > 	ib_map_mr_sg_rkey()
> > 	ib_map_mr_sg_lkey()
> >
> > So we force ULPs to think about what they are doing properly, and we
> > get a chance to actually force lkey to be local use only for IB.
>
> The lkey/rkey decision is passed in the fastreg post_send().

That is too late to check the access flags.

Jason
On 7/23/2015 7:14 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:
>>> Again, related to my prior comments, please have two of these:
>>>
>>> 	ib_map_mr_sg_rkey()
>>> 	ib_map_mr_sg_lkey()
>>>
>>> So we force ULPs to think about what they are doing properly, and we
>>> get a chance to actually force lkey to be local use only for IB.
>>
>> The lkey/rkey decision is passed in the fastreg post_send().
>
> That is too late to check the access flags.

Why? The access permissions are kept in the MR context.

I can move it to the post interface if it makes more sense; the access
is kind of out of place in the mapping routine anyway...
On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> > I was hoping we'd move the DMA flush and translate into here and make
> > it mandatory. Is there any reason not to do that?
>
> The reason I didn't add it was so the ULPs can make sure they meet
> the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
> list and set partials, and iSER to detect gaps (they need to dma map
> for that).

The ULP can always get the sg list's virtual address to check for
gaps. Page aligned gaps are always OK.

BTW, the logic in ib_sg_to_pages should be checking that directly; as
coded, it won't work with swiotlb:

	// Only the first SG entry can start unaligned
	if (i && page_addr != dma_addr)
		return -EINVAL;
	// Only the last SG entry can end unaligned
	if (((page_addr + dma_len) & PAGE_MASK) != end_dma_addr)
		if (!is_last)
			return -EINVAL;

Don't use sg->offset after dma mapping.

The biggest problem with checking the virtual address is swiotlb.
However, if swiotlb is used, this API is basically broken as swiotlb
downgrades everything to a 2k alignment, which means we only ever get
1 s/g entry.

To efficiently support swiotlb we'd need to see the driver be able to
work with a page size of 2k (1 << IO_TLB_SHIFT) so it can handle the
de-alignment that happens during bouncing.

My biggest problem with pushing the dma address up to the ULP is
basically that: the ULP has no idea what the driver can handle; maybe
the driver can handle the 2k pages.

So, that leaves a flow where the ULP does a basic sanity check on the
virtual side, then asks the IB core to map it. The mapping could still
fail because of swiotlb. If the mapping fails, then the ULP has to
bounce buffer, or MR split, or totally fail.

For bounce buffering, all solutions have a DMA map/unmap cost, so it
doesn't matter if ib_map_mr_sg does that internally. For MR
fragmentation, the DMA mapping is still usable.

Maybe we do need a slightly different core API to help MR
fragmentation? Sounds like NFS uses this too:

	num_mrs = ib_mr_fragment_sg(&scatterlist);
	while (..)
		ib_map_fragment_sg(mr[i++], &scatterlist, &offset);

Perhaps? Maybe that is even better, because something like iSER could
do the parallel:

	ib_mr_needs_fragment_sg(reference_mr, &scatterlist)

which hides all the various restrictions in driver code.

Jason
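To make the fragmentation idea concrete, a sketch of how the counting
side of such a helper might work (entirely hypothetical; it assumes a
dma mapped SG list and an MR capacity of max_pages page-list entries,
and it glosses over the first/last alignment rules a real split would
have to obey):

	static int ib_mr_fragment_sg(struct scatterlist *sgl,
				     unsigned short sg_nents,
				     unsigned int max_pages)
	{
		struct scatterlist *sg;
		unsigned int npages = 0;
		int i;

		for_each_sg(sgl, sg, sg_nents, i) {
			u64 dma_addr = sg_dma_address(sg);
			u64 end = dma_addr + sg_dma_len(sg);

			/* pages touched by this element */
			npages += (ALIGN(end, PAGE_SIZE) -
				   (dma_addr & PAGE_MASK)) >> PAGE_SHIFT;
		}

		/* number of MRs needed to cover the whole list */
		return DIV_ROUND_UP(npages, max_pages);
	}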
On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:
> > On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
> >>> +/**
> >>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> >>> + * @mr: memory region
> >>> + * @sg: dma mapped scatterlist
> >>> + * @sg_nents: number of entries in sg
> >>> + * @access: access permissions
> >>
> >> I know moving the access flags here was my idea originally, but I seem
> >> convinced by your argument that it might fit in better with the posting
> >> helper.  Or did someone else come up with a better argument than mine
> >> for moving it here?
> >
> > I was hoping we'd move the DMA flush and translate into here and make
> > it mandatory. Is there any reason not to do that?
>
> The reason I didn't add it was so the ULPs can make sure they meet
> the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
> list and set partials, and iSER to detect gaps (they need to dma map
> for that).

I would like to see the kdoc for ib_map_mr_sg explain exactly what is
required of the caller; maybe just hoist this bit from
ib_sg_to_pages.

Not entirely required if we are going to have an API to do the test..

Jason
On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
> >>> So we force ULPs to think about what they are doing properly, and we
> >>> get a chance to actually force lkey to be local use only for IB.
> >>
> >> The lkey/rkey decision is passed in the fastreg post_send().
> >
> > That is too late to check the access flags.
>
> Why? The access permissions are kept in the MR context.

Sure, one could do 'if (key == mr->lkey) .. check lkey flags ..' in
the post, but that seems silly considering we want the post inlined.

> I can move it to the post interface if it makes more sense; the access
> is kind of out of place in the mapping routine anyway...

All the dma routines have an access equivalent during map, so I don't
think it is out of place.

To my mind, the map is the point where the MR should crystallize into
an rkey or lkey MR, not the post.

Jason
>
> I would like to see the kdoc for ib_map_mr_sg explain exactly what is
> required of the caller; maybe just hoist this bit from
> ib_sg_to_pages.

I'll add the kdoc.
On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
>>> I was hoping we'd move the DMA flush and translate into here and make
>>> it mandatory. Is there any reason not to do that?
>>
>> The reason I didn't add it was so the ULPs can make sure they meet
>> the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
>> list and set partials, and iSER to detect gaps (they need to dma map
>> for that).
>
> The ULP can always get the sg list's virtual address to check for
> gaps. Page aligned gaps are always OK.

I guess I can pull DMA mapping in there, but we will need an opposite
routine, ib_unmap_mr_sg(), since it would be weird for the ULP to do
the dma unmap without having done the map...

> BTW, the logic in ib_sg_to_pages should be checking that directly; as
> coded, it won't work with swiotlb:
>
> 	// Only the first SG entry can start unaligned
> 	if (i && page_addr != dma_addr)
> 		return -EINVAL;
> 	// Only the last SG entry can end unaligned
> 	if (((page_addr + dma_len) & PAGE_MASK) != end_dma_addr)
> 		if (!is_last)
> 			return -EINVAL;
>
> Don't use sg->offset after dma mapping.
>
> The biggest problem with checking the virtual address is swiotlb.
> However, if swiotlb is used, this API is basically broken as swiotlb
> downgrades everything to a 2k alignment, which means we only ever get
> 1 s/g entry.

Can you explain what you mean by "downgrades everything to a 2k
alignment"? If the ULP is responsible for PAGE_SIZE alignment, then
how would this get out of alignment with swiotlb?
On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
>
>>>>> So we force ULPs to think about what they are doing properly, and we
>>>>> get a chance to actually force lkey to be local use only for IB.
>>>>
>>>> The lkey/rkey decision is passed in the fastreg post_send().
>>>
>>> That is too late to check the access flags.
>>
>> Why? The access permissions are kept in the MR context.
>
> Sure, one could do 'if (key == mr->lkey) .. check lkey flags ..' in
> the post, but that seems silly considering we want the post inlined.

Why should we check the lkey/rkey access flags in the post?

>> I can move it to the post interface if it makes more sense; the access
>> is kind of out of place in the mapping routine anyway...
>
> All the dma routines have an access equivalent during map, so I don't
> think it is out of place.
>
> To my mind, the map is the point where the MR should crystallize into
> an rkey or lkey MR, not the post.

I'm not sure I understand why the lkey/rkey should be set at the map
routine. To me, it seems more natural to map_mr_sg and then either
register the lkey or the rkey.

It's easy enough to move the key arg to ib_map_mr_sg, but I don't see
a good reason to at the moment.
On Sun, Jul 26, 2015 at 12:37:55PM +0300, Sagi Grimberg wrote:
> I guess I can pull DMA mapping in there, but we will need an opposite
> routine, ib_unmap_mr_sg(), since it would be weird for the ULP to do
> the dma unmap without having done the map...

Yes, and ideally it would help ULPs to order these operations
properly; e.g. we shouldn't be abusing the DMA API and unmapping
before the invalidate completes by default. That breaks obscure stuff
in various ways...

> > The biggest problem with checking the virtual address is swiotlb.
> > However, if swiotlb is used, this API is basically broken as swiotlb
> > downgrades everything to a 2k alignment, which means we only ever get
> > 1 s/g entry.
>
> Can you explain what you mean by "downgrades everything to a 2k
> alignment"? If the ULP is responsible for PAGE_SIZE alignment, then
> how would this get out of alignment with swiotlb?

swiotlb copies all DMA maps to a shared buffer below 4G so it can be
used with 32 bit devices.

The shared buffer is managed in a way that copies each s/g element to
a contiguous 2k aligned subsection of the buffer. Basically, swiotlb
realigns everything that passes through it.

The DMA API allows this, so ultimately code has to check the dma
physical address when concerned about alignment.. but we should not
expect this to commonly fail.

So, something like:

	if (!ib_does_sgl_fit_in_mr(mr, sg))
		.. bounce buffer ..

	if (!ib_map_mr_sg(mr, sg))	/* does dma mapping and checks it */
		.. bounce buffer ..

	.. post ..
	.. send invalidate ..
	.. catch invalidate completion ..

	ib_unmap_mr(mr);		/* does dma unmap */

?

Jason
On Sun, Jul 26, 2015 at 12:45:10PM +0300, Sagi Grimberg wrote:
> On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
> > On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
> >
> >>>>> So we force ULPs to think about what they are doing properly, and we
> >>>>> get a chance to actually force lkey to be local use only for IB.
> >>>>
> >>>> The lkey/rkey decision is passed in the fastreg post_send().
> >>>
> >>> That is too late to check the access flags.
> >>
> >> Why? The access permissions are kept in the MR context.
> >
> > Sure, one could do 'if (key == mr->lkey) .. check lkey flags ..' in
> > the post, but that seems silly considering we want the post inlined.
>
> Why should we check the lkey/rkey access flags in the post?

Eh? It was your idea..

I just want to check the access flags and force lkeys to not have
ACCESS_REMOTE set without a loud complaint.

To do that, you need to know if the MR is an lkey or rkey, and you
need to know the flags.

> >> I can move it to the post interface if it makes more sense; the access
> >> is kind of out of place in the mapping routine anyway...
> >
> > All the dma routines have an access equivalent during map, so I don't
> > think it is out of place.
> >
> > To my mind, the map is the point where the MR should crystallize into
> > an rkey or lkey MR, not the post.
>
> I'm not sure I understand why the lkey/rkey should be set at the map
> routine. To me, it seems more natural to map_mr_sg and then either
> register the lkey or the rkey.

We need to check the access flags to put a stop to this remote-access
lkey security problem. That means we need to label every MR as an lkey
or rkey MR.

No more "MRs can be both" nonsense. Pick a place to do that and
enforce that IB cannot have remote access lkeys.

My vote is to do that work in the map, because I don't think it makes
any sense in the post (post should not fail).

Jason
On 7/27/2015 12:14 PM, Jason Gunthorpe wrote:
> On Sun, Jul 26, 2015 at 12:45:10PM +0300, Sagi Grimberg wrote:
>> On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
>>> On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
>>>
>>>>>>> So we force ULPs to think about what they are doing properly, and we
>>>>>>> get a chance to actually force lkey to be local use only for IB.
>>>>>> The lkey/rkey decision is passed in the fastreg post_send().
>>>>> That is too late to check the access flags.
>>>> Why? The access permissions are kept in the MR context.
>>> Sure, one could do 'if (key == mr->lkey) .. check lkey flags ..' in
>>> the post, but that seems silly considering we want the post inlined.
>> Why should we check the lkey/rkey access flags in the post?
> Eh? It was your idea..
>
> I just want to check the access flags and force lkeys to not have
> ACCESS_REMOTE set without a loud complaint.
>
> To do that, you need to know if the MR is an lkey or rkey, and you
> need to know the flags.
>
>>>> I can move it to the post interface if it makes more sense; the access
>>>> is kind of out of place in the mapping routine anyway...
>>> All the dma routines have an access equivalent during map, so I don't
>>> think it is out of place.
>>>
>>> To my mind, the map is the point where the MR should crystallize into
>>> an rkey or lkey MR, not the post.
>> I'm not sure I understand why the lkey/rkey should be set at the map
>> routine. To me, it seems more natural to map_mr_sg and then either
>> register the lkey or the rkey.
> We need to check the access flags to put a stop to this remote-access
> lkey security problem. That means we need to label every MR as an lkey
> or rkey MR.
>
> No more "MRs can be both" nonsense.

Well, technically an MR with REMOTE_WRITE also has LOCAL_WRITE set. So
you are proposing the core disallow a ULP from using the lkey for this
type of MR? Say, in a RECV sge?
On Mon, Jul 27, 2015 at 03:11:04PM -0500, Steve Wise wrote:
> Well, technically an MR with REMOTE_WRITE also has LOCAL_WRITE set. So
> you are proposing the core disallow a ULP from using the lkey for this
> type of MR? Say, in a RECV sge?

Yes, absolutely.

It is wrong anyhow. RECV isn't special: if you RECV into memory that
is exposed via an rkey MR, you have to invalidate that MR and fence
DMA before you can touch the buffer. Only very special, carefully
designed cases could avoid that. We don't have those cases, so let's
just ban it.

The only exception is the iWarp RDMA READ thing.

Jason
On 22/07/2015 09:55, Sagi Grimberg wrote:
> +/**
> + * ib_sg_to_pages() - Convert a sg list to a page vector
> + * @dev: ib device
> + * @sgl: dma mapped scatterlist
> + * @sg_nents: number of entries in sg
> + * @max_pages: maximum pages allowed
> + * @pages: output page vector
> + * @npages: output number of mapped pages
> + * @length: output total byte length
> + * @offset: output first byte offset
> + *
> + * Core service helper for drivers to convert a scatter
> + * list to a page vector. The assumption is that the
> + * sg must meet the following conditions:
> + * - Only the first sg is allowed to have an offset
> + * - All the elements are of the same size - PAGE_SIZE
> + * - The last element is allowed to have length less than
> + *   PAGE_SIZE
> + *
> + * If any of those conditions is not met, the routine will
> + * fail with EINVAL.
> + */
> +int ib_sg_to_pages(struct scatterlist *sgl,
> +		   unsigned short sg_nents,
> +		   unsigned short max_pages,
> +		   u64 *pages, u32 *npages,
> +		   u32 *length, u64 *offset)
> +{
> +	struct scatterlist *sg;
> +	u64 last_end_dma_addr = 0, last_page_addr = 0;
> +	unsigned int last_page_off = 0;
> +	int i, j = 0;
> +
> +	/* TODO: We can do better with huge pages */
> +
> +	*offset = sg_dma_address(&sgl[0]);
> +	*length = 0;
> +
> +	for_each_sg(sgl, sg, sg_nents, i) {
> +		u64 dma_addr = sg_dma_address(sg);
> +		unsigned int dma_len = sg_dma_len(sg);
> +		u64 end_dma_addr = dma_addr + dma_len;
> +		u64 page_addr = dma_addr & PAGE_MASK;
> +
> +		*length += dma_len;
> +
> +		/* Fail if we ran out of pages */
> +		if (unlikely(j > max_pages))
> +			return -EINVAL;
> +
> +		if (i && sg->offset) {
> +			if (unlikely((last_end_dma_addr) != dma_addr)) {
> +				/* gap - fail */
> +				goto err;
> +			}
> +			if (last_page_off + dma_len < PAGE_SIZE) {
> +				/* chunk this fragment with the last */
> +				last_end_dma_addr += dma_len;
> +				last_page_off += dma_len;
> +				continue;
> +			} else {
> +				/* map starting from the next page */
> +				page_addr = last_page_addr + PAGE_SIZE;
> +				dma_len -= PAGE_SIZE - last_page_off;
> +			}
> +		}
> +
> +		do {
> +			pages[j++] = page_addr;

I think this line could overrun the pages buffer. The test above only
checks at the beginning of the sg, but with an sg larger than
PAGE_SIZE you could still overrun.

> +			page_addr += PAGE_SIZE;
> +		} while (page_addr < end_dma_addr);
> +
> +		last_end_dma_addr = end_dma_addr;
> +		last_page_addr = end_dma_addr & PAGE_MASK;
> +		last_page_off = end_dma_addr & ~PAGE_MASK;
> +	}
> +
> +	*npages = j;
> +
> +	return 0;
> +err:
> +	pr_err("RDMA alignment violation\n");
> +	for_each_sg(sgl, sg, sg_nents, i) {
> +		u64 dma_addr = sg_dma_address(sg);
> +		unsigned int dma_len = sg_dma_len(sg);
> +
> +		pr_err("sg[%d]: offset=0x%x, dma_addr=0x%llx, dma_len=0x%x\n",
> +		       i, sg->offset, dma_addr, dma_len);
> +	}
> +
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(ib_sg_to_pages);
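A minimal sketch of one way to address this: move the bounds check
into the page-filling loop (and make it >=, since j indexes into
pages[], so even the existing once-per-element check is off by one):

	do {
		if (unlikely(j >= max_pages))
			return -EINVAL;
		pages[j++] = page_addr;
		page_addr += PAGE_SIZE;
	} while (page_addr < end_dma_addr);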
>> Can you explain what you mean by "downgrades everything to a 2k
>> alignment"? If the ULP is responsible for PAGE_SIZE alignment, then
>> how would this get out of alignment with swiotlb?
>
> swiotlb copies all DMA maps to a shared buffer below 4G so it can be
> used with 32 bit devices.
>
> The shared buffer is managed in a way that copies each s/g element to
> a contiguous 2k aligned subsection of the buffer.

Thanks for the explanation.

> Basically, swiotlb realigns everything that passes through it.

So this won't ever happen if the ULP DMA maps the SG and then checks
for gaps, right?

Also, is it interesting to support swiotlb even if we don't have any
devices that require it (and should we expect one to ever exist)?

> The DMA API allows this, so ultimately code has to check the dma
> physical address when concerned about alignment.. but we should not
> expect this to commonly fail.
>
> So, something like:
>
> 	if (!ib_does_sgl_fit_in_mr(mr, sg))
> 		.. bounce buffer ..

I don't understand the need for this if we do the same thing when the
actual mapping fails...

> 	if (!ib_map_mr_sg(mr, sg))	/* does dma mapping and checks it */
> 		.. bounce buffer ..

Each ULP would want to do something different: iSER will bounce, but
SRP would need to use multiple MRs, and NFS would split the request.
On Thu, Jul 30, 2015 at 10:13:09AM +0300, Sagi Grimberg wrote:
> > Basically, swiotlb realigns everything that passes through it.
>
> So this won't ever happen if the ULP DMA maps the SG and then checks
> for gaps, right?

Once mapped, the physical address isn't going to change - but at some
point we must check the physical address directly.

> Also, is it interesting to support swiotlb even if we don't have
> any devices that require it (and should we expect one to ever exist)?

swiotlb is an obvious example, and totally uninteresting to support,
but we must correctly use the DMA API.

> > The DMA API allows this, so ultimately code has to check the dma
> > physical address when concerned about alignment.. but we should not
> > expect this to commonly fail.
> >
> > So, something like:
> >
> > 	if (!ib_does_sgl_fit_in_mr(mr, sg))
> > 		.. bounce buffer ..
>
> I don't understand the need for this if we do the same thing when the
> actual mapping fails...

Just performance. DMA mapping is potentially very expensive, and the
common case to detect will be an sg that is virtually unaligned. This
virtual scan could be bundled inside the map, but if a ULP knows it is
page aligned already, then that is just creating overhead.. I'm
ambivalent.

Jason
On Thu, Jul 30, 2015 at 10:36:31AM -0600, Jason Gunthorpe wrote:
> > Also, is it interesting to support swiotlb even if we don't have
> > any devices that require it (and should we expect one to ever exist)?
>
> swiotlb is an obvious example, and totally uninteresting to support,
> but we must correctly use the DMA API.

Do we have a choice? It seems like various setups with DMA
restrictions rely on it, including many Xen PV guests.
On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
>>> I was hoping we'd move the DMA flush and translate into here and make
>>> it mandatory. Is there any reason not to do that?
>>
>> The reason I didn't add it was so the ULPs can make sure they meet
>> the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
>> list and set partials, and iSER to detect gaps (they need to dma map
>> for that).
>
> The ULP can always get the sg list's virtual address to check for
> gaps. Page aligned gaps are always OK.

So I had a go at moving the DMA mapping into ib_map_mr_sg(), and it
turns out to fit somewhat poorly when the ULP _may_ register memory or
may just send sg_lists (like storage targets over IB/iWARP). The ULP
will sometimes use the DMA mapping and sometimes it won't... it feels
kind of off to me... it's much saner to do:

1. dma_map_sg
2. register / send-sg-list
3. unregister (if needed)
4. dma_unmap_sg

than:

1. if registering - call ib_map_mr_sg (which calls dma_map_sg),
   else do dma_map_sg
2. if registered - call ib_unmap_mr_sg (which calls dma_unmap_sg),
   else do dma_unmap_sg

This kind of forces the ULP to completely separate these code paths,
with very little sharing.

Also, at the moment, when ULPs are doing either FRWR or FMRs, it's a
pain to get a non-intrusive conversion.

I'm thinking we should keep dma_map_sg out of ib_map_mr_sg and leave
it to the ULP like it does today (at least in the first stage...).

Thoughts?
On Wed, Aug 19, 2015 at 02:56:24PM +0300, Sagi Grimberg wrote:
> So I had a go at moving the DMA mapping into ib_map_mr_sg(), and it
> turns out to fit somewhat poorly when the ULP _may_ register memory or
> may just send sg_lists (like storage targets over IB/iWARP). The ULP
> will sometimes use the DMA mapping and sometimes it won't... it feels
> kind of off to me...

Yes, it's odd.

> it's much saner to do:
> 1. dma_map_sg
> 2. register / send-sg-list
> 3. unregister (if needed)
> 4. dma_unmap_sg
>
> than:
> 1. if registering - call ib_map_mr_sg (which calls dma_map_sg),
>    else do dma_map_sg
> 2. if registered - call ib_unmap_mr_sg (which calls dma_unmap_sg),
>    else do dma_unmap_sg
>
> This kind of forces the ULP to completely separate these code paths,
> with very little sharing.
>
> Also, at the moment, when ULPs are doing either FRWR or FMRs, it's a
> pain to get a non-intrusive conversion.
>
> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg and leave
> it to the ULP like it does today (at least in the first stage...).

Keep it out for now. I think we need to move the dma mapping into the
RDMA core rather sooner than later, but that must also include
ib_post_send/recv, so it's better done separately.

After having a look at the mess some drivers (ipath, qib, hfi & ehca)
cause with their abuse of dma_map_ops, I've got an even stronger
opinion on the whole subject now. However, I think we'll get more
things done if we split this into smaller steps.
>
> Keep it out for now.

Ok. I was also thinking of moving the access flags to the work request
again. It doesn't make much sense there unless I go with what Jason
suggested, ib_map_mr_sg_[lkey|rkey], to protect against remote access
for lkeys in IB, which to me sounds redundant at this point given that
ULPs will set the access according to iWARP anyway.

I'd prefer to get this right with a different helper, like Steve
suggested:

	int rdma_access_flags(int mr_roles);

This way we don't need to protect against it.
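A sketch of what such a helper might look like (hypothetical: the
RDMA_MRR_* role bits are made up for illustration, and the suggested
int-only signature is extended here with a device/port pair so the
iWARP special case can live inside the helper via the existing
rdma_protocol_iwarp() check):

	enum {
		RDMA_MRR_RECV		= 1,
		RDMA_MRR_READ_SRC	= 1 << 1,
		RDMA_MRR_READ_DEST	= 1 << 2,
		RDMA_MRR_WRITE_DEST	= 1 << 3,
	};

	static inline int rdma_access_flags(struct ib_device *dev,
					    u8 port, int mr_roles)
	{
		int access = 0;

		if (mr_roles & (RDMA_MRR_RECV | RDMA_MRR_READ_DEST))
			access |= IB_ACCESS_LOCAL_WRITE;
		if (mr_roles & RDMA_MRR_READ_SRC)
			access |= IB_ACCESS_REMOTE_READ;
		if (mr_roles & RDMA_MRR_WRITE_DEST)
			access |= IB_ACCESS_REMOTE_WRITE;

		/* iWARP needs REMOTE_WRITE on the local sink of an
		 * RDMA READ; IB must not set it */
		if ((mr_roles & RDMA_MRR_READ_DEST) &&
		    rdma_protocol_iwarp(dev, port))
			access |= IB_ACCESS_REMOTE_WRITE;

		return access;
	}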
On Wed, Aug 19, 2015 at 07:09:18PM +0300, Sagi Grimberg wrote:
> Ok. I was also thinking of moving the access flags to the work request
> again.

Yes, with the current code I don't think we need it in the MR.

> I'd prefer to get this right with a different helper, like Steve
> suggested:
>
> 	int rdma_access_flags(int mr_roles);

We can start with that. In the long run we really want to have two
higher level helpers to RDMA READ a scatterlist:

 - one for iWARP that uses an FR and RDMA READ WITH INVALIDATE
 - one for IB-like transports that just uses a READ with the local lkey

Right now every ULP that wants to support iWarp needs to duplicate
that code. This leads to some curious situations, like the NFS server
apparently always using FRs for this if available (if my reading of
svc_rdma_accept() is correct), or the weird parallel code paths for IB
vs iWarp in RDS:

hch@brick:~/work/linux/net/rds$ ls ib*
ib.c  ib_cm.c  ib.h  ib_rdma.c  ib_recv.c  ib_ring.c  ib_send.c
ib_stats.c  ib_sysctl.c
hch@brick:~/work/linux/net/rds$ ls iw*
iw.c  iw_cm.c  iw.h  iw_rdma.c  iw_recv.c  iw_ring.c  iw_send.c
iw_stats.c  iw_sysctl.c
On Wed, Aug 19, 2015 at 02:56:24PM +0300, Sagi Grimberg wrote:
> On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> > On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> >>> I was hoping we'd move the DMA flush and translate into here and make
> >>> it mandatory. Is there any reason not to do that?
> >>
> >> The reason I didn't add it was so the ULPs can make sure they meet
> >> the restrictions of ib_map_mr_sg(): allow SRP to iterate on its SG
> >> list and set partials, and iSER to detect gaps (they need to dma map
> >> for that).
> >
> > The ULP can always get the sg list's virtual address to check for
> > gaps. Page aligned gaps are always OK.
>
> So I had a go at moving the DMA mapping into ib_map_mr_sg(), and it
> turns out to fit somewhat poorly when the ULP _may_ register memory or
> may just send sg_lists (like storage targets over IB/iWARP). The ULP
> will sometimes use the DMA mapping and sometimes it won't... it feels
> kind of off to me...

You need to split the rkey and lkey API flows to pull this off - the
rkey side never needs to touch an sg, while the lkey side should
always try to use an sg first. I keep saying this: they have
fundamentally different ULP usages.

> 1. if registering - call ib_map_mr_sg (which calls dma_map_sg),
>    else do dma_map_sg
> 2. if registered - call ib_unmap_mr_sg (which calls dma_unmap_sg),
>    else do dma_unmap_sg

From what I've seen in the ULPs, the flow control is generally such
that the MR is 'consumed' even if it isn't used by a send.

So lkey usage is simply split into things that absolutely don't need
an MR, and things that maybe do. The maybe side can go ahead and
always consume the MR resource, but optimize the implementation to an
SG list to avoid a performance hit.

Then the whole API becomes symmetric. The ULP says 'here is a
scatterlist and an lkey MR, make me an ib_sg list' and the core either
packs it as-is into the sg, or it spins up the MR and packs that.

This lets the unmap be symmetric, as the core always dma_unmaps, but
only tears down the MR if it was used.

The cost is that the lkey MR slot is always consumed, which should be
OK because SQE flow control bounds the number of concurrent MRs
required, so consuming an SQE but not an MR doesn't provide an
advantage.

> Also, at the moment, when ULPs are doing either FRWR or FMRs, it's a
> pain to get a non-intrusive conversion.

Without FMR sharing API entry points it is going to be hard to unify
them.. ie the map and alloc API side certainly could be shared..

> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg and leave
> it to the ULP like it does today (at least in the first stage...).

I'm fine with the first stage, but we still really do need to figure
out how to get better code sharing in our API here..

Maybe we can do the rkey side right away, until we can figure out how
to harmonize the rkey sg/mr usage?

Jason
>> 1. if registering - call ib_map_mr_sg (which calls dma_map_sg),
>>    else do dma_map_sg
>> 2. if registered - call ib_unmap_mr_sg (which calls dma_unmap_sg),
>>    else do dma_unmap_sg
>
> From what I've seen in the ULPs, the flow control is generally such
> that the MR is 'consumed' even if it isn't used by a send.

Not really. If registration is not needed, an MR is not consumed. In
fact, in svcrdma the IB code path never uses those, and the iWARP code
path always uses those for RDMA_READs and not RDMA_WRITEs. Also, isert
uses those only when signature is enabled and registration is
required.

> So lkey usage is simply split into things that absolutely don't need
> an MR, and things that maybe do. The maybe side can go ahead and
> always consume the MR resource, but optimize the implementation to an
> SG list to avoid a performance hit.
>
> Then the whole API becomes symmetric. The ULP says 'here is a
> scatterlist and an lkey MR, make me an ib_sg list' and the core either
> packs it as-is into the sg, or it spins up the MR and packs that.

Always consuming an MR resource is an extra lock acquire, given these
are always kept in a pool structure.

>> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg and leave
>> it to the ULP like it does today (at least in the first stage...).
>
> I'm fine with the first stage, but we still really do need to figure
> out how to get better code sharing in our API here..
>
> Maybe we can do the rkey side right away, until we can figure out how
> to harmonize the rkey sg/mr usage?

I'm fine with that. I agree we still need to do better.
On Thu, Aug 20, 2015 at 01:05:59PM +0300, Sagi Grimberg wrote:
> >> 1. if registering - call ib_map_mr_sg (which calls dma_map_sg),
> >>    else do dma_map_sg
> >> 2. if registered - call ib_unmap_mr_sg (which calls dma_unmap_sg),
> >>    else do dma_unmap_sg
> >
> > From what I've seen in the ULPs, the flow control is generally such
> > that the MR is 'consumed' even if it isn't used by a send.
>
> Not really. If registration is not needed, an MR is not consumed. In
> fact, in svcrdma the IB code path never uses those, and the iWARP code
> path always uses those for RDMA_READs and not RDMA_WRITEs. Also, isert
> uses those only when signature is enabled and registration is
> required.

The MR is not *used* but it should be 'consumed' - in the sense that
every RPC slot is associated (implicitly) with an MR; leaving the
unused MR in some kind of pool doesn't really help anything.

Honestly, the MR pool idea doesn't really help anything, it just
creates confusion.

What should be pooled is the 'request slot' itself, in the sense that
if a request slot is in the 'ready to go' pool it is guaranteed to be
able to complete *any* request without blocking. That means the
MR/SQE/CQE resources are all ready to go. Any ancillary memory is
ready to use, etc. The ULP should design its slot with the idea that
it doesn't have to allocate memory, or IB resources, or block, once
the slot becomes 'ready to go'. Review the discussion Chuck and I had
on SQE flow control for a sense of what that means. Understand why the
lifetimes of the MR, the SQE and the slot are all convoluted together
if RDMA is used correctly.

Trying to decouple the sub resources, ie by separately pooling the
MR/SQE/etc, is just unnecessary complexity, IMHO.. NFS client already
had serious bugs in this area.

So, I turn to the idea that every ULP should work as the above, which
means when it gets to working on a 'slot' that implies there is an
actual struct ib_mr resource guaranteed available. This is why I
suggested using the 'struct ib_mr' to guide the SG construction even
if the actual HW MR isn't going to be used. The struct ib_mr is tied
to the slot, so using it has no cost.

-------

But maybe that is too much of a shortcut; thinking about it more, and
about all the other things I've written about, let's just directly
address this issue and add something called 'struct ib_op_slot'.

Data transfer would look like this:

	struct ib_send_wr *cur;

	cur = ib_slot_make_send(&req->op_slot, scatter_list);
	cur->next = ib_slot_make_rdma_read(&next_req->op, scatter_list,
					   rkey, length);
	ib_post_send(qp, cur);

	[.. at CQE time ..]
	if (ib_slot_complete(qp, req->op_slot))
		[.. slot is now 'ready to go' ..]
	else
		[.. otherwise more stuff was posted, have to wait ..]

This forces the ULP to deal with many of the issues. Having a slot
means guaranteed minimum available MR, SQE and CQE resources. That
guaranteed minimum avoids the messy API struggle in my prior writings.

.. and maybe the above is even thinking too small; to Christoph's
earlier musings, I wonder if a slot based middle API could hijack the
entire SCQ processing and have a per-slot callback scheme instead.
That kind of intervention is exactly what we'd need to trivially hide
the FMR difference.

... and now this provides enough context to start talking about
common helper APIs for common ULP things, like the rdma_read switch.
The slot has pre-allocated everything needed to handle the variations.

... which suddenly starts to be really viable because the slot
guarantees SQE availability too.

... and we start having the idea of a slot able to do certain tasks,
and codify that with API help at creation:

	struct nfs_rpc_slot {
		struct ib_op_slot slot;
	};

	struct ib_op_slot_attributes attrs;

	ib_init_slot_attrs(&attrs, ib_pd);
	ib_request_action(&attrs, "args describing RDMA read with N SGEs");

	if (ib_request_action("args describing a requirement for signature"))
		signature_supported = true;
	if (ib_request_action("args describing a requirement for non-page-aligned"))
		byte_sgl_supported = true;

	ib_request_action("args describing SEND with N SGEs");
	ib_request_action("args describing N RDMA reads each with N SGEs");

	for (required slot concurrency)
		ib_alloc_slot(&rpc.slot, &attrs);

Then the alloc just grabs everything required. ..mumble mumble.. some
way to flow into the QP/CQ allocation attributes too..

Essentially, the ULP says 'here is what I want to do with this slot'
and the core code *guarantees* that if the slot is 'ready to go' then
'any single work of any requested type' can be queued without blocking
or memory allocation. Covers SQEs, CQEs, MRs, etc.

ib_request_action is a basic pattern that does various tests and ends
up doing:

	attrs->num_mrs = max(attrs->num_mrs, needed_for_this_action);
	attrs->num_mrs_sge = max(attrs->num_mrs_sge, needed_for_this_action);
	attrs->num_wr_sge = max(attrs->num_wr_sge, needed_for_this_action);
	attrs->num_sqe = max(attrs->num_sqe, needed_for_this_action);
	attrs->num_cqe = max(attrs->num_cqe, needed_for_this_action);

[ie we compute the maximum allocation needed to satisfy the requested
requirement]

Each request could fail; eg if signature is not supported then the
request_action will fail, so we have a more uniform way to talk about
send queue features.

... and the ULP could have a 'heavy' and a 'light' slot pool if that
made some kind of sense for its workload.

So, that is a long road, but maybe there are reasonable interim
stages? Anyhow, conceptually, an idea. It eliminates the hated fmr
pool concept, cleans up bad thinking around queue flow control, and
provides at least a structure to abstract transport differences.

---------

It could look something like this:

	struct ib_op_slot {
		struct ib_mr **mr_list;		// null terminated
		void *wr_memory;
		void *sg_memory;
		unsigned int num_sgs;
	};

	struct ib_send_wr *ib_slot_make_send(struct ib_op_slot *slot,
					     const struct scatter_list *sgl)
	{
		dma_map(sgl);
		if (num_sges(sgl) < slot->num_sgs) {
			// send fits in the sg list
			struct ib_send_wr *wr = slot->wr_memory;

			wr->sg_list = slot->sg_memory;
			.. pack it in ..
			return wr;
		} else {
			// Need to spin up a MR..
			struct {
				struct ib_send_wr frwr_wr;
				struct ib_send_wr send_wr;
			} *wrs = slot->wr_memory;

			wrs->frwr_wr.next = &wrs->send_wr;
			... pack it in ...
			return &wrs->frwr_wr;
		}
		// similar for FMR
	}

.. similar concept for rdma read, etc. ..

ib_request_action makes sure the wr_memory/sg_memory are pre-sized to
accommodate the action. Add optional #ifdef'd debugging to check for
bad ULP usage.

.. function pointers could be used to provide special optimal versions
if necessary ..

Complex things like signature just vanish from the API. The ULP sees
something like:

	if (ib_request_action("args describing a requirement for signature"))
		signature_supported = true;

	wr = ib_slot_make_rdma_write_signature(slot, ....);

Jason
On Thu, Aug 20, 2015 at 01:04:13PM -0600, Jason Gunthorpe wrote:
> Trying to decouple the sub resources, ie by separately pooling the
> MR/SQE/etc, is just unnecessary complexity, IMHO.. NFS client already
> had serious bugs in this area.
>
> So, I turn to the idea that every ULP should work as the above, which
> means when it gets to working on a 'slot' that implies there is an
> actual struct ib_mr resource guaranteed available. This is why I
> suggested using the 'struct ib_mr' to guide the SG construction even
> if the actual HW MR isn't going to be used. The struct ib_mr is tied
> to the slot, so using it has no cost.

How is this going to work for drivers that might consume multiple MRs
per request, like SRP or similar upcoming block drivers? Unless you
want to allocate a potentially large number of MRs for each request,
that scheme doesn't work.

> This forces the ULP to deal with many of the issues. Having a slot
> means guaranteed minimum available MR, SQE and CQE resources. That
> guaranteed minimum avoids the messy API struggle in my prior writings.
>
> .. and maybe the above is even thinking too small; to Christoph's
> earlier musings, I wonder if a slot based middle API could hijack the
> entire SCQ processing and have a per-slot callback scheme instead.
> That kind of intervention is exactly what we'd need to trivially hide
> the FMR difference.

FYI, I have working early patches to do per-WR completion callbacks;
I'll post them after I get them into slightly better shape.

As for your grand schemes: I like some of the ideas there, but we need
to get there gradually. I'd much prefer to finish Sagi's simple
scheme, get my completion work in, add abstractions for RDMA READ and
WRITE scatterlist mapping, and build things up slowly.
On Thu, Aug 20, 2015 at 11:34:58PM -0700, Christoph Hellwig wrote:
> How is this going to work for drivers that might consume multiple MRs
> per request, like SRP or similar upcoming block drivers? Unless you
> want to allocate a potentially large number of MRs for each request,
> that scheme doesn't work.

There are at least two approaches, and it depends on how flow control
to the driving layer works out. Look at what the ULP does when the
existing MR pool exhausts:

- Exhaustion is not allowed. In this model every slot must truly
  handle every required action without blocking. The ULP somehow
  wrangles things so pool exhaustion is not possible. NFS client is a
  good example. Where NFS client went wrong is that the MR alone is
  not enough: issuing a request requires SQE/CQE resources, and
  failing to track that caused hard to find bugs.

- Exhaustion is allowed, and somehow the ULP is able to stop
  processing. In this case you'd just swap MRs for slots in the pool,
  probably having pools of different kinds of slots to optimize
  resource use. Pool draw down includes SQE/CQE/etc resources as
  well. A multiple-rkey MR case would just draw down the required
  slots from the pool.

I suspect the client side tends to lean toward the first option and
the target side the second - targets can always do back pressure flow
control by simply halting RQE processing, and it makes a lot of sense
on a target to globally pool slots across all client QPs.

This idea of a slot is just a higher level structure we can hang other
stuff off - like the sg/mr decision, the iWarp RDMA READ change, SQE
accounting. We don't need to start with everything, but I'm looking at
Sagi's notes on trying to factor the lkey side code paths and thinking
a broader abstraction than a raw MR is needed to solve that.

> FYI, I have working early patches to do per-WR completion callbacks;
> I'll post them after I get them into slightly better shape.

Interesting..

> As for your grand schemes: I like some of the ideas there, but we need
> to get there gradually. I'd much prefer to finish Sagi's simple
> scheme, get my completion work in, add abstractions for RDMA READ and
> WRITE scatterlist mapping, and build things up slowly.

Yes, absolutely, we have to go slowly - but exploring how we can fit
this together in some other way can help guide some of the smaller
choices.

Sagi could drop the lkey side; getting the rkey side in order would be
nice enough. Something like this is a direction to address the lkey
side, ie we could 1:1 replace MR with 'slot' and use that to factor
the lkey code paths. Over time slot can grow organically to factor
more code.

Slot would be a new object for the core, one that is guaranteed to
last from post to completion, which seems like exactly the sort of
object a completion callback scheme would benefit from: guaranteed
memory to hang callback pointers/etc off.

Jason
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index beed431..9875163 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1481,3 +1481,126 @@ int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		mr->device->check_mr_status(mr, check_mask, mr_status) : -ENOSYS;
 }
 EXPORT_SYMBOL(ib_check_mr_status);
+
+
+/**
+ * ib_map_mr_sg() - Populates MR with a dma mapped SG list
+ * @mr: memory region
+ * @sg: dma mapped scatterlist
+ * @sg_nents: number of entries in sg
+ * @access: access permissions
+ *
+ * After this completes successfully, the memory region is ready
+ * for fast registration.
+ */
+int ib_map_mr_sg(struct ib_mr *mr,
+		 struct scatterlist *sg,
+		 unsigned short sg_nents,
+		 unsigned int access)
+{
+	int rc;
+
+	if (!mr->device->map_mr_sg)
+		return -ENOSYS;
+
+	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
+	if (!rc)
+		mr->access = access;
+
+	return rc;
+}
+EXPORT_SYMBOL(ib_map_mr_sg);
+
+/**
+ * ib_sg_to_pages() - Convert a sg list to a page vector
+ * @dev: ib device
+ * @sgl: dma mapped scatterlist
+ * @sg_nents: number of entries in sg
+ * @max_pages: maximum pages allowed
+ * @pages: output page vector
+ * @npages: output number of mapped pages
+ * @length: output total byte length
+ * @offset: output first byte offset
+ *
+ * Core service helper for drivers to convert a scatter
+ * list to a page vector. The assumption is that the
+ * sg must meet the following conditions:
+ * - Only the first sg is allowed to have an offset
+ * - All the elements are of the same size - PAGE_SIZE
+ * - The last element is allowed to have length less than
+ *   PAGE_SIZE
+ *
+ * If any of those conditions is not met, the routine will
+ * fail with EINVAL.
+ */
+int ib_sg_to_pages(struct scatterlist *sgl,
+		   unsigned short sg_nents,
+		   unsigned short max_pages,
+		   u64 *pages, u32 *npages,
+		   u32 *length, u64 *offset)
+{
+	struct scatterlist *sg;
+	u64 last_end_dma_addr = 0, last_page_addr = 0;
+	unsigned int last_page_off = 0;
+	int i, j = 0;
+
+	/* TODO: We can do better with huge pages */
+
+	*offset = sg_dma_address(&sgl[0]);
+	*length = 0;
+
+	for_each_sg(sgl, sg, sg_nents, i) {
+		u64 dma_addr = sg_dma_address(sg);
+		unsigned int dma_len = sg_dma_len(sg);
+		u64 end_dma_addr = dma_addr + dma_len;
+		u64 page_addr = dma_addr & PAGE_MASK;
+
+		*length += dma_len;
+
+		/* Fail if we ran out of pages */
+		if (unlikely(j > max_pages))
+			return -EINVAL;
+
+		if (i && sg->offset) {
+			if (unlikely((last_end_dma_addr) != dma_addr)) {
+				/* gap - fail */
+				goto err;
+			}
+			if (last_page_off + dma_len < PAGE_SIZE) {
+				/* chunk this fragment with the last */
+				last_end_dma_addr += dma_len;
+				last_page_off += dma_len;
+				continue;
+			} else {
+				/* map starting from the next page */
+				page_addr = last_page_addr + PAGE_SIZE;
+				dma_len -= PAGE_SIZE - last_page_off;
+			}
+		}
+
+		do {
+			pages[j++] = page_addr;
+			page_addr += PAGE_SIZE;
+		} while (page_addr < end_dma_addr);
+
+		last_end_dma_addr = end_dma_addr;
+		last_page_addr = end_dma_addr & PAGE_MASK;
+		last_page_off = end_dma_addr & ~PAGE_MASK;
+	}
+
+	*npages = j;
+
+	return 0;
+err:
+	pr_err("RDMA alignment violation\n");
+	for_each_sg(sgl, sg, sg_nents, i) {
+		u64 dma_addr = sg_dma_address(sg);
+		unsigned int dma_len = sg_dma_len(sg);
+
+		pr_err("sg[%d]: offset=0x%x, dma_addr=0x%llx, dma_len=0x%x\n",
+		       i, sg->offset, dma_addr, dma_len);
+	}
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(ib_sg_to_pages);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7a93e2d..d543fee 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1013,6 +1013,7 @@ enum ib_wr_opcode {
 	IB_WR_RDMA_READ_WITH_INV,
 	IB_WR_LOCAL_INV,
 	IB_WR_FAST_REG_MR,
+	IB_WR_FASTREG_MR,
 	IB_WR_MASKED_ATOMIC_CMP_AND_SWP,
 	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
 	IB_WR_BIND_MW,
@@ -1117,6 +1118,10 @@ struct ib_send_wr {
 			u32	rkey;
 		} fast_reg;
 		struct {
+			struct ib_mr *mr;
+			u32	key;
+		} fastreg;
+		struct {
 			struct ib_mw *mw;
 			/* The new rkey for the memory window. */
 			u32	rkey;
@@ -1316,6 +1321,9 @@ struct ib_mr {
 	struct ib_uobject *uobject;
 	u32		   lkey;
 	u32		   rkey;
+	int		   access;
+	u64		   iova;
+	u32		   length;
 	atomic_t	   usecnt; /* count number of MWs */
 };
 
@@ -1661,6 +1669,9 @@ struct ib_device {
 						      enum ib_mr_type mr_type,
 						      u32 max_entries,
 						      u32 flags);
+	int			   (*map_mr_sg)(struct ib_mr *mr,
+						struct scatterlist *sg,
+						unsigned short sg_nents);
 	struct ib_fast_reg_page_list *(*alloc_fast_reg_page_list)(struct ib_device *device,
 								   int page_list_len);
 	void			   (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list);
@@ -2991,4 +3002,30 @@ static inline int ib_check_mr_access(int flags)
 int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		       struct ib_mr_status *mr_status);
 
+int ib_map_mr_sg(struct ib_mr *mr,
+		 struct scatterlist *sg,
+		 unsigned short sg_nents,
+		 unsigned int access);
+
+int ib_sg_to_pages(struct scatterlist *sgl,
+		   unsigned short sg_nents,
+		   unsigned short max_pages,
+		   u64 *pages, u32 *npages,
+		   u32 *length, u64 *offset);
+
+static inline void
+ib_set_fastreg_wr(struct ib_mr *mr,
+		  u32 key,
+		  uintptr_t wr_id,
+		  bool signaled,
+		  struct ib_send_wr *wr)
+{
+	wr->opcode = IB_WR_FASTREG_MR;
+	wr->wr_id = wr_id;
+	wr->send_flags = signaled ? IB_SEND_SIGNALED : 0;
+	wr->num_sge = 0;
+	wr->wr.fastreg.mr = mr;
+	wr->wr.fastreg.key = key;
+}
+
 #endif /* IB_VERBS_H */
The new fast registration verb receives a struct scatterlist and
converts it to a page list under the verbs API. The user is provided
with a new verb, ib_map_mr_sg, and a helper to set the send work
request structure.

The drivers are handed a generic helper that converts a scatterlist
into a vector of pages. Given that some drivers keep a shadow mapped
page list, I expect that drivers might use their own routines to avoid
the extra copies.

The new registration API is added alongside fast_reg for now, but once
all drivers and ULPs are ported, we can drop the old registration API.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
---
 drivers/infiniband/core/verbs.c | 123 ++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h         |  37 ++++++++++++
 2 files changed, 160 insertions(+)
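For reference, a minimal ULP-side usage sketch of the API introduced
by this patch (error unwinding, the QP/CQ setup and the completion
handling are omitted; dev, qp, mr, sg, sg_nents and req stand for the
caller's objects, and the scatterlist is dma mapped by the caller, as
discussed in this thread):

	struct ib_send_wr fastreg_wr, *bad_wr;
	int n, ret;

	n = ib_dma_map_sg(dev, sg, sg_nents, DMA_TO_DEVICE);
	if (!n)
		return -ENOMEM;

	ret = ib_map_mr_sg(mr, sg, n,
			   IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ);
	if (ret)
		goto out_unmap;

	/* registration itself happens at post time */
	ib_set_fastreg_wr(mr, mr->rkey, (uintptr_t)req, true, &fastreg_wr);
	ret = ib_post_send(qp, &fastreg_wr, &bad_wr);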