Message ID | 20191003201858.11666-1-dave@stgolabs.net (mailing list archive) |
---|---|
Series | lib/interval-tree: move to half closed intervals |
On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> It has been discussed[1,2] that almost all users of interval trees would better
> be served if the intervals were actually not [a,b], but instead [a, b). This

So how does a user represent a range from ULONG_MAX to ULONG_MAX now?

I think the problem is that large parts of the kernel just don't consider
integer overflow. Because we write in C, it's natural to write:

	for (i = start; i < end; i++)

and just assume that we never need to hit ULONG_MAX or UINT_MAX.
If we're storing addresses, that's generally true -- most architectures
don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
have trouble with PTR_ERR). If you're looking at file sizes, that's
not true on 32-bit machines, and we've definitely seen filesystem bugs
with files nudging up on 16TB (on 32 bit with 4k page size). Or block
driver bugs with similarly sized block devices.

So, yeah, easier to use. But damning corner cases.
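To make the ULONG_MAX corner case concrete, here is a minimal userspace sketch. The type and function names (`closed_iv`, `half_open_iv`, `count_closed`) are made up for illustration and are not the kernel's interval tree API. It shows that a closed interval can hold the single point ULONG_MAX, that [ULONG_MAX, ULONG_MAX) is empty, and one possible way to walk a closed range without ever computing last + 1.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the kernel's interval tree nodes. */
struct closed_iv    { unsigned long start, last; }; /* [start, last] */
struct half_open_iv { unsigned long start, end;  }; /* [start, end)  */

/* True if x lies in the closed interval [start, last]. */
static inline bool closed_contains(struct closed_iv iv, unsigned long x)
{
	return iv.start <= x && x <= iv.last;
}

/* True if x lies in the half-open interval [start, end). */
static inline bool half_open_contains(struct half_open_iv iv, unsigned long x)
{
	return iv.start <= x && x < iv.end;
}

/*
 * Visit every value in [start, last] and return how many were visited.
 * The loop tests i == last before incrementing, so it never forms
 * last + 1 and terminates even when last == ULONG_MAX.
 */
static inline unsigned long count_closed(unsigned long start, unsigned long last)
{
	unsigned long i, n = 0;

	if (start > last)
		return 0;
	for (i = start; ; i++) {
		n++;
		if (i == last)
			break;
	}
	return n;
}
```

With these definitions, `closed_contains()` accepts ULONG_MAX for the interval {ULONG_MAX, ULONG_MAX}, while the half-open interval with the same two endpoints contains nothing.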
On Thu, 03 Oct 2019, Matthew Wilcox wrote:

>On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
>> It has been discussed[1,2] that almost all users of interval trees would better
>> be served if the intervals were actually not [a,b], but instead [a, b). This
>
>So how does a user represent a range from ULONG_MAX to ULONG_MAX now?

I would assume that any such lookups would be stab queries (anon/vma interval
tree). So both anon and files. And yeah, I blissfully ignored any overflow
scenarios. This should at least be documented.

>
>I think the problem is that large parts of the kernel just don't consider
>integer overflow. Because we write in C, it's natural to write:
>
>	for (i = start; i < end; i++)
>
>and just assume that we never need to hit ULONG_MAX or UINT_MAX.

Similarly, I did not adjust queries such as 0 to ULONG_MAX, which are actually
real, then again any intersecting ranges will most likely not even be close to
end.

>If we're storing addresses, that's generally true -- most architectures
>don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
>have trouble with PTR_ERR). If you're looking at file sizes, that's
>not true on 32-bit machines, and we've definitely seen filesystem bugs
>with files nudging up on 16TB (on 32 bit with 4k page size). Or block
>driver bugs with similarly sized block devices.
>
>So, yeah, easier to use. But damning corner cases.

I agree.

Thanks,
Davidlohr
On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> Hi,
>
> It has been discussed[1,2] that almost all users of interval trees would better
> be served if the intervals were actually not [a,b], but instead [a, b). This
> series attempts to convert all callers by way of transitioning from using
> "interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
> we remove the former.
>
> Patch 1: adds a call that will make patch 8 easier to review by introducing stab
> queries for the vma interval tree.
>
> Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
> uses [a,b) intervals.
>
> Patch 3-9: converts, in baby steps (as much as possible), each interval tree to
> the new [a,b) one. It is done this way also to maintain bisectability.
> Most conversions are pretty straightforward, however, there are some
> creative ways in which some callers use the interval 'end' when going
> through intersecting ranges within a tree. Ie: patch 3, 6 and 9.
>
> Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.
>
> Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.
>
> This has been lightly tested, and certainly not on driver paths that do non
> trivial conversions. Also needs more eyeballs as conversions can be easily
> missed (even when I've tried mitigating this by renaming the endpoint from 'last'
> to 'end' in each corresponding structure).
>
> Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
> relevant lists (mm, dri, rdma); sorry if you consider this spam.
>
> Applies on top of today's linux-next tree. Please consider for v5.5.
>
> Thanks!
>
> [1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupKM_iKozw@mail.gmail.com/

Hurm, this is not entirely accurate. Most users do actually want
overlapping and multiple ranges. I just studied this extensively:

radeon_mn actually wants overlapping but seems to misunderstand the
interval_tree API and actively tries hard to prevent overlapping at
great cost and complexity. I have a patch to delete all of this and
just be overlapping.

amdgpu_mn copied the wrongness from radeon_mn

All the DRM drivers are basically the same here, tracking userspace
controlled VAs, so overlapping is essential

hfi1/mmu_rb definitely needs overlapping as it is dealing with
userspace VA ranges under control of userspace. As do the other
infiniband users.

vhost probably doesn't overlap in the normal case, but again userspace
could trigger overlap in some pathological case.

The [start,last] allows the interval to cover up to ULONG_MAX. I don't
know if this is needed however. Many users are using userspace VAs
here. Is there any kernel configuration where ULONG_MAX is a valid
userspace pointer? Ie 32 bit 4G userspace? I don't know.

Many users seemed to have bugs where they were taking a userspace
controlled start + length and converting them into a start/end for
interval tree without overflow protection (whoops)

Also I have a series already cooking to delete several of these
interval tree users, which will terribly conflict with this :\

Is it really necessary to make such churn for such a tiny API change?

Jason
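As a rough illustration of what "overlapping" means under the two conventions discussed in this thread, the sketch below gives the usual overlap tests for closed and half-open ranges. The names `struct range`, `closed_overlaps` and `half_open_overlaps` are hypothetical and do not come from the kernel headers; the half-open test assumes both ranges are non-empty.

```c
#include <stdbool.h>

/* Hypothetical range type: 'hi' is 'last' (closed) or 'end' (half-open). */
struct range { unsigned long lo, hi; };

/* Closed ranges [a.lo, a.hi] and [b.lo, b.hi] share at least one point. */
static inline bool closed_overlaps(struct range a, struct range b)
{
	return a.lo <= b.hi && b.lo <= a.hi;
}

/*
 * Half-open ranges [a.lo, a.hi) and [b.lo, b.hi) share at least one point.
 * Assumes both ranges are non-empty (lo < hi).
 */
static inline bool half_open_overlaps(struct range a, struct range b)
{
	return a.lo < b.hi && b.lo < a.hi;
}
```

Two adjacent 4k ranges do not overlap in either form: [0, 4095] and [4096, 8191] as closed ranges, or [0, 4096) and [4096, 8192) as half-open ones. Mixing the two conventions, for example passing an 'end' where a 'last' is expected, would report a spurious one-byte overlap.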
On Thu, 03 Oct 2019, Jason Gunthorpe wrote:

>On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
>> Hi,
>>
>> It has been discussed[1,2] that almost all users of interval trees would better
>> be served if the intervals were actually not [a,b], but instead [a, b). This
>> series attempts to convert all callers by way of transitioning from using
>> "interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
>> we remove the former.
>>
>> Patch 1: adds a call that will make patch 8 easier to review by introducing stab
>> queries for the vma interval tree.
>>
>> Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
>> uses [a,b) intervals.
>>
>> Patch 3-9: converts, in baby steps (as much as possible), each interval tree to
>> the new [a,b) one. It is done this way also to maintain bisectability.
>> Most conversions are pretty straightforward, however, there are some
>> creative ways in which some callers use the interval 'end' when going
>> through intersecting ranges within a tree. Ie: patch 3, 6 and 9.
>>
>> Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.
>>
>> Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.
>>
>> This has been lightly tested, and certainly not on driver paths that do non
>> trivial conversions. Also needs more eyeballs as conversions can be easily
>> missed (even when I've tried mitigating this by renaming the endpoint from 'last'
>> to 'end' in each corresponding structure).
>>
>> Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
>> relevant lists (mm, dri, rdma); sorry if you consider this spam.
>>
>> Applies on top of today's linux-next tree. Please consider for v5.5.
>>
>> Thanks!
>>
>> [1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupKM_iKozw@mail.gmail.com/
>
>Hurm, this is not entirely accurate. Most users do actually want
>overlapping and multiple ranges. I just studied this extensively:
>
>radeon_mn actually wants overlapping but seems to misunderstand the
>interval_tree API and actively tries hard to prevent overlapping at
>great cost and complexity. I have a patch to delete all of this and
>just be overlapping.
>
>amdgpu_mn copied the wrongness from radeon_mn
>
>All the DRM drivers are basically the same here, tracking userspace
>controlled VAs, so overlapping is essential
>
>hfi1/mmu_rb definitely needs overlapping as it is dealing with
>userspace VA ranges under control of userspace. As do the other
>infiniband users.
>
>vhost probably doesn't overlap in the normal case, but again userspace
>could trigger overlap in some pathological case.
>
>The [start,last] allows the interval to cover up to ULONG_MAX. I don't
>know if this is needed however. Many users are using userspace VAs
>here. Is there any kernel configuration where ULONG_MAX is a valid
>userspace pointer? Ie 32 bit 4G userspace? I don't know.
>
>Many users seemed to have bugs where they were taking a userspace
>controlled start + length and converting them into a start/end for
>interval tree without overflow protection (whoops)
>
>Also I have a series already cooking to delete several of these
>interval tree users, which will terribly conflict with this :\

I have no problem redoing after your changes; if it's worth it at all.

>
>Is it really necessary to make such churn for such a tiny API change?

I agree, and was kind of expecting this. In general the diffstat ended up
being larger than I initially hoped for. Maybe after your removals I can
look into this again.

Thanks,
Davidlohr
On Thu, Oct 03, 2019 at 01:32:50PM -0700, Matthew Wilcox wrote:
> On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> > It has been discussed[1,2] that almost all users of interval trees would better
> > be served if the intervals were actually not [a,b], but instead [a, b). This
>
> So how does a user represent a range from ULONG_MAX to ULONG_MAX now?
>
> I think the problem is that large parts of the kernel just don't consider
> integer overflow. Because we write in C, it's natural to write:
>
>	for (i = start; i < end; i++)
>
> and just assume that we never need to hit ULONG_MAX or UINT_MAX.
> If we're storing addresses, that's generally true -- most architectures
> don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
> have trouble with PTR_ERR). If you're looking at file sizes, that's
> not true on 32-bit machines, and we've definitely seen filesystem bugs
> with files nudging up on 16TB (on 32 bit with 4k page size). Or block
> driver bugs with similarly sized block devices.
>
> So, yeah, easier to use. But damning corner cases.

Yeah, I wanted to ask - is the case where pgoff == ULONG_MAX (i.e., the last
block of a file that is exactly 16TB) currently supported on 32-bit archs?

I have no idea if I am supposed to care about this or not...
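The 16TB figure follows from simple arithmetic. A small sketch, assuming a 32-bit unsigned long (modeled here with `uint32_t`) and 4 KiB pages; the helper names and `PAGE_SHIFT_4K` are illustrative only and are not kernel definitions:

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12 /* assumed 4 KiB pages */

/* Size in bytes of a file whose last page has index last_pgoff. */
static inline uint64_t file_size_for_last_pgoff(uint32_t last_pgoff)
{
	return ((uint64_t)last_pgoff + 1) << PAGE_SHIFT_4K;
}

/*
 * Half-open end index for a range whose last page is last_pgoff,
 * computed in 32 bits the way a 32-bit unsigned long would be.
 * For last_pgoff == UINT32_MAX the result wraps to 0.
 */
static inline uint32_t half_open_end_pgoff(uint32_t last_pgoff)
{
	return (uint32_t)(last_pgoff + 1u);
}
```

With last_pgoff == UINT32_MAX the file size is 2^32 pages of 2^12 bytes, that is 2^44 bytes or 16 TiB, and the half-open end index for that last page is no longer representable in 32 bits.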
Hi Jason,

On Thu, Oct 3, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> Hurm, this is not entirely accurate. Most users do actually want
> overlapping and multiple ranges. I just studied this extensively:

(Just curious, are you the person we discussed this with after the
Maple Tree talk at LPC 2019?)

I think we have two separate API problems there:
- overlapping vs non-overlapping intervals (the interval tree API
  supports overlapping intervals, but some users are confused about
  this)
- closed vs half-open interval definitions

It looks like you have been looking mostly at the first issue, which I
expect could simplify several interval tree users considerably, while
Davidlohr is addressing the second issue here.

> radeon_mn actually wants overlapping but seems to misunderstand the
> interval_tree API and actively tries hard to prevent overlapping at
> great cost and complexity. I have a patch to delete all of this and
> just be overlapping.
>
> amdgpu_mn copied the wrongness from radeon_mn
>
> All the DRM drivers are basically the same here, tracking userspace
> controlled VAs, so overlapping is essential
>
> hfi1/mmu_rb definitely needs overlapping as it is dealing with
> userspace VA ranges under control of userspace. As do the other
> infiniband users.

Do you have a handle on what usnic is doing with its intervals?
usnic_uiom_insert_interval() has some complicated logic to avoid
having overlapping intervals, which is very confusing to me.

> vhost probably doesn't overlap in the normal case, but again userspace
> could trigger overlap in some pathological case.
>
> The [start,last] allows the interval to cover up to ULONG_MAX. I don't
> know if this is needed however. Many users are using userspace VAs
> here. Is there any kernel configuration where ULONG_MAX is a valid
> userspace pointer? Ie 32 bit 4G userspace? I don't know.
>
> Many users seemed to have bugs where they were taking a userspace
> controlled start + length and converting them into a start/end for
> interval tree without overflow protection (whoops)
>
> Also I have a series already cooking to delete several of these
> interval tree users, which will terribly conflict with this :\
>
> Is it really necessary to make such churn for such a tiny API change?

My take is that this (Davidlohr's) patch series does not necessarily
need to be applied all at once - we could get the first change in
(adding the interval_tree_gen.h header), and convert the first few
users, without getting them all at once, as long as we have a plan for
finishing the work. So, if you have cleanups in progress in some of
the files, just tell us which ones and we can leave them out from the
first pass.

Thanks,
On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
> My take is that this (Davidlohr's) patch series does not necessarily
> need to be applied all at once - we could get the first change in
> (adding the interval_tree_gen.h header), and convert the first few
> users, without getting them all at once, as long as we have a plan for
> finishing the work. So, if you have cleanups in progress in some of
> the files, just tell us which ones and we can leave them out from the
> first pass.

Since we have users which do need to use the full ULONG_MAX range
(as pointed out by Christian Koenig), I don't think adding a second
implementation which is half-open is a good idea. It'll only lead to
confusion.
On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
> Hi Jason,
>
> On Thu, Oct 3, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > Hurm, this is not entirely accurate. Most users do actually want
> > overlapping and multiple ranges. I just studied this extensively:
>
> (Just curious, are you the person we discussed this with after the
> Maple Tree talk at LPC 2019?)

Possibly!

> I think we have two separate API problems there:
> - overlapping vs non-overlapping intervals (the interval tree API
>   supports overlapping intervals, but some users are confused about
>   this)

I think we just have a bunch of confused drivers, ie the two drm
drivers sure look confused to me.

> - closed vs half-open interval definitions

I'm not sure why this is a big problem. We may actually just have bugs
in handling the '-1', as it is supposed to be written as
start + (size-1) so that start + size == ULONG_MAX+1 works properly.

> > hfi1/mmu_rb definitely needs overlapping as it is dealing with
> > userspace VA ranges under control of userspace. As do the other
> > infiniband users.
>
> Do you have a handle on what usnic is doing with its intervals?
> usnic_uiom_insert_interval() has some complicated logic to avoid
> having overlapping intervals, which is very confusing to me.

I don't know why it is so complicated, but I can say that it is
storing userspace VA's in that tree.

I have some feeling this driver is trying to use the IOMMU to create a
mirror of the userspace VA.

Userspace can request the HW be able to access any set of overlapping
regions and so the driver must intersect all the ranges and compute a
list of VA pages to IOMMU map. Just guessing.

Jason
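The "start + (size-1)" point can be sketched as follows. This is a hypothetical userspace helper, not code taken from any of the drivers mentioned in the thread; the name `range_to_closed` and its return convention are assumptions. It rejects empty and wrapping ranges, but still accepts the case where start + size is exactly ULONG_MAX + 1.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Convert (start, size) into a closed [start, *last] range.
 * Returns false for size == 0 or if the range would wrap past ULONG_MAX;
 * *last is only written on success.
 */
static inline bool range_to_closed(unsigned long start, unsigned long size,
				   unsigned long *last)
{
	if (size == 0)
		return false;
	/* size - 1 cannot underflow here; make sure it fits above start. */
	if (size - 1 > ULONG_MAX - start)
		return false;
	*last = start + (size - 1);
	return true;
}
```

A check written as `start + size < start` would reject a range ending exactly at ULONG_MAX, because start + size wraps to 0 in that case. Comparing `size - 1` against `ULONG_MAX - start` avoids forming the out-of-range sum at all.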
On Fri, 04 Oct 2019, Matthew Wilcox wrote:

>On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
>> My take is that this (Davidlohr's) patch series does not necessarily
>> need to be applied all at once - we could get the first change in
>> (adding the interval_tree_gen.h header), and convert the first few
>> users, without getting them all at once, as long as we have a plan for
>> finishing the work. So, if you have cleanups in progress in some of
>> the files, just tell us which ones and we can leave them out from the
>> first pass.
>
>Since we have users which do need to use the full ULONG_MAX range
>(as pointed out by Christian Koenig), I don't think adding a second
>implementation which is half-open is a good idea. It'll only lead to
>confusion.

Right, we should not have two implementations.

Thanks,
Davidlohr