Message ID | CAPM=9tzX71UKndfu8JECMOzc9kf4s4pp9cWTMWwE476cQXt_Yw@mail.gmail.com (mailing list archive)
---|---
State | New, archived
Series | [git,pull] drm fixes for 6.11-rc6
On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@gmail.com> wrote:
>
> The TTM revert is due to some stuttering graphical apps probably due
> to longer stalls while prefaulting.

Yeah, trying to pre-fault a PMD worth of pages in one go is just crazy talk.

Now, if it was PMD-aligned and you faulted in a single PMD, that would
be different. But just doing prn_insert_page() in a loop is insane.

The code doesn't even stop when it hits a page that already existed,
and it keeps locking and unlocking the last-level page table over and
over again.

Honestly, that code is questionable even for the *small* value, much
less the "a PMD size" case.

Now, if you have an array of 'struct page *', you can use
vm_insert_pages(), and that's reasonably efficient.

And if you have a *contiguous* area of pfns, you can use remap_pfn_range().

But that "insert one pfn at a time" that the drm layer does is
complete garbage. You're not speeding anything up, you're just digging
deeper.

              Linus
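For reference, the two patterns being contrasted above look roughly like this from a driver's point of view. This is a kernel-side sketch only (it does not build stand-alone), and the function names `my_fault_prefault` and `my_map_pages` are illustrative, not actual TTM code:

```c
#include <linux/mm.h>

/* The pattern being criticized: one pfn per call, so the last-level
 * page-table lock is taken and dropped on every single iteration. */
static vm_fault_t my_fault_prefault(struct vm_fault *vmf,
				    const unsigned long *pfns,
				    unsigned long npages)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = vmf->address;
	vm_fault_t ret = VM_FAULT_NOPAGE;
	unsigned long i;

	for (i = 0; i < npages; i++, addr += PAGE_SIZE)
		ret = vmf_insert_pfn(vma, addr, pfns[i]); /* lock/unlock each time */
	return ret;
}

/* If the backing store is an array of struct page pointers, the batched
 * helper walks the page tables and takes the lock once per leaf table: */
static int my_map_pages(struct vm_area_struct *vma, unsigned long addr,
			struct page **pages, unsigned long npages)
{
	unsigned long num = npages; /* in/out: updated as pages are inserted */

	return vm_insert_pages(vma, addr, pages, &num);
}
```

Note that `vm_insert_pages()` only applies when there are `struct page` pointers at all; device memory exposed purely as pfns cannot use it, which is part of why TTM-style drivers ended up with the per-pfn loop.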
The pull request you sent on Fri, 30 Aug 2024 12:08:41 +1000:
> https://gitlab.freedesktop.org/drm/kernel.git tags/drm-fixes-2024-08-30
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/20371ba120635d9ab7fc7670497105af8f33eb08
Thank you!
On Fri, 30 Aug 2024 at 12:32, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@gmail.com> wrote:
> >
> > The TTM revert is due to some stuttering graphical apps probably due
> > to longer stalls while prefaulting.
>
> Yeah, trying to pre-fault a PMD worth of pages in one go is just crazy talk.
>
> Now, if it was PMD-aligned and you faulted in a single PMD, that would
> be different. But just doing prn_insert_page() in a loop is insane.
>
> The code doesn't even stop when it hits a page that already existed,
> and it keeps locking and unlocking the last-level page table over and
> over again.
>
> Honestly, that code is questionable even for the *small* value, much
> less the "a PMD size" case.
>
> Now, if you have an array of 'struct page *', you can use
> vm_insert_pages(), and that's reasonably efficient.
>
> And if you have a *contiguous* area of pfns, you can use remap_pfn_range().
>
> But that "insert one pfn at a time" that the drm layer does is
> complete garbage. You're not speeding anything up, you're just digging
> deeper.

I wonder if there is functionality that could be provided in a common
helper by the mm layers, or if there would be too many locking
interactions to make it sane.

It seems too fraught with danger for drivers or subsystems to be just
doing this in the simplest way that isn't actually that smart.

Dave.
On Mon, 2024-09-02 at 08:13 +1000, Dave Airlie wrote:
> On Fri, 30 Aug 2024 at 12:32, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > The TTM revert is due to some stuttering graphical apps probably due
> > > to longer stalls while prefaulting.
> >
> > Yeah, trying to pre-fault a PMD worth of pages in one go is just crazy talk.
> >
> > Now, if it was PMD-aligned and you faulted in a single PMD, that would
> > be different. But just doing prn_insert_page() in a loop is insane.
> >
> > The code doesn't even stop when it hits a page that already existed,
> > and it keeps locking and unlocking the last-level page table over and
> > over again.
> >
> > Honestly, that code is questionable even for the *small* value, much
> > less the "a PMD size" case.
> >
> > Now, if you have an array of 'struct page *', you can use
> > vm_insert_pages(), and that's reasonably efficient.
> >
> > And if you have a *contiguous* area of pfns, you can use remap_pfn_range().
> >
> > But that "insert one pfn at a time" that the drm layer does is
> > complete garbage. You're not speeding anything up, you're just digging
> > deeper.
>
> I wonder if there is functionality that could be provided in a common
> helper by the mm layers, or if there would be too many locking
> interactions to make it sane.
>
> It seems too fraught with danger for drivers or subsystems to be just
> doing this in the simplest way that isn't actually that smart.

Hmm. I see even the "Don't error on prefaults" check was broken at some
point :/.

There have been numerous ways to try to address this.

remap_pfn_range() was tried last, at least in the context of the i915
driver, IIRC by Christoph Hellwig, but had to be ripped out since it
requires the mmap_lock in write mode. Here we have it only in read
mode.

Then there's the apply_to_page_range() used by the igfx functionality
of the i915 driver. I don't think we should go that route without
turning it into something like vm_insert_pfns() with proper checking.
This approach populates all entries of a buffer object.

Finally there's the huge fault attempt that had to be ripped out due to
the lack of pmd_special and pud_special flags and the resulting clashes
with gup_fast.

Perhaps a combination of the latter two, if properly implemented, would
be the best choice.

/Thomas

> Dave.
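The apply_to_page_range() shape Thomas mentions is, roughly, a callback invoked per pte with the page tables pre-allocated. The sketch below is illustrative only (not buildable stand-alone, and not the i915 code); `struct remap_data` and both function names are hypothetical:

```c
#include <linux/mm.h>

struct remap_data {
	struct mm_struct *mm;
	unsigned long pfn;	/* next pfn to insert */
	pgprot_t prot;
};

static int remap_pfn_cb(pte_t *pte, unsigned long addr, void *data)
{
	struct remap_data *r = data;

	/* Note: no "is this pte already populated?" check here - which is
	 * exactly why this would need to grow into a properly checked
	 * vm_insert_pfns()-style helper before wider use. */
	set_pte_at(r->mm, addr, pte,
		   pte_mkspecial(pfn_pte(r->pfn++, r->prot)));
	return 0;
}

static int remap_range(struct vm_area_struct *vma, unsigned long addr,
		       unsigned long pfn, unsigned long size)
{
	struct remap_data r = {
		.mm = vma->vm_mm,
		.pfn = pfn,
		.prot = vma->vm_page_prot,
	};

	/* Allocates any missing page tables, then calls remap_pfn_cb()
	 * once per pte covering [addr, addr + size). */
	return apply_to_page_range(vma->vm_mm, addr, size, remap_pfn_cb, &r);
}
```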
On 02.09.24 11:32, Thomas Hellström wrote:
> On Mon, 2024-09-02 at 08:13 +1000, Dave Airlie wrote:
>> On Fri, 30 Aug 2024 at 12:32, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@gmail.com> wrote:
>>>> The TTM revert is due to some stuttering graphical apps probably due
>>>> to longer stalls while prefaulting.
>>> Yeah, trying to pre-fault a PMD worth of pages in one go is just
>>> crazy talk.
>>>
>>> Now, if it was PMD-aligned and you faulted in a single PMD, that would
>>> be different. But just doing prn_insert_page() in a loop is insane.
>>>
>>> The code doesn't even stop when it hits a page that already existed,
>>> and it keeps locking and unlocking the last-level page table over and
>>> over again.
>>>
>>> Honestly, that code is questionable even for the *small* value, much
>>> less the "a PMD size" case.
>>>
>>> Now, if you have an array of 'struct page *', you can use
>>> vm_insert_pages(), and that's reasonably efficient.
>>>
>>> And if you have a *contiguous* area of pfns, you can use remap_pfn_range().
>>>
>>> But that "insert one pfn at a time" that the drm layer does is
>>> complete garbage. You're not speeding anything up, you're just digging
>>> deeper.
>
>> I wonder if there is functionality that could be provided in a common
>> helper by the mm layers, or if there would be too many locking
>> interactions to make it sane.
>>
>> It seems too fraught with danger for drivers or subsystems to be just
>> doing this in the simplest way that isn't actually that smart.
>
> Hmm. I see even the "Don't error on prefaults" check was broken at some
> point :/.
>
> There have been numerous ways to try to address this.
>
> remap_pfn_range() was tried last, at least in the context of the i915
> driver, IIRC by Christoph Hellwig, but had to be ripped out since it
> requires the mmap_lock in write mode. Here we have it only in read
> mode.
>
> Then there's the apply_to_page_range() used by the igfx functionality
> of the i915 driver. I don't think we should go that route without
> turning it into something like vm_insert_pfns() with proper checking.
> This approach populates all entries of a buffer object.
>
> Finally there's the huge fault attempt that had to be ripped out due to
> the lack of pmd_special and pud_special flags and the resulting clashes
> with gup_fast.
>
> Perhaps a combination of the latter two, if properly implemented, would
> be the best choice.

I'm not deep enough into the memory management background to judge
which approach is the best, just one more data point to provide:

The pre-faulting was increased because of virtualization. When KVM/Xen
maps a BO into a guest, the switching overhead for each fault is so
high that mapping a lot of PFNs at the same time becomes beneficial.

Regards,
Christian.

> /Thomas
>
>> Dave.
On Mon, 2 Sept 2024 at 03:34, Christian König <christian.koenig@amd.com> wrote:
>
> On 02.09.24 11:32, Thomas Hellström wrote:
> >
> > remap_pfn_range() was tried last, at least in the context of the i915
> > driver, IIRC by Christoph Hellwig, but had to be ripped out since it
> > requires the mmap_lock in write mode. Here we have it only in read
> > mode.

Yeah, that does make things much more fragile.

The "fill in multiple pages" helpers are really meant to be used at
mmap() time, not as some kind of fault-around.

So remap_pfn_range() sets the vma flags etc., but it also depends on
being the only one to fill in the ptes, and will be *very* unhappy if
it finds something that has already been filled in.

And fault-around is *much* more fragile because unlike the mmap-time
case it can happen concurrently with other faults, and you cannot make
assumptions about the existing page table layout etc.

> The pre-faulting was increased because of virtualization. When KVM/Xen
> maps a BO into a guest, the switching overhead for each fault is so
> high that mapping a lot of PFNs at the same time becomes beneficial.

Honestly, from a pure performance standpoint, if you can just do all
the page mapping at mmap() time, that would be *much* much better.

Then you can easily use that remap_pfn_range() function, and latencies
are much less of an issue in general (not because it would be any
faster, but simply because people don't tend to use mmap() in
latency-critical code - but taking a page fault is *easily* very
critical indeed).

              Linus
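The mmap()-time approach Linus describes would look roughly as follows. This is a hedged kernel-side sketch (not buildable stand-alone); `struct my_bo`, its fields, and `my_mmap` are hypothetical names, and it assumes the BO's backing pfns are physically contiguous and stay put for the lifetime of the mapping:

```c
#include <linux/fs.h>
#include <linux/mm.h>

struct my_bo {
	unsigned long base_pfn;	/* first pfn of contiguous backing store */
	unsigned long size;	/* in bytes */
};

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_bo *bo = filp->private_data;	/* hypothetical lookup */
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > bo->size)
		return -EINVAL;

	/* Here the caller holds mmap_lock for write and no ptes exist yet,
	 * which is exactly the situation remap_pfn_range() is designed for. */
	return remap_pfn_range(vma, vma->vm_start, bo->base_pfn,
			       size, vma->vm_page_prot);
}
```

As the next message in the thread points out, this only works if the object can never be evicted or migrated after mmap() time, which is not the case for TTM-managed memory.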
On Mon, 2024-09-02 at 12:33 +0200, Christian König wrote:
> On 02.09.24 11:32, Thomas Hellström wrote:
> > On Mon, 2024-09-02 at 08:13 +1000, Dave Airlie wrote:
> > > On Fri, 30 Aug 2024 at 12:32, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > > On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@gmail.com> wrote:
> > > > > The TTM revert is due to some stuttering graphical apps probably due
> > > > > to longer stalls while prefaulting.
> > > > Yeah, trying to pre-fault a PMD worth of pages in one go is just
> > > > crazy talk.
> > > >
> > > > Now, if it was PMD-aligned and you faulted in a single PMD, that would
> > > > be different. But just doing prn_insert_page() in a loop is insane.
> > > >
> > > > The code doesn't even stop when it hits a page that already existed,
> > > > and it keeps locking and unlocking the last-level page table over and
> > > > over again.
> > > >
> > > > Honestly, that code is questionable even for the *small* value, much
> > > > less the "a PMD size" case.
> > > >
> > > > Now, if you have an array of 'struct page *', you can use
> > > > vm_insert_pages(), and that's reasonably efficient.
> > > >
> > > > And if you have a *contiguous* area of pfns, you can use remap_pfn_range().
> > > >
> > > > But that "insert one pfn at a time" that the drm layer does is
> > > > complete garbage. You're not speeding anything up, you're just digging
> > > > deeper.
> > >
> > > I wonder if there is functionality that could be provided in a common
> > > helper by the mm layers, or if there would be too many locking
> > > interactions to make it sane.
> > >
> > > It seems too fraught with danger for drivers or subsystems to be just
> > > doing this in the simplest way that isn't actually that smart.
> > Hmm. I see even the "Don't error on prefaults" check was broken at some
> > point :/.
> >
> > There have been numerous ways to try to address this.
> >
> > remap_pfn_range() was tried last, at least in the context of the i915
> > driver, IIRC by Christoph Hellwig, but had to be ripped out since it
> > requires the mmap_lock in write mode. Here we have it only in read
> > mode.
> >
> > Then there's the apply_to_page_range() used by the igfx functionality
> > of the i915 driver. I don't think we should go that route without
> > turning it into something like vm_insert_pfns() with proper checking.
> > This approach populates all entries of a buffer object.
> >
> > Finally there's the huge fault attempt that had to be ripped out due to
> > the lack of pmd_special and pud_special flags and the resulting clashes
> > with gup_fast.
> >
> > Perhaps a combination of the latter two, if properly implemented, would
> > be the best choice.
>
> I'm not deep enough into the memory management background to judge
> which approach is the best, just one more data point to provide:
>
> The pre-faulting was increased because of virtualization. When KVM/Xen
> maps a BO into a guest, the switching overhead for each fault is so
> high that mapping a lot of PFNs at the same time becomes beneficial.

Since populating at mmap() time is not possible due to eviction /
migration, perhaps one way would be to use madvise() to toggle the
prefaulting size? MADV_RANDOM vs MADV_SEQUENTIAL vs MADV_NORMAL.

/Thomas

> Regards,
> Christian.
>
> > /Thomas
> >
> > > Dave.