Message ID: 20220606202109.1306034-1-ankur.a.arora@oracle.com
Series: huge page clearing optimizations
On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
> 1. extends the clear_page() machinery to also handle extents larger
>    than a single page.
> 2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch
series.

That said, I have a few complaints about the individual patches even in
this form, to the point where I think the whole series is nasty:

 - get rid of 3/21 entirely. It's wrong in every possible way:

   (a) That shouldn't be an inline function in a header file at all. If
       you're clearing several pages of data, that just shouldn't be an
       inline function.

   (b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people make
       up those idiotic pointless names.

       If you have to use a #ifdef, just use the name of the function
       that the architecture overrides, not some other new name.

       But you don't need it at all, because

   (c) Just make a __weak function called clear_user_highpages() in
       mm/highmem.c, and allow architectures to just create their own
       non-weak ones.

 - patch 4/21 and 5/21: can we instead just get rid of that silly
   "process_huge_page()" thing entirely.

   It's disgusting, and it's a big part of why 'rep movs/stos' cannot
   work efficiently. It also makes NO SENSE if you then use
   non-temporal accesses.

   So instead of doubling down on the craziness of that function, just
   get rid of it entirely. There are two users, and they want to clear
   a hugepage and copy it respectively. Don't make it harder than it
   is.

   *Maybe* the code wants to do a "prefetch" afterwards. Who knows.

   But I really think you should do the crapectomy first, make the code
   simpler and more straightforward, and just allow architectures to
   override the *simple* "copy or clear a large page" rather than keep
   feeding this butt-ugly monstrosity.

 - 13/21: see 3/21.

 - 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
   of the crazy process_huge_page() abstraction, and just let
   architectures do their own clear/copy huge pages, *all* this
   craziness goes away. Those "when to use which type of clear/copy"
   becomes a *local* question, no silly
   arch_clear_page_non_caching_threshold() garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it might look more like our exception handling thing. So I really think that crapectomy should be the first thing you do, and that should be that first part of "extends the clear_page() machinery to also handle extents larger than a single page" Linus
[ Fixed email for Joao Martins. ]

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:

[snip]

> So I really don't like this series. A *lot* of it comes from that
> horrible process_huge_page() model, and the whole model is just wrong
> and pointless. You're literally trying to fix the mess that that
> function is, but you're keeping the fundamental problem around.
>
> The whole *point* of your patch-set is to use non-temporal stores,
> which makes all the process_huge_page() things entirely pointless, and
> only complicates things.
>
> And even if we don't use non-temporal stores, that process_huge_page()
> thing makes for trouble for any "rep stos/movs" implementation that
> might actually do a better job if it was just chunked up in bigger
> chunks.

This makes sense to me. There is a lot of unnecessary machinery around
process_huge_page() and this series adds more of it.

For highmem and page-at-a-time archs we would need to keep some of the
same optimizations (via the common clear/copy_user_highpages().) Still
that rids the arch code of pointless constraints, as you say below.

> Yes, yes, you probably still want to chunk that up somewhat due to
> latency reasons, but even then architectures might as well just make
> their own decisions, rather than have the core mm code make one
> clearly bad decision for them. Maybe chunking it up in bigger chunks
> than one page.

Right. Or doing the whole contiguous area in one or a few chunks, and
then touching the faulting cachelines towards the end.

> Maybe an architecture could do even more radical things like "let's
> just 'rep stos' for the whole area, but set a special thread flag that
> causes the interrupt return to break it up on return to kernel space".
> IOW, the "latency fix" might not even be about chunking it up, it
> might look more like our exception handling thing.

When I was thinking about this earlier, I had a vague inkling of
setting a thread flag and deferring writes to the last few cachelines
to just before returning to user-space. Can you elaborate a little
about what you are describing above?

> So I really think that crapectomy should be the first thing you do,
> and that should be that first part of "extends the clear_page()
> machinery to also handle extents larger than a single page"

Ack that. And, thanks for the detailed comments.

--
ankur
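[ A sketch of the "one or a few chunks, faulting cachelines last" idea
  above, assuming a multi-page clear_pages() helper and an arbitrary
  chunk size; the names and numbers are illustrative, not from the
  series: ]

	#define CLEAR_CHUNK_PAGES	256	/* 1MB chunks, a made-up tunable */

	static void clear_contig_area(void *addr, unsigned long npages,
				      unsigned long addr_hint)
	{
		unsigned long i, n;

		for (i = 0; i < npages; i += n) {
			n = min(npages - i, (unsigned long)CLEAR_CHUNK_PAGES);
			clear_pages(addr + i * PAGE_SIZE, n);
			cond_resched();	/* latency: preemption point between chunks */
		}
		/* touch the faulting cacheline at the very end */
		prefetch((void *)addr_hint);
	}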
On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make the
existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'.

At that point, any architecture can just say "I will implement my own
versions of these two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around at least for x86. The source of
that abomination is powerpc, and I do not think that whole issue with
MAX_ORDER_NR_PAGES makes any difference on x86, at least.

It most definitely makes no sense when there are no highmem issues, and
all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the new
case just to x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks,
> and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.

> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag that
> > causes the interrupt return to break it up on return to kernel space".
> > IOW, the "latency fix" might not even be about chunking it up, it
> > might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and deferring writes to the last few cachelines
> to just before returning to user-space.
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

 (a) that page chunking for highmem accesses

 (b) the page access _ordering_ for the cache hinting reasons

 (c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

 (a) HIGHMEM doesn't exist on sane architectures that we care about
     these days

 (b) the cache hinting ordering makes no sense if you do non-temporal
     accesses (and might then be replaced by a possible "prefetch" at
     the end)

 (c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was that
PREEMPT_NONE case: we would probably still want to chunk things up for
latency reasons and do that "cond_resched()" in between chunks.
Now, there are alternatives here:

 (a) only override that existing disgusting (but tested) function when
     both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

 (b) do something like this:

	void clear_huge_page(struct page *page,
			     unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);
	#ifdef CONFIG_PREEMPT_NONE
		for (int i = 0; i < pages_per_huge_page; i++) {
			clear_page(addr + i * PAGE_SIZE);
			cond_preempt();
		}
	#else
		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
		prefetch(addr_hint);
	#endif
	}

 or (c), do that "more radical approach", where you do something like this:

	void clear_huge_page(struct page *page,
			     unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);

		set_thread_flag(TIF_PREEMPT_ME);
		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
		clear_thread_flag(TIF_PREEMPT_ME);
		prefetch(addr_hint);
	}

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

             Linus
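[ The other half of option (c) would sit in the interrupt-return path.
  TIF_PREEMPT_ME does not exist; this is a handwavy sketch modeled on
  the existing irqentry_exit_cond_resched() logic: ]

	/* on interrupt return to kernel mode */
	void irqentry_exit_cond_resched(void)
	{
		if (!preempt_count()) {
			/* hypothetical: preempt even on a non-preempt kernel
			 * if the task asked for it via TIF_PREEMPT_ME */
			if (need_resched() || test_thread_flag(TIF_PREEMPT_ME))
				preempt_schedule_irq();
		}
	}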
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle
MAX_SECTION_BITS? An arch-specific clear_huge_page() could, however,
handle 1GB pages via some kind of static loop around
(30 - MAX_SECTION_BITS).

I'm a little fuzzy on CONFIG_SPARSEMEM_EXTREME, and !SPARSEMEM_VMEMMAP
configs. But, I think we should be able to not look up pfn_to_page(),
page_to_pfn() at all, or at least not in the inner loop.

> It most definitely makes no sense when there are no highmem issues,
> and all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and deferring writes to the last few cachelines
>> to just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
> (a) that page chunking for highmem accesses
>
> (b) the page access _ordering_ for the cache hinting reasons
>
> (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
> (a) HIGHMEM doesn't exist on sane architectures that we care about
>     these days
>
> (b) the cache hinting ordering makes no sense if you do non-temporal
>     accesses (and might then be replaced by a possible "prefetch" at
>     the end)
>
> (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
> (a) only override that existing disgusting (but tested) function when
>     both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
> (b) do something like this:
>
>	void clear_huge_page(struct page *page,
>			     unsigned long addr_hint,
>			     unsigned int pages_per_huge_page)
>	{
>		void *addr = page_address(page);
>	#ifdef CONFIG_PREEMPT_NONE
>		for (int i = 0; i < pages_per_huge_page; i++) {
>			clear_page(addr + i * PAGE_SIZE);
>			cond_preempt();
>		}
>	#else
>		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>		prefetch(addr_hint);
>	#endif
>	}

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY as
well, right?

Either way, as you said earlier, we could chunk up in bigger units than
a single page. (In the numbers I had posted earlier, chunking in units
of up to 1MB gave ~25% higher clearing BW. I don't think the microcode
setup costs are that high, but I don't have a good explanation why.)

> or (c), do that "more radical approach", where you do something like this:
>
>	void clear_huge_page(struct page *page,
>			     unsigned long addr_hint,
>			     unsigned int pages_per_huge_page)
>	{
>		void *addr = page_address(page);
>
>		set_thread_flag(TIF_PREEMPT_ME);
>		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>		clear_thread_flag(TIF_PREEMPT_ME);
>		prefetch(addr_hint);
>	}
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().
Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size). So in the page-fault path
it might make sense to use the temporal path (except for gigantic
pages.) In the prefault path, nontemporal might be better.

Thanks

--
ankur
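[ The hinted temporal/nontemporal choice above might boil down to a
  size check against the LLC. On x86 the L3 size is reported in KB in
  boot_cpu_data; everything else here is illustrative: ]

	/* non-temporal stores only win once the extent can't fit in the
	 * last-level cache anyway */
	static bool want_nontemporal_clear(unsigned int pages_per_huge_page)
	{
		unsigned long extent = (unsigned long)pages_per_huge_page * PAGE_SIZE;
		unsigned long llc_size = boot_cpu_data.x86_cache_size * 1024UL;

		return extent > llc_size;
	}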
On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> But, even on x86, AFAICT gigantic pages could straddle
> MAX_SECTION_BITS? An arch-specific clear_huge_page() could, however,
> handle 1GB pages via some kind of static loop around
> (30 - MAX_SECTION_BITS).

Even if gigantic pages straddle that area, it simply shouldn't matter.

The only reason that MAX_SECTION_BITS matters is for the 'struct page *'
lookup. And the only reason for *that* is because of HIGHMEM.

So it's all entirely silly and pointless on any sane architecture, I
think.

> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
> as well, right?

Ahh, yes. I should have looked at the code, and not just gone by my
"PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we split
that up.

> Just one minor point -- seems to me that the choice of nontemporal or
> temporal might have to be based on a hint to clear_huge_page().

Quite possibly. But I'd prefer that as a separate "look, this improves
numbers by X%" thing from the whole "let's make the clear_huge_page()
interface at least sane".

             Linus
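[ A sketch of the point above: with no HIGHMEM, the gigantic page is
  virtually contiguous in the direct map, so a single page_address()
  suffices and no per-page 'struct page' lookups are needed.
  clear_pages() is again an assumed multi-page clear helper: ]

	/* x86-64: the direct map covers the whole 1GB extent */
	void clear_gigantic_page(struct page *page, unsigned long addr_hint,
				 unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);

		clear_pages(addr, pages_per_huge_page);
		prefetch((void *)addr_hint);
	}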
On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Oh, argh, I meant to delete mem_map_next(), and forgot.

If you need to use struct page (a later message hints you don't), just
use nth_page() directly. I optimised it so it's not painful except on
SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit
659508f9c936. And nobody cares about performance on
SPARSEMEM && !SPARSEMEM_VMEMMAP systems.
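[ For the legacy path that does still need 'struct page', the
  nth_page() form replaces the mem_map_next() dance; a sketch of the
  fallback loop, with a hypothetical function name: ]

	static void clear_gigantic_page_fallback(struct page *page,
						 unsigned long addr,
						 unsigned int pages_per_huge_page)
	{
		unsigned int i;

		for (i = 0; i < pages_per_huge_page; i++) {
			struct page *p = nth_page(page, i);	/* not mem_map_next() */

			cond_resched();
			clear_user_highpage(p, addr + i * PAGE_SIZE);
		}
	}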
On Wed, Jun 08, 2022 at 08:49:57PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> > I worry a bit about the insanity of the "gigantic" pages, and the
> > mem_map_next() games it plays, but that code is from 2008 and I really
> > doubt it makes any sense to keep around at least for x86. The source
> > of that abomination is powerpc, and I do not think that whole issue
> > with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
>
> Oh, argh, I meant to delete mem_map_next(), and forgot.
>
> If you need to use struct page (a later message hints you don't), just
> use nth_page() directly. I optimised it so it's not painful except on
> SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit
> 659508f9c936. And nobody cares about performance on
> SPARSEMEM && !SPARSEMEM_VMEMMAP systems.

Oops, wrong commit. I meant 1cfcee728391 from June 2021.
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> But, even on x86, AFAICT gigantic pages could straddle
>> MAX_SECTION_BITS? An arch-specific clear_huge_page() could, however,
>> handle 1GB pages via some kind of static loop around
>> (30 - MAX_SECTION_BITS).
>
> Even if gigantic pages straddle that area, it simply shouldn't matter.
>
> The only reason that MAX_SECTION_BITS matters is for the 'struct page *'
> lookup. And the only reason for *that* is because of HIGHMEM.
>
> So it's all entirely silly and pointless on any sane architecture, I
> think.
>
>> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
>> as well, right?
>
> Ahh, yes. I should have looked at the code, and not just gone by my
> "PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
> split that up.
>
>> Just one minor point -- seems to me that the choice of nontemporal or
>> temporal might have to be based on a hint to clear_huge_page().
>
> Quite possibly. But I'd prefer that as a separate "look, this
> improves numbers by X%" thing from the whole "let's make the
> clear_huge_page() interface at least sane".

Makes sense to me.

--
ankur