Message ID | 20241011102445.934409-3-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: don't install PMD mappings when THPs are disabled by the hw/process/vma | expand |
On 11/10/2024 11:24, David Hildenbrand wrote: > We (or rather, readahead logic :) ) might be allocating a THP in the > pagecache and then try mapping it into a process that explicitly disabled > THP: we might end up installing PMD mappings. > > This is a problem for s390x KVM, which explicitly remaps all PMD-mapped > THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before > starting the VM. > > For example, starting a VM backed on a file system with large folios > supported makes the VM crash when the VM tries accessing such a mapping > using KVM. > > Is it also a problem when the HW disabled THP using > TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case > without X86_FEATURE_PSE. > > In the future, we might be able to do better on s390x and only disallow > PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED > really wants. For now, fix it by essentially performing the same check as > would be done in __thp_vma_allowable_orders() or in shmem code, where this > works as expected, and disallow PMD mappings, making us fallback to PTE > mappings. > > Reported-by: Leo Fu <bfu@redhat.com> > Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Will this patch be difficult to backport given it depends on the previous patch and that doesn't have a Fixes tag? > Cc: Thomas Huth <thuth@redhat.com> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Janosch Frank <frankja@linux.ibm.com> > Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > mm/memory.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index 2366578015ad..a2e501489517 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) > pmd_t entry; > vm_fault_t ret = VM_FAULT_FALLBACK; > > + /* > + * It is too late to allocate a small folio, we already have a large > + * folio in the pagecache: especially s390 KVM cannot tolerate any > + * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any > + * PMD mappings if THPs are disabled. > + */ > + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags)) > + return ret; Why not just call thp_vma_allowable_orders()? > + > if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER)) > return ret; >
On 11.10.24 13:29, Ryan Roberts wrote: > On 11/10/2024 11:24, David Hildenbrand wrote: >> We (or rather, readahead logic :) ) might be allocating a THP in the >> pagecache and then try mapping it into a process that explicitly disabled >> THP: we might end up installing PMD mappings. >> >> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped >> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before >> starting the VM. >> >> For example, starting a VM backed on a file system with large folios >> supported makes the VM crash when the VM tries accessing such a mapping >> using KVM. >> >> Is it also a problem when the HW disabled THP using >> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case >> without X86_FEATURE_PSE. >> >> In the future, we might be able to do better on s390x and only disallow >> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED >> really wants. For now, fix it by essentially performing the same check as >> would be done in __thp_vma_allowable_orders() or in shmem code, where this >> works as expected, and disallow PMD mappings, making us fallback to PTE >> mappings. >> >> Reported-by: Leo Fu <bfu@redhat.com> >> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") > > Will this patch be difficult to backport given it depends on the previous patch > and that doesn't have a Fixes tag? "difficult" -- not really. Andrew might want to tag patch #1 with "Fixes:" as well, but I can also send simple stable backports that avoid patch #1. (Thinking again, I assume we want to Cc:stable) > >> Cc: Thomas Huth <thuth@redhat.com> >> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> >> Cc: Ryan Roberts <ryan.roberts@arm.com> >> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> >> Cc: Janosch Frank <frankja@linux.ibm.com> >> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> mm/memory.c | 9 +++++++++ >> 1 file changed, 9 insertions(+) >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 2366578015ad..a2e501489517 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) >> pmd_t entry; >> vm_fault_t ret = VM_FAULT_FALLBACK; >> >> + /* >> + * It is too late to allocate a small folio, we already have a large >> + * folio in the pagecache: especially s390 KVM cannot tolerate any >> + * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any >> + * PMD mappings if THPs are disabled. >> + */ >> + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags)) >> + return ret; > > Why not just call thp_vma_allowable_orders()? Why call thp_vma_allowable_orders() that does a lot more work that doesn't really apply here? :) I'd say, just like shmem, we handle this separately here.
On 11/10/2024 12:33, David Hildenbrand wrote: > On 11.10.24 13:29, Ryan Roberts wrote: >> On 11/10/2024 11:24, David Hildenbrand wrote: >>> We (or rather, readahead logic :) ) might be allocating a THP in the >>> pagecache and then try mapping it into a process that explicitly disabled >>> THP: we might end up installing PMD mappings. >>> >>> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped >>> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before >>> starting the VM. >>> >>> For example, starting a VM backed on a file system with large folios >>> supported makes the VM crash when the VM tries accessing such a mapping >>> using KVM. >>> >>> Is it also a problem when the HW disabled THP using >>> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case >>> without X86_FEATURE_PSE. >>> >>> In the future, we might be able to do better on s390x and only disallow >>> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED >>> really wants. For now, fix it by essentially performing the same check as >>> would be done in __thp_vma_allowable_orders() or in shmem code, where this >>> works as expected, and disallow PMD mappings, making us fallback to PTE >>> mappings. >>> >>> Reported-by: Leo Fu <bfu@redhat.com> >>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") >> >> Will this patch be difficult to backport given it depends on the previous patch >> and that doesn't have a Fixes tag? > > "difficult" -- not really. Andrew might want to tag patch #1 with "Fixes:" as > well, but I can also send simple stable backports that avoid patch #1. > > (Thinking again, I assume we want to Cc:stable) > >> >>> Cc: Thomas Huth <thuth@redhat.com> >>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> >>> Cc: Janosch Frank <frankja@linux.ibm.com> >>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> --- >>> mm/memory.c | 9 +++++++++ >>> 1 file changed, 9 insertions(+) >>> >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 2366578015ad..a2e501489517 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct >>> page *page) >>> pmd_t entry; >>> vm_fault_t ret = VM_FAULT_FALLBACK; >>> + /* >>> + * It is too late to allocate a small folio, we already have a large >>> + * folio in the pagecache: especially s390 KVM cannot tolerate any >>> + * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any >>> + * PMD mappings if THPs are disabled. >>> + */ >>> + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags)) >>> + return ret; >> >> Why not just call thp_vma_allowable_orders()? > > Why call thp_vma_allowable_orders() that does a lot more work that doesn't > really apply here? :) Yeah fair enough, I was just thinking it makes the code simpler to keep all the checks in one place. But no strong opinion. Either way: Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> > > I'd say, just like shmem, we handle this separately here. >
On 11.10.24 13:36, Ryan Roberts wrote: > On 11/10/2024 12:33, David Hildenbrand wrote: >> On 11.10.24 13:29, Ryan Roberts wrote: >>> On 11/10/2024 11:24, David Hildenbrand wrote: >>>> We (or rather, readahead logic :) ) might be allocating a THP in the >>>> pagecache and then try mapping it into a process that explicitly disabled >>>> THP: we might end up installing PMD mappings. >>>> >>>> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped >>>> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before >>>> starting the VM. >>>> >>>> For example, starting a VM backed on a file system with large folios >>>> supported makes the VM crash when the VM tries accessing such a mapping >>>> using KVM. >>>> >>>> Is it also a problem when the HW disabled THP using >>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case >>>> without X86_FEATURE_PSE. >>>> >>>> In the future, we might be able to do better on s390x and only disallow >>>> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED >>>> really wants. For now, fix it by essentially performing the same check as >>>> would be done in __thp_vma_allowable_orders() or in shmem code, where this >>>> works as expected, and disallow PMD mappings, making us fallback to PTE >>>> mappings. >>>> >>>> Reported-by: Leo Fu <bfu@redhat.com> >>>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") >>> >>> Will this patch be difficult to backport given it depends on the previous patch >>> and that doesn't have a Fixes tag? >> >> "difficult" -- not really. Andrew might want to tag patch #1 with "Fixes:" as >> well, but I can also send simple stable backports that avoid patch #1. >> >> (Thinking again, I assume we want to Cc:stable) >> >>> >>>> Cc: Thomas Huth <thuth@redhat.com> >>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> >>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> >>>> Cc: Janosch Frank <frankja@linux.ibm.com> >>>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> --- >>>> mm/memory.c | 9 +++++++++ >>>> 1 file changed, 9 insertions(+) >>>> >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 2366578015ad..a2e501489517 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct >>>> page *page) >>>> pmd_t entry; >>>> vm_fault_t ret = VM_FAULT_FALLBACK; >>>> + /* >>>> + * It is too late to allocate a small folio, we already have a large >>>> + * folio in the pagecache: especially s390 KVM cannot tolerate any >>>> + * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any >>>> + * PMD mappings if THPs are disabled. >>>> + */ >>>> + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags)) >>>> + return ret; >>> >>> Why not just call thp_vma_allowable_orders()? >> >> Why call thp_vma_allowable_orders() that does a lot more work that doesn't >> really apply here? :) > > Yeah fair enough, I was just thinking it makes the code simpler to keep all the > checks in one place. But no strong opinion. > > Either way: > > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Thanks! Also, I decided to not use "thp_vma_allowable_orders" because we are past the allocation phase (as indicated in the comment) and can really just change the way how we map the folio (PMD vs. PTE), not really *what* folio to use. Ideally, in the future we have a different way of just saying "no PMD mappings please", decoupling the mapping from the allocation granularity.
diff --git a/mm/memory.c b/mm/memory.c index 2366578015ad..a2e501489517 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) pmd_t entry; vm_fault_t ret = VM_FAULT_FALLBACK; + /* + * It is too late to allocate a small folio, we already have a large + * folio in the pagecache: especially s390 KVM cannot tolerate any + * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any + * PMD mappings if THPs are disabled. + */ + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags)) + return ret; + if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER)) return ret;
We (or rather, readahead logic :) ) might be allocating a THP in the pagecache and then try mapping it into a process that explicitly disabled THP: we might end up installing PMD mappings. This is a problem for s390x KVM, which explicitly remaps all PMD-mapped THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before starting the VM. For example, starting a VM backed on a file system with large folios supported makes the VM crash when the VM tries accessing such a mapping using KVM. Is it also a problem when the HW disabled THP using TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case without X86_FEATURE_PSE. In the future, we might be able to do better on s390x and only disallow PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED really wants. For now, fix it by essentially performing the same check as would be done in __thp_vma_allowable_orders() or in shmem code, where this works as expected, and disallow PMD mappings, making us fallback to PTE mappings. Reported-by: Leo Fu <bfu@redhat.com> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Cc: Thomas Huth <thuth@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/memory.c | 9 +++++++++ 1 file changed, 9 insertions(+)