Message ID: 7ad05bc9299de5d954fb21a2da57f46dd6ec59d0.1742960003.git.baolin.wang@linux.alibaba.com
State:      New
Series:     Fix mincore() tmpfs test failure

On Wed, Mar 26, 2025 at 11:38:11AM +0800, Baolin Wang wrote:
> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  		walk->action = ACTION_AGAIN;
>  		return 0;
>  	}
> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>  		pte_t pte = ptep_get(ptep);
>
> +		step = 1;
>  		/* We need to do cache lookup too for pte markers */
>  		if (pte_none_mostly(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
> -		else if (pte_present(pte))
> -			*vec = 1;
> -		else { /* pte is a swap entry */
> +		else if (pte_present(pte)) {
> +			if (pte_batch_hint(ptep, pte) > 1) {

AFAIU, you will only batch if CONT_PTE is set, but that is only true for
arm64, so we lose the ability to batch on e.g. x86 when we have
contiguous entries, right?

So why not have folio_pte_batch take care of it directly without
involving pte_batch_hint here?
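For context: pte_batch_hint() has a generic fallback that simply returns
1, and only arm64 (via its contpte support) overrides it, which is the
asymmetry Oscar is pointing at. The stub in include/linux/pgtable.h is
essentially:

	#ifndef pte_batch_hint
	/*
	 * May be overridden by the architecture; by default the hint is
	 * always 1, i.e. the caller cannot assume any contiguity beyond
	 * the current PTE.
	 */
	static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
	{
		return 1;
	}
	#endif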
On 2025/3/27 18:49, Oscar Salvador wrote:
> On Wed, Mar 26, 2025 at 11:38:11AM +0800, Baolin Wang wrote:
>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  		walk->action = ACTION_AGAIN;
>>  		return 0;
>>  	}
>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>  		pte_t pte = ptep_get(ptep);
>>
>> +		step = 1;
>>  		/* We need to do cache lookup too for pte markers */
>>  		if (pte_none_mostly(pte))
>>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>  						 vma, vec);
>> -		else if (pte_present(pte))
>> -			*vec = 1;
>> -		else { /* pte is a swap entry */
>> +		else if (pte_present(pte)) {
>> +			if (pte_batch_hint(ptep, pte) > 1) {
>
> AFAIU, you will only batch if CONT_PTE is set, but that is only true for
> arm64, so we lose the ability to batch on e.g. x86 when we have
> contiguous entries, right?
>
> So why not have folio_pte_batch take care of it directly without
> involving pte_batch_hint here?

Good question, this was the first approach I tried. However, I found an
obvious performance regression with small folios (where CONT_PTE is not
set). I think the overhead introduced by vm_normal_folio() and
folio_pte_batch() is greater than the optimization gained from batch
processing small folios.

For large folios where CONT_PTE is set, ptep_get()--->contpte_ptep_get()
wastes a significant amount of CPU time, so using folio_pte_batch() can
improve the performance significantly.
On 25/03/2025 23:38, Baolin Wang wrote:
> When I tested the mincore() syscall, I observed that it takes longer
> with 64K mTHP enabled on my Arm64 server. The reason is that
> mincore_pte_range() still checks each PTE individually, even when the
> PTEs are contiguous, which is not efficient.
>
> Thus we can use folio_pte_batch() to get the number of contiguous
> present PTEs as a batch, which can improve the performance. I tested
> the mincore() syscall with 1G anonymous memory populated with 64K mTHP,
> and observed an obvious performance improvement:
>
>   w/o patch      w/ patch     changes
>   6022us         1115us       +81%
>
> Moreover, I also tested mincore() with mTHP/THP disabled, and did not
> see any obvious regression.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/mincore.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 832f29f46767..88be180b5550 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,6 +21,7 @@
>
>  #include <linux/uaccess.h>
>  #include "swap.h"
> +#include "internal.h"
>
>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>  			unsigned long end, struct mm_walk *walk)
> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *ptep;
>  	unsigned char *vec = walk->private;
>  	int nr = (end - addr) >> PAGE_SHIFT;
> +	int step, i;
>
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  		walk->action = ACTION_AGAIN;
>  		return 0;
>  	}
> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>  		pte_t pte = ptep_get(ptep);
>
> +		step = 1;
>  		/* We need to do cache lookup too for pte markers */
>  		if (pte_none_mostly(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
> -		else if (pte_present(pte))
> -			*vec = 1;
> -		else { /* pte is a swap entry */
> +		else if (pte_present(pte)) {
> +			if (pte_batch_hint(ptep, pte) > 1) {
> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
> +
> +				if (folio && folio_test_large(folio)) {
> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +						FPB_IGNORE_SOFT_DIRTY;
> +					int max_nr = (end - addr) / PAGE_SIZE;
> +
> +					step = folio_pte_batch(folio, addr, ptep, pte,
> +							max_nr, fpb_flags, NULL, NULL, NULL);
> +				}
> +			}

You could simplify to the following, I think, to avoid needing to grab
the folio and call folio_pte_batch():

	else if (pte_present(pte)) {
		int max_nr = (end - addr) / PAGE_SIZE;

		step = min(pte_batch_hint(ptep, pte), max_nr);
	} ...

I expect the regression you are seeing here is all due to calling
ptep_get() for every pte in the contpte batch, which will cause 16
memory reads per pte (to gather the access/dirty bits). For small folios
it's just 1 read per pte. pte_batch_hint() will skip forward in blocks
of 16, so you now end up with the same number as for the small folio
case. You don't need all the fancy extras that folio_pte_batch() gives
you here.

Thanks,
Ryan

> +
> +			for (i = 0; i < step; i++)
> +				vec[i] = 1;
> +		} else { /* pte is a swap entry */
>  			swp_entry_t entry = pte_to_swp_entry(pte);
>
>  			if (non_swap_entry(entry)) {
> @@ -146,7 +163,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  #endif
>  			}
>  		}
> -		vec++;
> +		vec += step;
>  	}
>  	pte_unmap_unlock(ptep - 1, ptl);
>  out:
On Thu, Mar 27, 2025 at 10:08:56AM -0400, Ryan Roberts wrote:
> You could simplify to the following, I think, to avoid needing to grab
> the folio and call folio_pte_batch():
>
> 	else if (pte_present(pte)) {
> 		int max_nr = (end - addr) / PAGE_SIZE;
>
> 		step = min(pte_batch_hint(ptep, pte), max_nr);
> 	} ...

Yes, I think this makes much more sense; in the end, as you said, we do
not really need what folio_pte_batch gives us here.

With the API I am working on, batching will be done internally, so we
will not have to expose this here.
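Putting Ryan's suggestion into the loop, the present-PTE branch of
mincore_pte_range() would end up looking roughly like this (a sketch
assembled from the hunks quoted above, not a posted patch):

	else if (pte_present(pte)) {
		int max_nr = (end - addr) / PAGE_SIZE;

		/*
		 * Cheap contiguity hint only: no vm_normal_folio() lookup
		 * and no folio_pte_batch() scan. On arm64 this advances a
		 * whole contpte block (16 PTEs with 4K base pages) per
		 * iteration; on other architectures the hint is 1 and the
		 * behaviour is unchanged.
		 */
		step = min(pte_batch_hint(ptep, pte), max_nr);
		for (i = 0; i < step; i++)
			vec[i] = 1;
	}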
On 2025/3/27 22:08, Ryan Roberts wrote:
> On 25/03/2025 23:38, Baolin Wang wrote:
>> When I tested the mincore() syscall, I observed that it takes longer
>> with 64K mTHP enabled on my Arm64 server. The reason is that
>> mincore_pte_range() still checks each PTE individually, even when the
>> PTEs are contiguous, which is not efficient.
>>
>> Thus we can use folio_pte_batch() to get the number of contiguous
>> present PTEs as a batch, which can improve the performance. I tested
>> the mincore() syscall with 1G anonymous memory populated with 64K
>> mTHP, and observed an obvious performance improvement:
>>
>>   w/o patch      w/ patch     changes
>>   6022us         1115us       +81%
>>
>> Moreover, I also tested mincore() with mTHP/THP disabled, and did not
>> see any obvious regression.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>  mm/mincore.c | 27 ++++++++++++++++++++++-----
>>  1 file changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 832f29f46767..88be180b5550 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,6 +21,7 @@
>>
>>  #include <linux/uaccess.h>
>>  #include "swap.h"
>> +#include "internal.h"
>>
>>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>  			unsigned long end, struct mm_walk *walk)
>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  	pte_t *ptep;
>>  	unsigned char *vec = walk->private;
>>  	int nr = (end - addr) >> PAGE_SHIFT;
>> +	int step, i;
>>
>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>  	if (ptl) {
>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  		walk->action = ACTION_AGAIN;
>>  		return 0;
>>  	}
>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>  		pte_t pte = ptep_get(ptep);
>>
>> +		step = 1;
>>  		/* We need to do cache lookup too for pte markers */
>>  		if (pte_none_mostly(pte))
>>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>  						 vma, vec);
>> -		else if (pte_present(pte))
>> -			*vec = 1;
>> -		else { /* pte is a swap entry */
>> +		else if (pte_present(pte)) {
>> +			if (pte_batch_hint(ptep, pte) > 1) {
>> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
>> +
>> +				if (folio && folio_test_large(folio)) {
>> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +						FPB_IGNORE_SOFT_DIRTY;
>> +					int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> +					step = folio_pte_batch(folio, addr, ptep, pte,
>> +							max_nr, fpb_flags, NULL, NULL, NULL);
>> +				}
>> +			}
>
> You could simplify to the following, I think, to avoid needing to grab
> the folio and call folio_pte_batch():
>
> 	else if (pte_present(pte)) {
> 		int max_nr = (end - addr) / PAGE_SIZE;
>
> 		step = min(pte_batch_hint(ptep, pte), max_nr);
> 	} ...
>
> I expect the regression you are seeing here is all due to calling
> ptep_get() for every pte in the contpte batch, which will cause 16
> memory reads per pte (to gather the access/dirty bits). For small
> folios it's just 1 read per pte.

Right.

> pte_batch_hint() will skip forward in blocks of 16, so you now end up
> with the same number as for the small folio case. You don't need all
> the fancy extras that folio_pte_batch() gives you here.

Sounds reasonable. Your suggestion looks simpler, but my method can
batch a whole large folio (such as a large folio containing more than
16 contiguous PTEs) at once. Anyway, let me do some performance
measurements for your suggestion. Thanks.
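To put numbers on the trade-off Baolin mentions (assuming arm64 with 4K
base pages, where a contpte block covers 16 PTEs): a 64K mTHP folio
spans exactly 16 PTEs, so both approaches handle it in a single step; a
1M mTHP folio spans 256 PTEs, which folio_pte_batch() can consume in one
call, while the pte_batch_hint() approach takes 16 trips around the
loop, one per contpte block.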
When I tested the mincore() syscall, I observed that it takes longer
with 64K mTHP enabled on my Arm64 server. The reason is that
mincore_pte_range() still checks each PTE individually, even when the
PTEs are contiguous, which is not efficient.

Thus we can use folio_pte_batch() to get the number of contiguous
present PTEs as a batch, which can improve the performance. I tested the
mincore() syscall with 1G anonymous memory populated with 64K mTHP, and
observed an obvious performance improvement:

  w/o patch      w/ patch     changes
  6022us         1115us       +81%

Moreover, I also tested mincore() with mTHP/THP disabled, and did not
see any obvious regression.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/mincore.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 832f29f46767..88be180b5550 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,6 +21,7 @@

 #include <linux/uaccess.h>
 #include "swap.h"
+#include "internal.h"

 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
@@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *ptep;
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	int step, i;

 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
-	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
 		pte_t pte = ptep_get(ptep);

+		step = 1;
 		/* We need to do cache lookup too for pte markers */
 		if (pte_none_mostly(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
-		else if (pte_present(pte))
-			*vec = 1;
-		else { /* pte is a swap entry */
+		else if (pte_present(pte)) {
+			if (pte_batch_hint(ptep, pte) > 1) {
+				struct folio *folio = vm_normal_folio(vma, addr, pte);
+
+				if (folio && folio_test_large(folio)) {
+					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+						FPB_IGNORE_SOFT_DIRTY;
+					int max_nr = (end - addr) / PAGE_SIZE;
+
+					step = folio_pte_batch(folio, addr, ptep, pte,
+							max_nr, fpb_flags, NULL, NULL, NULL);
+				}
+			}
+
+			for (i = 0; i < step; i++)
+				vec[i] = 1;
+		} else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			if (non_swap_entry(entry)) {
@@ -146,7 +163,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 #endif
 			}
 		}
-		vec++;
+		vec += step;
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
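For anyone wanting to reproduce the measurement from the commit message,
a minimal userspace harness along these lines should work (a sketch: the
1G size matches the setup described above, while the sysfs path for
enabling 64K mTHP is an assumption and may differ across kernel
versions):

/* mincore_bench.c - time mincore() over 1G of populated anon memory.
 * Assumes 64K mTHP was enabled beforehand, e.g. (path may vary):
 *   echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE (1UL << 30)	/* 1G, matching the commit message */

int main(void)
{
	char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, SIZE);	/* fault every page in */

	size_t pages = SIZE / sysconf(_SC_PAGESIZE);
	unsigned char *vec = malloc(pages);
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (mincore(buf, SIZE, vec)) {
		perror("mincore");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("mincore() took %ld us\n",
	       (long)(t1.tv_sec - t0.tv_sec) * 1000000 +
	       (t1.tv_nsec - t0.tv_nsec) / 1000);
	return 0;
}

Compile with cc -O2 mincore_bench.c -o mincore_bench and run it with and
without the patch applied.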