Message ID | 20240204093526.212636-1-zhangpeng362@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | filemap: avoid unnecessary major faults in filemap_fault() | expand |
Peng Zhang <zhangpeng362@huawei.com> writes: > From: ZhangPeng <zhangpeng362@huawei.com> > > The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) > in application, which leading to an unexpected performance issue[1]. > > This caused by temporarily cleared PTE during a read/modify/write update > of the PTE, eg, do_numa_page()/change_pte_range(). > > For the data segment of the user-mode program, the global variable area > is a private mapping. After the pagecache is loaded, the private anonymous > page is generated after the COW is triggered. Mlockall can lock COW pages > (anonymous pages), but the original file pages cannot be locked and may > be reclaimed. If the global variable (private anon page) is accessed when > vmf->pte is zeroed in numa fault, a file page fault will be triggered. > > At this time, the original private file page may have been reclaimed. > If the page cache is not available at this time, a major fault will be > triggered and the file will be read, causing additional overhead. > > Fix this by rechecking the PTE without acquiring PTL in filemap_fault() > before triggering a major fault. > > Testing file anonymous page read and write page fault performance in ext4 > and ramdisk using will-it-scale[2] on a x86 physical machine. The data > is the average change compared with the mainline after the patch is > applied. The test results are within the range of fluctuation, and there > is no obvious difference. The test results are as follows: > processes processes_idle threads threads_idle > ext4 file write: -1.14% -0.08% -1.87% 0.13% > ext4 file read: 0.03% -0.65% -0.51% -0.08% > ramdisk file write: -1.21% -0.21% -1.12% 0.11% > ramdisk file read: 0.00% -0.68% -0.33% -0.02% IIUC, this is the regression test results. Right? Can you also show improvement test results to justify the change? -- Best Regards, Huang, Ying > [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ > [2] https://github.com/antonblanchard/will-it-scale/ > > Suggested-by: "Huang, Ying" <ying.huang@intel.com> > Suggested-by: Yin Fengwei <fengwei.yin@intel.com> > Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> > --- > RFC->v1: > - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox > - Check the PTE without acquiring PTL in filemap_fault(), suggested by > Huang, Ying and Yin Fengwei > - Add pmd_none() check before PTE map > - Update commit message and add performance test information > > mm/filemap.c | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 142864338ca4..b29cdeb6a03b 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > mapping_locked = true; > } > } else { > + if (!pmd_none(*vmf->pmd)) { > + pte_t *ptep; > + > + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, > + vmf->address, &vmf->ptl); > + if (unlikely(!ptep)) > + return VM_FAULT_NOPAGE; > + /* > + * Recheck pte as the pte can be cleared temporarily > + * during a read/modify/write update. > + */ > + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) > + ret = VM_FAULT_NOPAGE; > + pte_unmap(ptep); > + if (unlikely(ret)) > + return ret; > + } > + > /* No page in the page cache at all */ > count_vm_event(PGMAJFAULT); > count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
Peng Zhang <zhangpeng362@huawei.com> writes: > From: ZhangPeng <zhangpeng362@huawei.com> > > The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) > in application, which leading to an unexpected performance issue[1]. > > This caused by temporarily cleared PTE during a read/modify/write update > of the PTE, eg, do_numa_page()/change_pte_range(). > > For the data segment of the user-mode program, the global variable area > is a private mapping. After the pagecache is loaded, the private anonymous > page is generated after the COW is triggered. Mlockall can lock COW pages > (anonymous pages), but the original file pages cannot be locked and may > be reclaimed. If the global variable (private anon page) is accessed when > vmf->pte is zeroed in numa fault, a file page fault will be triggered. > > At this time, the original private file page may have been reclaimed. > If the page cache is not available at this time, a major fault will be > triggered and the file will be read, causing additional overhead. > > Fix this by rechecking the PTE without acquiring PTL in filemap_fault() > before triggering a major fault. > > Testing file anonymous page read and write page fault performance in ext4 > and ramdisk using will-it-scale[2] on a x86 physical machine. The data > is the average change compared with the mainline after the patch is > applied. The test results are within the range of fluctuation, and there > is no obvious difference. The test results are as follows: > processes processes_idle threads threads_idle > ext4 file write: -1.14% -0.08% -1.87% 0.13% > ext4 file read: 0.03% -0.65% -0.51% -0.08% > ramdisk file write: -1.21% -0.21% -1.12% 0.11% > ramdisk file read: 0.00% -0.68% -0.33% -0.02% > > [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ > [2] https://github.com/antonblanchard/will-it-scale/ > > Suggested-by: "Huang, Ying" <ying.huang@intel.com> > Suggested-by: Yin Fengwei <fengwei.yin@intel.com> > Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> > --- > RFC->v1: > - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox > - Check the PTE without acquiring PTL in filemap_fault(), suggested by > Huang, Ying and Yin Fengwei > - Add pmd_none() check before PTE map > - Update commit message and add performance test information > > mm/filemap.c | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 142864338ca4..b29cdeb6a03b 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > mapping_locked = true; > } > } else { > + if (!pmd_none(*vmf->pmd)) { > + pte_t *ptep; > + > + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, > + vmf->address, &vmf->ptl); > + if (unlikely(!ptep)) > + return VM_FAULT_NOPAGE; > + /* > + * Recheck pte as the pte can be cleared temporarily > + * during a read/modify/write update. > + */ I think that we should add some comments here about the racy checking. -- Best Regards, Huang, Ying > + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) > + ret = VM_FAULT_NOPAGE; > + pte_unmap(ptep); > + if (unlikely(ret)) > + return ret; > + } > + > /* No page in the page cache at all */ > count_vm_event(PGMAJFAULT); > count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
On 2024/2/5 10:54, Huang, Ying wrote: > Peng Zhang <zhangpeng362@huawei.com> writes: >> From: ZhangPeng <zhangpeng362@huawei.com> >> >> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >> in application, which leading to an unexpected performance issue[1]. >> >> This caused by temporarily cleared PTE during a read/modify/write update >> of the PTE, eg, do_numa_page()/change_pte_range(). >> >> For the data segment of the user-mode program, the global variable area >> is a private mapping. After the pagecache is loaded, the private anonymous >> page is generated after the COW is triggered. Mlockall can lock COW pages >> (anonymous pages), but the original file pages cannot be locked and may >> be reclaimed. If the global variable (private anon page) is accessed when >> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >> >> At this time, the original private file page may have been reclaimed. >> If the page cache is not available at this time, a major fault will be >> triggered and the file will be read, causing additional overhead. >> >> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >> before triggering a major fault. >> >> Testing file anonymous page read and write page fault performance in ext4 >> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >> is the average change compared with the mainline after the patch is >> applied. The test results are within the range of fluctuation, and there >> is no obvious difference. The test results are as follows: >> processes processes_idle threads threads_idle >> ext4 file write: -1.14% -0.08% -1.87% 0.13% >> ext4 file read: 0.03% -0.65% -0.51% -0.08% >> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >> ramdisk file read: 0.00% -0.68% -0.33% -0.02% > IIUC, this is the regression test results. Right? Can you also show > improvement test results to justify the change? Sure, I'll add the improvement test results as follows: processes processes_idle threads threads_idle ext4 private file write: -1.14% -0.08% -1.87% 0.13% ext4 shared file write: 0.14% -0.53% 2.88% -0.77% ext4 private file read: 0.03% -0.65% -0.51% -0.08% tmpfs private file write: -0.34% -0.11% 0.20% 0.15% tmpfs shared file write: 0.96% 0.10% 2.78% -0.34% ramdisk private file write: -1.21% -0.21% -1.12% 0.11% ramdisk private file read: 0.00% -0.68% -0.33% -0.02% > > -- > Best Regards, > Huang, Ying > >> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >> [2] https://github.com/antonblanchard/will-it-scale/ >> >> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >> --- >> RFC->v1: >> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >> Huang, Ying and Yin Fengwei >> - Add pmd_none() check before PTE map >> - Update commit message and add performance test information >> >> mm/filemap.c | 18 ++++++++++++++++++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 142864338ca4..b29cdeb6a03b 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >> mapping_locked = true; >> } >> } else { >> + if (!pmd_none(*vmf->pmd)) { >> + pte_t *ptep; >> + >> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >> + vmf->address, &vmf->ptl); >> + if (unlikely(!ptep)) >> + return VM_FAULT_NOPAGE; >> + /* >> + * Recheck pte as the pte can be cleared temporarily >> + * during a read/modify/write update. >> + */ >> + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) >> + ret = VM_FAULT_NOPAGE; >> + pte_unmap(ptep); >> + if (unlikely(ret)) >> + return ret; >> + } >> + >> /* No page in the page cache at all */ >> count_vm_event(PGMAJFAULT); >> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
On 2024/2/5 10:56, Huang, Ying wrote: > Peng Zhang <zhangpeng362@huawei.com> writes: >> From: ZhangPeng <zhangpeng362@huawei.com> >> >> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >> in application, which leading to an unexpected performance issue[1]. >> >> This caused by temporarily cleared PTE during a read/modify/write update >> of the PTE, eg, do_numa_page()/change_pte_range(). >> >> For the data segment of the user-mode program, the global variable area >> is a private mapping. After the pagecache is loaded, the private anonymous >> page is generated after the COW is triggered. Mlockall can lock COW pages >> (anonymous pages), but the original file pages cannot be locked and may >> be reclaimed. If the global variable (private anon page) is accessed when >> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >> >> At this time, the original private file page may have been reclaimed. >> If the page cache is not available at this time, a major fault will be >> triggered and the file will be read, causing additional overhead. >> >> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >> before triggering a major fault. >> >> Testing file anonymous page read and write page fault performance in ext4 >> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >> is the average change compared with the mainline after the patch is >> applied. The test results are within the range of fluctuation, and there >> is no obvious difference. The test results are as follows: >> processes processes_idle threads threads_idle >> ext4 file write: -1.14% -0.08% -1.87% 0.13% >> ext4 file read: 0.03% -0.65% -0.51% -0.08% >> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >> >> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >> [2] https://github.com/antonblanchard/will-it-scale/ >> >> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >> --- >> RFC->v1: >> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >> Huang, Ying and Yin Fengwei >> - Add pmd_none() check before PTE map >> - Update commit message and add performance test information >> >> mm/filemap.c | 18 ++++++++++++++++++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 142864338ca4..b29cdeb6a03b 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >> mapping_locked = true; >> } >> } else { >> + if (!pmd_none(*vmf->pmd)) { >> + pte_t *ptep; >> + >> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >> + vmf->address, &vmf->ptl); >> + if (unlikely(!ptep)) >> + return VM_FAULT_NOPAGE; >> + /* >> + * Recheck pte as the pte can be cleared temporarily >> + * during a read/modify/write update. >> + */ > I think that we should add some comments here about the racy checking. I'll add comments in a v2 as follows: /* * Recheck PTE as the PTE can be cleared temporarily * during a read/modify/write update of the PTE, eg, * do_numa_page()/change_pte_range(). This will trigger * a major fault, even if we use mlockall, which may * affect performance. */ > > -- > Best Regards, > Huang, Ying > >> + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) >> + ret = VM_FAULT_NOPAGE; >> + pte_unmap(ptep); >> + if (unlikely(ret)) >> + return ret; >> + } >> + >> /* No page in the page cache at all */ >> count_vm_event(PGMAJFAULT); >> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
"zhangpeng (AS)" <zhangpeng362@huawei.com> writes: > On 2024/2/5 10:56, Huang, Ying wrote: > >> Peng Zhang <zhangpeng362@huawei.com> writes: >>> From: ZhangPeng <zhangpeng362@huawei.com> >>> >>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >>> in application, which leading to an unexpected performance issue[1]. >>> >>> This caused by temporarily cleared PTE during a read/modify/write update >>> of the PTE, eg, do_numa_page()/change_pte_range(). >>> >>> For the data segment of the user-mode program, the global variable area >>> is a private mapping. After the pagecache is loaded, the private anonymous >>> page is generated after the COW is triggered. Mlockall can lock COW pages >>> (anonymous pages), but the original file pages cannot be locked and may >>> be reclaimed. If the global variable (private anon page) is accessed when >>> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >>> >>> At this time, the original private file page may have been reclaimed. >>> If the page cache is not available at this time, a major fault will be >>> triggered and the file will be read, causing additional overhead. >>> >>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >>> before triggering a major fault. >>> >>> Testing file anonymous page read and write page fault performance in ext4 >>> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >>> is the average change compared with the mainline after the patch is >>> applied. The test results are within the range of fluctuation, and there >>> is no obvious difference. The test results are as follows: >>> processes processes_idle threads threads_idle >>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>> >>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>> [2] https://github.com/antonblanchard/will-it-scale/ >>> >>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>> --- >>> RFC->v1: >>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>> Huang, Ying and Yin Fengwei >>> - Add pmd_none() check before PTE map >>> - Update commit message and add performance test information >>> >>> mm/filemap.c | 18 ++++++++++++++++++ >>> 1 file changed, 18 insertions(+) >>> >>> diff --git a/mm/filemap.c b/mm/filemap.c >>> index 142864338ca4..b29cdeb6a03b 100644 >>> --- a/mm/filemap.c >>> +++ b/mm/filemap.c >>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >>> mapping_locked = true; >>> } >>> } else { >>> + if (!pmd_none(*vmf->pmd)) { >>> + pte_t *ptep; >>> + >>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>> + vmf->address, &vmf->ptl); >>> + if (unlikely(!ptep)) >>> + return VM_FAULT_NOPAGE; >>> + /* >>> + * Recheck pte as the pte can be cleared temporarily >>> + * during a read/modify/write update. >>> + */ >> I think that we should add some comments here about the racy checking. > > I'll add comments in a v2 as follows: > /* > * Recheck PTE as the PTE can be cleared temporarily > * during a read/modify/write update of the PTE, eg, > * do_numa_page()/change_pte_range(). This will trigger > * a major fault, even if we use mlockall, which may > * affect performance. > */ Sorry, my previous words aren't clear enough. I mean some comments as follows, We don't hold PTL here, so the check is still racy. But acquiring PTL hurts performance and the race window seems small enough. -- Best Regards, Huang, Ying >>> + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) >>> + ret = VM_FAULT_NOPAGE; >>> + pte_unmap(ptep); >>> + if (unlikely(ret)) >>> + return ret; >>> + } >>> + >>> /* No page in the page cache at all */ >>> count_vm_event(PGMAJFAULT); >>> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
On 2024/2/5 14:52, Huang, Ying wrote: > "zhangpeng (AS)" <zhangpeng362@huawei.com> writes: >> On 2024/2/5 10:56, Huang, Ying wrote: >>> Peng Zhang <zhangpeng362@huawei.com> writes: >>>> From: ZhangPeng <zhangpeng362@huawei.com> >>>> >>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >>>> in application, which leading to an unexpected performance issue[1]. >>>> >>>> This caused by temporarily cleared PTE during a read/modify/write update >>>> of the PTE, eg, do_numa_page()/change_pte_range(). >>>> >>>> For the data segment of the user-mode program, the global variable area >>>> is a private mapping. After the pagecache is loaded, the private anonymous >>>> page is generated after the COW is triggered. Mlockall can lock COW pages >>>> (anonymous pages), but the original file pages cannot be locked and may >>>> be reclaimed. If the global variable (private anon page) is accessed when >>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >>>> >>>> At this time, the original private file page may have been reclaimed. >>>> If the page cache is not available at this time, a major fault will be >>>> triggered and the file will be read, causing additional overhead. >>>> >>>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >>>> before triggering a major fault. >>>> >>>> Testing file anonymous page read and write page fault performance in ext4 >>>> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >>>> is the average change compared with the mainline after the patch is >>>> applied. The test results are within the range of fluctuation, and there >>>> is no obvious difference. The test results are as follows: >>>> processes processes_idle threads threads_idle >>>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>>> >>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>>> [2] https://github.com/antonblanchard/will-it-scale/ >>>> >>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>> --- >>>> RFC->v1: >>>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >>>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>>> Huang, Ying and Yin Fengwei >>>> - Add pmd_none() check before PTE map >>>> - Update commit message and add performance test information >>>> >>>> mm/filemap.c | 18 ++++++++++++++++++ >>>> 1 file changed, 18 insertions(+) >>>> >>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>> index 142864338ca4..b29cdeb6a03b 100644 >>>> --- a/mm/filemap.c >>>> +++ b/mm/filemap.c >>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >>>> mapping_locked = true; >>>> } >>>> } else { >>>> + if (!pmd_none(*vmf->pmd)) { >>>> + pte_t *ptep; >>>> + >>>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>>> + vmf->address, &vmf->ptl); >>>> + if (unlikely(!ptep)) >>>> + return VM_FAULT_NOPAGE; >>>> + /* >>>> + * Recheck pte as the pte can be cleared temporarily >>>> + * during a read/modify/write update. >>>> + */ >>> I think that we should add some comments here about the racy checking. >> I'll add comments in a v2 as follows: >> /* >> * Recheck PTE as the PTE can be cleared temporarily >> * during a read/modify/write update of the PTE, eg, >> * do_numa_page()/change_pte_range(). This will trigger >> * a major fault, even if we use mlockall, which may >> * affect performance. >> */ > Sorry, my previous words aren't clear enough. I mean some comments as > follows, > > We don't hold PTL here, so the check is still racy. But acquiring PTL > hurts performance and the race window seems small enough. Got it. I'll add comments in a v2 as follows: /* * Recheck PTE as the PTE can be cleared temporarily * during a read/modify/write update of the PTE. * We don't hold PTL here as acquiring PTL hurts * performance. So the check is still racy, but * the race window seems small enough. */ > > -- > Best Regards, > Huang, Ying > >>>> + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) >>>> + ret = VM_FAULT_NOPAGE; >>>> + pte_unmap(ptep); >>>> + if (unlikely(ret)) >>>> + return ret; >>>> + } >>>> + >>>> /* No page in the page cache at all */ >>>> count_vm_event(PGMAJFAULT); >>>> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
On 05.02.24 08:24, zhangpeng (AS) wrote: > On 2024/2/5 14:52, Huang, Ying wrote: > >> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes: >>> On 2024/2/5 10:56, Huang, Ying wrote: >>>> Peng Zhang <zhangpeng362@huawei.com> writes: >>>>> From: ZhangPeng <zhangpeng362@huawei.com> >>>>> >>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >>>>> in application, which leading to an unexpected performance issue[1]. >>>>> >>>>> This caused by temporarily cleared PTE during a read/modify/write update >>>>> of the PTE, eg, do_numa_page()/change_pte_range(). >>>>> >>>>> For the data segment of the user-mode program, the global variable area >>>>> is a private mapping. After the pagecache is loaded, the private anonymous >>>>> page is generated after the COW is triggered. Mlockall can lock COW pages >>>>> (anonymous pages), but the original file pages cannot be locked and may >>>>> be reclaimed. If the global variable (private anon page) is accessed when >>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >>>>> >>>>> At this time, the original private file page may have been reclaimed. >>>>> If the page cache is not available at this time, a major fault will be >>>>> triggered and the file will be read, causing additional overhead. >>>>> >>>>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >>>>> before triggering a major fault. >>>>> >>>>> Testing file anonymous page read and write page fault performance in ext4 >>>>> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >>>>> is the average change compared with the mainline after the patch is >>>>> applied. The test results are within the range of fluctuation, and there >>>>> is no obvious difference. The test results are as follows: >>>>> processes processes_idle threads threads_idle >>>>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>>>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>>>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>>>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>>>> >>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>>>> [2] https://github.com/antonblanchard/will-it-scale/ >>>>> >>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>>>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>> --- >>>>> RFC->v1: >>>>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >>>>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>>>> Huang, Ying and Yin Fengwei >>>>> - Add pmd_none() check before PTE map >>>>> - Update commit message and add performance test information >>>>> >>>>> mm/filemap.c | 18 ++++++++++++++++++ >>>>> 1 file changed, 18 insertions(+) >>>>> >>>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>>> index 142864338ca4..b29cdeb6a03b 100644 >>>>> --- a/mm/filemap.c >>>>> +++ b/mm/filemap.c >>>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >>>>> mapping_locked = true; >>>>> } >>>>> } else { >>>>> + if (!pmd_none(*vmf->pmd)) { >>>>> + pte_t *ptep; >>>>> + >>>>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>>>> + vmf->address, &vmf->ptl); >>>>> + if (unlikely(!ptep)) >>>>> + return VM_FAULT_NOPAGE; >>>>> + /* >>>>> + * Recheck pte as the pte can be cleared temporarily >>>>> + * during a read/modify/write update. >>>>> + */ >>>> I think that we should add some comments here about the racy checking. >>> I'll add comments in a v2 as follows: >>> /* >>> * Recheck PTE as the PTE can be cleared temporarily >>> * during a read/modify/write update of the PTE, eg, >>> * do_numa_page()/change_pte_range(). This will trigger >>> * a major fault, even if we use mlockall, which may >>> * affect performance. >>> */ >> Sorry, my previous words aren't clear enough. I mean some comments as >> follows, >> >> We don't hold PTL here, so the check is still racy. But acquiring PTL >> hurts performance and the race window seems small enough. > > Got it. I'll add comments in a v2 as follows: > /* > * Recheck PTE as the PTE can be cleared temporarily > * during a read/modify/write update of the PTE. > * We don't hold PTL here as acquiring PTL hurts > * performance. So the check is still racy, but > * the race window seems small enough. > */ It'd be worth spelling out what happens when we lose the race.
On 2024/2/5 15:31, David Hildenbrand wrote: > On 05.02.24 08:24, zhangpeng (AS) wrote: >> On 2024/2/5 14:52, Huang, Ying wrote: >> >>> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes: >>>> On 2024/2/5 10:56, Huang, Ying wrote: >>>>> Peng Zhang <zhangpeng362@huawei.com> writes: >>>>>> From: ZhangPeng <zhangpeng362@huawei.com> >>>>>> >>>>>> The major fault occurred when using mlockall(MCL_CURRENT | >>>>>> MCL_FUTURE) >>>>>> in application, which leading to an unexpected performance issue[1]. >>>>>> >>>>>> This caused by temporarily cleared PTE during a read/modify/write >>>>>> update >>>>>> of the PTE, eg, do_numa_page()/change_pte_range(). >>>>>> >>>>>> For the data segment of the user-mode program, the global >>>>>> variable area >>>>>> is a private mapping. After the pagecache is loaded, the private >>>>>> anonymous >>>>>> page is generated after the COW is triggered. Mlockall can lock >>>>>> COW pages >>>>>> (anonymous pages), but the original file pages cannot be locked >>>>>> and may >>>>>> be reclaimed. If the global variable (private anon page) is >>>>>> accessed when >>>>>> vmf->pte is zeroed in numa fault, a file page fault will be >>>>>> triggered. >>>>>> >>>>>> At this time, the original private file page may have been >>>>>> reclaimed. >>>>>> If the page cache is not available at this time, a major fault >>>>>> will be >>>>>> triggered and the file will be read, causing additional overhead. >>>>>> >>>>>> Fix this by rechecking the PTE without acquiring PTL in >>>>>> filemap_fault() >>>>>> before triggering a major fault. >>>>>> >>>>>> Testing file anonymous page read and write page fault performance >>>>>> in ext4 >>>>>> and ramdisk using will-it-scale[2] on a x86 physical machine. The >>>>>> data >>>>>> is the average change compared with the mainline after the patch is >>>>>> applied. The test results are within the range of fluctuation, >>>>>> and there >>>>>> is no obvious difference. The test results are as follows: >>>>>> processes processes_idle threads threads_idle >>>>>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>>>>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>>>>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>>>>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>>>>> >>>>>> [1] >>>>>> https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>>>>> [2] https://github.com/antonblanchard/will-it-scale/ >>>>>> >>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>>>>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>>>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>> --- >>>>>> RFC->v1: >>>>>> - Add error handling when ptep == NULL per Huang, Ying and >>>>>> Matthew Wilcox >>>>>> - Check the PTE without acquiring PTL in filemap_fault(), >>>>>> suggested by >>>>>> Huang, Ying and Yin Fengwei >>>>>> - Add pmd_none() check before PTE map >>>>>> - Update commit message and add performance test information >>>>>> >>>>>> mm/filemap.c | 18 ++++++++++++++++++ >>>>>> 1 file changed, 18 insertions(+) >>>>>> >>>>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>>>> index 142864338ca4..b29cdeb6a03b 100644 >>>>>> --- a/mm/filemap.c >>>>>> +++ b/mm/filemap.c >>>>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault >>>>>> *vmf) >>>>>> mapping_locked = true; >>>>>> } >>>>>> } else { >>>>>> + if (!pmd_none(*vmf->pmd)) { >>>>>> + pte_t *ptep; >>>>>> + >>>>>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>>>>> + vmf->address, &vmf->ptl); >>>>>> + if (unlikely(!ptep)) >>>>>> + return VM_FAULT_NOPAGE; >>>>>> + /* >>>>>> + * Recheck pte as the pte can be cleared temporarily >>>>>> + * during a read/modify/write update. >>>>>> + */ >>>>> I think that we should add some comments here about the racy >>>>> checking. >>>> I'll add comments in a v2 as follows: >>>> /* >>>> * Recheck PTE as the PTE can be cleared temporarily >>>> * during a read/modify/write update of the PTE, eg, >>>> * do_numa_page()/change_pte_range(). This will trigger >>>> * a major fault, even if we use mlockall, which may >>>> * affect performance. >>>> */ >>> Sorry, my previous words aren't clear enough. I mean some comments as >>> follows, >>> >>> We don't hold PTL here, so the check is still racy. But acquiring PTL >>> hurts performance and the race window seems small enough. >> >> Got it. I'll add comments in a v2 as follows: >> /* >> * Recheck PTE as the PTE can be cleared temporarily >> * during a read/modify/write update of the PTE. >> * We don't hold PTL here as acquiring PTL hurts >> * performance. So the check is still racy, but >> * the race window seems small enough. >> */ > > It'd be worth spelling out what happens when we lose the race. > I'll add what happens when we lose the race as follows: /* * Recheck PTE as the PTE can be cleared temporarily * during a read/modify/write update of the PTE, eg, * do_numa_page()/change_pte_range(). This will trigger * a major fault, even if we use mlockall, which may * affect performance. * We don't hold PTL here as acquiring PTL hurts * performance. So the check is still racy, but * the race window seems small enough. */
On 2/5/24 15:36, zhangpeng (AS) wrote: > On 2024/2/5 15:31, David Hildenbrand wrote: > >> On 05.02.24 08:24, zhangpeng (AS) wrote: >>> On 2024/2/5 14:52, Huang, Ying wrote: >>> >>>> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes: >>>>> On 2024/2/5 10:56, Huang, Ying wrote: >>>>>> Peng Zhang <zhangpeng362@huawei.com> writes: >>>>>>> From: ZhangPeng <zhangpeng362@huawei.com> >>>>>>> >>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >>>>>>> in application, which leading to an unexpected performance issue[1]. >>>>>>> >>>>>>> This caused by temporarily cleared PTE during a read/modify/write update >>>>>>> of the PTE, eg, do_numa_page()/change_pte_range(). >>>>>>> >>>>>>> For the data segment of the user-mode program, the global variable area >>>>>>> is a private mapping. After the pagecache is loaded, the private anonymous >>>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages >>>>>>> (anonymous pages), but the original file pages cannot be locked and may >>>>>>> be reclaimed. If the global variable (private anon page) is accessed when >>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >>>>>>> >>>>>>> At this time, the original private file page may have been reclaimed. >>>>>>> If the page cache is not available at this time, a major fault will be >>>>>>> triggered and the file will be read, causing additional overhead. >>>>>>> >>>>>>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >>>>>>> before triggering a major fault. >>>>>>> >>>>>>> Testing file anonymous page read and write page fault performance in ext4 >>>>>>> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >>>>>>> is the average change compared with the mainline after the patch is >>>>>>> applied. The test results are within the range of fluctuation, and there >>>>>>> is no obvious difference. The test results are as follows: >>>>>>> processes processes_idle threads threads_idle >>>>>>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>>>>>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>>>>>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>>>>>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>>>>>> >>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>>>>>> [2] https://github.com/antonblanchard/will-it-scale/ >>>>>>> >>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>>>>>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>>>>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>>> --- >>>>>>> RFC->v1: >>>>>>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >>>>>>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>>>>>> Huang, Ying and Yin Fengwei >>>>>>> - Add pmd_none() check before PTE map >>>>>>> - Update commit message and add performance test information >>>>>>> >>>>>>> mm/filemap.c | 18 ++++++++++++++++++ >>>>>>> 1 file changed, 18 insertions(+) >>>>>>> >>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>>>>> index 142864338ca4..b29cdeb6a03b 100644 >>>>>>> --- a/mm/filemap.c >>>>>>> +++ b/mm/filemap.c >>>>>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >>>>>>> mapping_locked = true; >>>>>>> } >>>>>>> } else { >>>>>>> + if (!pmd_none(*vmf->pmd)) { >>>>>>> + pte_t *ptep; >>>>>>> + >>>>>>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>>>>>> + vmf->address, &vmf->ptl); >>>>>>> + if (unlikely(!ptep)) >>>>>>> + return VM_FAULT_NOPAGE; >>>>>>> + /* >>>>>>> + * Recheck pte as the pte can be cleared temporarily >>>>>>> + * during a read/modify/write update. >>>>>>> + */ >>>>>> I think that we should add some comments here about the racy checking. >>>>> I'll add comments in a v2 as follows: >>>>> /* >>>>> * Recheck PTE as the PTE can be cleared temporarily >>>>> * during a read/modify/write update of the PTE, eg, >>>>> * do_numa_page()/change_pte_range(). This will trigger >>>>> * a major fault, even if we use mlockall, which may >>>>> * affect performance. >>>>> */ >>>> Sorry, my previous words aren't clear enough. I mean some comments as >>>> follows, >>>> >>>> We don't hold PTL here, so the check is still racy. But acquiring PTL >>>> hurts performance and the race window seems small enough. >>> >>> Got it. I'll add comments in a v2 as follows: >>> /* >>> * Recheck PTE as the PTE can be cleared temporarily >>> * during a read/modify/write update of the PTE. >>> * We don't hold PTL here as acquiring PTL hurts >>> * performance. So the check is still racy, but >>> * the race window seems small enough. >>> */ >> >> It'd be worth spelling out what happens when we lose the race. >> > I'll add what happens when we lose the race as follows: > /* > * Recheck PTE as the PTE can be cleared temporarily > * during a read/modify/write update of the PTE, eg, > * do_numa_page()/change_pte_range(). This will trigger > * a major fault, even if we use mlockall, which may > * affect performance. > * We don't hold PTL here as acquiring PTL hurts > * performance. So the check is still racy, but > * the race window seems small enough. > */ > I believe David was asking to add: "...but the race window seems small enough. If we lose the race during the check, the page_fault will be triggered. Butthe page table entry lock still make sure the correctness: - If the page cache is not reclaimed, the page_fault will work like the page fault was served already and bail out. - If the page cache is reclaimed, the major fault will be triggered, page cache is filled, page_fault also work like the page fault was served already and bail out. "
On 2024/2/5 16:40, Yin Fengwei wrote: > > On 2/5/24 15:36, zhangpeng (AS) wrote: >> On 2024/2/5 15:31, David Hildenbrand wrote: >> >>> On 05.02.24 08:24, zhangpeng (AS) wrote: >>>> On 2024/2/5 14:52, Huang, Ying wrote: >>>> >>>>> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes: >>>>>> On 2024/2/5 10:56, Huang, Ying wrote: >>>>>>> Peng Zhang <zhangpeng362@huawei.com> writes: >>>>>>>> From: ZhangPeng <zhangpeng362@huawei.com> >>>>>>>> >>>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >>>>>>>> in application, which leading to an unexpected performance issue[1]. >>>>>>>> >>>>>>>> This caused by temporarily cleared PTE during a read/modify/write update >>>>>>>> of the PTE, eg, do_numa_page()/change_pte_range(). >>>>>>>> >>>>>>>> For the data segment of the user-mode program, the global variable area >>>>>>>> is a private mapping. After the pagecache is loaded, the private anonymous >>>>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages >>>>>>>> (anonymous pages), but the original file pages cannot be locked and may >>>>>>>> be reclaimed. If the global variable (private anon page) is accessed when >>>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >>>>>>>> >>>>>>>> At this time, the original private file page may have been reclaimed. >>>>>>>> If the page cache is not available at this time, a major fault will be >>>>>>>> triggered and the file will be read, causing additional overhead. >>>>>>>> >>>>>>>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault() >>>>>>>> before triggering a major fault. >>>>>>>> >>>>>>>> Testing file anonymous page read and write page fault performance in ext4 >>>>>>>> and ramdisk using will-it-scale[2] on a x86 physical machine. The data >>>>>>>> is the average change compared with the mainline after the patch is >>>>>>>> applied. The test results are within the range of fluctuation, and there >>>>>>>> is no obvious difference. The test results are as follows: >>>>>>>> processes processes_idle threads threads_idle >>>>>>>> ext4 file write: -1.14% -0.08% -1.87% 0.13% >>>>>>>> ext4 file read: 0.03% -0.65% -0.51% -0.08% >>>>>>>> ramdisk file write: -1.21% -0.21% -1.12% 0.11% >>>>>>>> ramdisk file read: 0.00% -0.68% -0.33% -0.02% >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >>>>>>>> [2] https://github.com/antonblanchard/will-it-scale/ >>>>>>>> >>>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com> >>>>>>>> Suggested-by: Yin Fengwei <fengwei.yin@intel.com> >>>>>>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> >>>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>>>> --- >>>>>>>> RFC->v1: >>>>>>>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox >>>>>>>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>>>>>>> Huang, Ying and Yin Fengwei >>>>>>>> - Add pmd_none() check before PTE map >>>>>>>> - Update commit message and add performance test information >>>>>>>> >>>>>>>> mm/filemap.c | 18 ++++++++++++++++++ >>>>>>>> 1 file changed, 18 insertions(+) >>>>>>>> >>>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>>>>>> index 142864338ca4..b29cdeb6a03b 100644 >>>>>>>> --- a/mm/filemap.c >>>>>>>> +++ b/mm/filemap.c >>>>>>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) >>>>>>>> mapping_locked = true; >>>>>>>> } >>>>>>>> } else { >>>>>>>> + if (!pmd_none(*vmf->pmd)) { >>>>>>>> + pte_t *ptep; >>>>>>>> + >>>>>>>> + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, >>>>>>>> + vmf->address, &vmf->ptl); >>>>>>>> + if (unlikely(!ptep)) >>>>>>>> + return VM_FAULT_NOPAGE; >>>>>>>> + /* >>>>>>>> + * Recheck pte as the pte can be cleared temporarily >>>>>>>> + * during a read/modify/write update. >>>>>>>> + */ >>>>>>> I think that we should add some comments here about the racy checking. >>>>>> I'll add comments in a v2 as follows: >>>>>> /* >>>>>> * Recheck PTE as the PTE can be cleared temporarily >>>>>> * during a read/modify/write update of the PTE, eg, >>>>>> * do_numa_page()/change_pte_range(). This will trigger >>>>>> * a major fault, even if we use mlockall, which may >>>>>> * affect performance. >>>>>> */ >>>>> Sorry, my previous words aren't clear enough. I mean some comments as >>>>> follows, >>>>> >>>>> We don't hold PTL here, so the check is still racy. But acquiring PTL >>>>> hurts performance and the race window seems small enough. >>>> Got it. I'll add comments in a v2 as follows: >>>> /* >>>> * Recheck PTE as the PTE can be cleared temporarily >>>> * during a read/modify/write update of the PTE. >>>> * We don't hold PTL here as acquiring PTL hurts >>>> * performance. So the check is still racy, but >>>> * the race window seems small enough. >>>> */ >>> It'd be worth spelling out what happens when we lose the race. >>> >> I'll add what happens when we lose the race as follows: >> /* >> * Recheck PTE as the PTE can be cleared temporarily >> * during a read/modify/write update of the PTE, eg, >> * do_numa_page()/change_pte_range(). This will trigger >> * a major fault, even if we use mlockall, which may >> * affect performance. >> * We don't hold PTL here as acquiring PTL hurts >> * performance. So the check is still racy, but >> * the race window seems small enough. >> */ >> > I believe David was asking to add: > > "...but the race window seems small enough. > > If we lose the race during the check, the page_fault will > be triggered. Butthe page table entry lock still make sure > the correctness: > - If the page cache is not reclaimed, the page_fault will > work like the page fault was served already and bail out. > - If the page cache is reclaimed, the major fault will be > triggered, page cache is filled, page_fault also work > like the page fault was served already and bail out. > " Got it. Thanks!
diff --git a/mm/filemap.c b/mm/filemap.c index 142864338ca4..b29cdeb6a03b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) mapping_locked = true; } } else { + if (!pmd_none(*vmf->pmd)) { + pte_t *ptep; + + ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + if (unlikely(!ptep)) + return VM_FAULT_NOPAGE; + /* + * Recheck pte as the pte can be cleared temporarily + * during a read/modify/write update. + */ + if (unlikely(!pte_none(ptep_get_lockless(ptep)))) + ret = VM_FAULT_NOPAGE; + pte_unmap(ptep); + if (unlikely(ret)) + return ret; + } + /* No page in the page cache at all */ count_vm_event(PGMAJFAULT); count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);