Message ID | 1526555193-7242-11-git-send-email-ldufour@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Hi Laurent,

On Thu, May 17, 2018 at 4:37 PM Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
>
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
>
> This patch provides protection against the VMA modification done in:
> - madvise()
> - mpol_rebind_policy()
> - vma_replace_policy()
> - change_prot_numa()
> - mlock(), munlock()
> - mprotect()
> - mmap_region()
> - collapse_huge_page()
> - userfaultfd registering services
>
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE to prevent writes from being
> split and intermediate values from being pushed to other CPUs.
>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  fs/proc/task_mmu.c |  5 ++++-
>  fs/userfaultfd.c   | 17 +++++++++++++----
>  mm/khugepaged.c    |  3 +++
>  mm/madvise.c       |  6 +++++-
>  mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>  mm/mlock.c         | 13 ++++++++-----
>  mm/mmap.c          | 22 +++++++++++++---------
>  mm/mprotect.c      |  4 +++-
>  mm/swap_state.c    |  8 ++++++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
>
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>                                      struct vm_fault *vmf)
> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>                                       unsigned long *start,
>                                       unsigned long *end)
>  {
> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>                       PFN_DOWN(faddr & PMD_MASK));
> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>                       PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>
> --
> 2.7.4
>

I have got a crash on a 4.14 kernel with speculative page faults enabled,
and here is my analysis of the problem. The issue was reported only once.

[23409.303395] el1_da+0x24/0x84
[23409.303400] __radix_tree_lookup+0x8/0x90
[23409.303407] find_get_entry+0x64/0x14c
[23409.303410] pagecache_get_page+0x5c/0x27c
[23409.303416] __read_swap_cache_async+0x80/0x260
[23409.303420] swap_vma_readahead+0x264/0x37c
[23409.303423] swapin_readahead+0x5c/0x6c
[23409.303428] do_swap_page+0x128/0x6e4
[23409.303431] handle_pte_fault+0x230/0xca4
[23409.303435] __handle_speculative_fault+0x57c/0x7c8
[23409.303438] do_page_fault+0x228/0x3e8
[23409.303442] do_translation_fault+0x50/0x6c
[23409.303445] do_mem_abort+0x5c/0xe0
[23409.303447] el0_da+0x20/0x24

Process A accesses address ADDR (part of VMA A) and that results in a
translation fault. The kernel enters __handle_speculative_fault to fix the
fault, and process A enters do_swap_page->swapin_readahead->swap_vma_readahead
from the speculative path. During this time, another process B, which shares
the same mm, does an mprotect from another CPU; it goes through
mprotect_fixup->__split_vma and splits VMA A into VMAs A and B. After the
split, ADDR falls into VMA B, but process A is still using VMA A, so ADDR is
now greater than both VMA_A->vm_start and VMA_A->vm_end.
swap_vma_readahead->swap_ra_info uses the start and end of the VMA to
calculate ptes and nr_pte, which goes wrong because of this, finally resulting
in a wrong "entry" being passed to swap_vma_readahead->__read_swap_cache_async,
and in turn an invalid swapper_space being passed to
__read_swap_cache_async->find_get_page, causing an abort.

The fix I have tried is to also cache vm_start and vm_end in vmf and use them
in swap_ra_clamp_pfn. Let me know your thoughts on this. I can send the patch
I am using if you feel that is the right thing to do.

Thanks,
Vinayak
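As a concrete illustration of the fix Vinayak describes, here is a minimal sketch, assuming the speculative handler snapshots the boundaries into struct vm_fault while its view of the VMA is still consistent; the helper name and the snapshot parameters are invented for this sketch and are not part of the posted series.

/*
 * Hypothetical variant of swap_ra_clamp_pfn(): clamp the readahead
 * window against boundary values cached at fault time instead of
 * re-reading vma->vm_start/vm_end, which may have changed if the VMA
 * was split or resized concurrently.
 */
static inline void swap_ra_clamp_pfn_cached(unsigned long vm_start,
					    unsigned long vm_end,
					    unsigned long faddr,
					    unsigned long lpfn,
					    unsigned long rpfn,
					    unsigned long *start,
					    unsigned long *end)
{
	*start = max3(lpfn, PFN_DOWN(vm_start),
		      PFN_DOWN(faddr & PMD_MASK));
	*end = min3(rpfn, PFN_DOWN(vm_end),
		    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}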
Le 05/11/2018 à 08:04, vinayak menon a écrit :
> Hi Laurent,
>
> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
> <ldufour@linux.vnet.ibm.com> wrote:
>>
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> [...]
>
> I have got a crash on a 4.14 kernel with speculative page faults enabled,
> and here is my analysis of the problem.
> The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.

> [...]
>
> Process A accesses address ADDR (part of VMA A) and that results in a
> translation fault. The kernel enters __handle_speculative_fault to fix
> the fault, and process A enters
> do_swap_page->swapin_readahead->swap_vma_readahead from the speculative
> path.
> [...]
>
> The fix I have tried is to also cache vm_start and vm_end in vmf and
> use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
> send the patch I am using if you feel that is the right thing to do.

I think the best would be not to do swap readahead during the speculative
page fault. If the page is found in the swap cache, that's fine, but
otherwise we should fall back to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a try.
I'll review that for the next series.

Thanks,
Laurent.

From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA, leading to a VMA
split. This raises a panic in swap_vma_readahead() because the VMA's
boundaries no longer match the faulting address.

To avoid this, if the page is not found in the swap cache, the speculative
page fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
+		} else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+			/*
+			 * Don't try readahead during a speculative page fault as
+			 * the VMA's boundaries may change in our back.
+			 * If the page is not in the swap cache and synchronous read
+			 * is disabled, fall back to the regular page fault mechanism.
+			 */
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			ret = VM_FAULT_RETRY;
+			goto out;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
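For context, a rough sketch of how the VM_FAULT_RETRY above is meant to be consumed, assuming the series' __handle_speculative_fault() takes (mm, address, flags); the exact prototype and the surrounding error handling vary per architecture and per revision of the series, so this is an illustration rather than the actual arch code.

/*
 * When the speculative attempt gives up (for instance via the
 * VM_FAULT_RETRY added in the patch above), the fault is redone on
 * the classic mmap_sem-protected path, where swap readahead is safe
 * because the VMA cannot be split under us.
 */
static vm_fault_t example_handle_fault(struct mm_struct *mm,
				       unsigned long address,
				       unsigned int flags)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	fault = __handle_speculative_fault(mm, address, flags);
	if (fault != VM_FAULT_RETRY)
		return fault;			/* handled speculatively */

	/* Fall back to the regular, mmap_sem-protected page fault. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	if (!vma || vma->vm_start > address)
		fault = VM_FAULT_SIGSEGV;	/* simplified: no stack expansion */
	else
		fault = handle_mm_fault(vma, address, flags);
	up_read(&mm->mmap_sem);

	return fault;
}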
On 11/5/2018 11:52 PM, Laurent Dufour wrote:
> Le 05/11/2018 à 08:04, vinayak menon a écrit :
>> Hi Laurent,
>>
>> [...]
>>
>> The fix I have tried is to also cache vm_start and vm_end in vmf and
>> use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
>> send the patch I am using if you feel that is the right thing to do.
>
> I think the best would be not to do swap readahead during the speculative
> page fault. If the page is found in the swap cache, that's fine, but
> otherwise we should fall back to the regular page fault.
>
> The attached -untested- patch is doing this, if you want to give it a try.
> I'll review that for the next series.
>

Thanks Laurent. I am going to try this patch.

With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in a
non-speculative fault and a retry, wouldn't this have an impact on some
perf numbers? If so, would caching start and end be a better option?

Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside
swapin_readahead, in a way that swap_cluster_readahead can take the
speculative path? swap_cluster_readahead doesn't seem to use vma values.

Thanks,
Vinayak
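One possible reading of that suggestion, sketched against mm/swap_state.c: only the VMA-based readahead dereferences vma->vm_start/vm_end, so the speculative check could live in swapin_readahead() and leave cluster readahead on the speculative path. swap_use_vma_readahead() is assumed to be the existing readahead-mode test in that file, and whether silently switching to cluster readahead here is acceptable is exactly the open question above, so treat this as an untested illustration.

/*
 * Sketch only: skip the VMA-based readahead when the fault is
 * speculative, since it reads the VMA boundaries that may change
 * under us, but keep cluster readahead, which only needs the swap
 * entry and the swap device state.
 */
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			      struct vm_fault *vmf)
{
	if (swap_use_vma_readahead() &&
	    !(vmf->flags & FAULT_FLAG_SPECULATIVE))
		return swap_vma_readahead(entry, gfp_mask, vmf);

	return swap_cluster_readahead(entry, gfp_mask, vmf);
}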
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 597969db9e90..7247d6d5afba 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1137,8 +1137,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, goto out_mm; } for (vma = mm->mmap; vma; vma = vma->vm_next) { - vma->vm_flags &= ~VM_SOFTDIRTY; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, + vma->vm_flags & ~VM_SOFTDIRTY); vma_set_page_prot(vma); + vm_write_end(vma); } downgrade_write(&mm->mmap_sem); break; diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index cec550c8468f..b8212ba17695 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) octx = vma->vm_userfaultfd_ctx.ctx; if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { + vm_write_begin(vma); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); + WRITE_ONCE(vma->vm_flags, + vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING)); + vm_write_end(vma); return 0; } @@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file) vma = prev; else prev = vma; - vma->vm_flags = new_flags; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, new_flags); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vm_write_end(vma); } up_write(&mm->mmap_sem); mmput(mm); @@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, * the next vma was merged into the current one and * the current one has not been updated yet. */ - vma->vm_flags = new_flags; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, new_flags); vma->vm_userfaultfd_ctx.ctx = ctx; + vm_write_end(vma); skip: prev = vma; @@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, * the next vma was merged into the current one and * the current one has not been updated yet. */ - vma->vm_flags = new_flags; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, new_flags); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vm_write_end(vma); skip: prev = vma; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index d7b2a4bf8671..0b28af4b950d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm, if (mm_find_pmd(mm, address) != pmd) goto out; + vm_write_begin(vma); anon_vma_lock_write(vma->anon_vma); pte = pte_offset_map(pmd, address); @@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm, pmd_populate(mm, pmd, pmd_pgtable(_pmd)); spin_unlock(pmd_ptl); anon_vma_unlock_write(vma->anon_vma); + vm_write_end(vma); result = SCAN_FAIL; goto out; } @@ -1080,6 +1082,7 @@ static void collapse_huge_page(struct mm_struct *mm, set_pmd_at(mm, address, pmd, _pmd); update_mmu_cache_pmd(vma, address, pmd); spin_unlock(pmd_ptl); + vm_write_end(vma); *hpage = NULL; diff --git a/mm/madvise.c b/mm/madvise.c index 4d3c922ea1a1..e328f7ab5942 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma, /* * vm_flags is protected by the mmap_sem held in write mode. 
*/ - vma->vm_flags = new_flags; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, new_flags); + vm_write_end(vma); out: return error; } @@ -450,9 +452,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb, .private = tlb, }; + vm_write_begin(vma); tlb_start_vma(tlb, vma); walk_page_range(addr, end, &free_walk); tlb_end_vma(tlb, vma); + vm_write_end(vma); } static int madvise_free_single_vma(struct vm_area_struct *vma, diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 9ac49ef17b4e..898d325c9fea 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -380,8 +380,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new) struct vm_area_struct *vma; down_write(&mm->mmap_sem); - for (vma = mm->mmap; vma; vma = vma->vm_next) + for (vma = mm->mmap; vma; vma = vma->vm_next) { + vm_write_begin(vma); mpol_rebind_policy(vma->vm_policy, new); + vm_write_end(vma); + } up_write(&mm->mmap_sem); } @@ -554,9 +557,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, { int nr_updated; + vm_write_begin(vma); nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1); if (nr_updated) count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); + vm_write_end(vma); return nr_updated; } @@ -657,6 +662,7 @@ static int vma_replace_policy(struct vm_area_struct *vma, if (IS_ERR(new)) return PTR_ERR(new); + vm_write_begin(vma); if (vma->vm_ops && vma->vm_ops->set_policy) { err = vma->vm_ops->set_policy(vma, new); if (err) @@ -664,11 +670,17 @@ static int vma_replace_policy(struct vm_area_struct *vma, } old = vma->vm_policy; - vma->vm_policy = new; /* protected by mmap_sem */ + /* + * The speculative page fault handler accesses this field without + * hodling the mmap_sem. + */ + WRITE_ONCE(vma->vm_policy, new); + vm_write_end(vma); mpol_put(old); return 0; err_out: + vm_write_end(vma); mpol_put(new); return err; } @@ -1614,23 +1626,28 @@ COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid, struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = NULL; + struct mempolicy *pol; - if (vma) { - if (vma->vm_ops && vma->vm_ops->get_policy) { - pol = vma->vm_ops->get_policy(vma, addr); - } else if (vma->vm_policy) { - pol = vma->vm_policy; + if (!vma) + return NULL; - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); - } + if (vma->vm_ops && vma->vm_ops->get_policy) + return vma->vm_ops->get_policy(vma, addr); + + /* + * This could be called without holding the mmap_sem in the + * speculative page fault handler's path. + */ + pol = READ_ONCE(vma->vm_policy); + if (pol) { + /* + * shmem_alloc_page() passes MPOL_F_SHARED policy with + * a pseudo vma whose vma->vm_ops=NULL. 
Take a reference + * count on these policies which will be dropped by + * mpol_cond_put() later + */ + if (mpol_needs_cond_ref(pol)) + mpol_get(pol); } return pol; diff --git a/mm/mlock.c b/mm/mlock.c index 74e5a6547c3d..c40285c94ced 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -445,7 +445,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK); + vm_write_end(vma); while (start < end) { struct page *page; @@ -568,10 +570,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, * It's okay if try_to_unmap_one unmaps a page just after we * set VM_LOCKED, populate_vma_page_range will bring it back. */ - - if (lock) - vma->vm_flags = newflags; - else + if (lock) { + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, newflags); + vm_write_end(vma); + } else munlock_vma_pages_range(vma, start, end); out: diff --git a/mm/mmap.c b/mm/mmap.c index eeafd0bc8b36..add13b4e1d8d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -852,17 +852,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, } if (start != vma->vm_start) { - vma->vm_start = start; + WRITE_ONCE(vma->vm_start, start); start_changed = true; } if (end != vma->vm_end) { - vma->vm_end = end; + WRITE_ONCE(vma->vm_end, end); end_changed = true; } - vma->vm_pgoff = pgoff; + WRITE_ONCE(vma->vm_pgoff, pgoff); if (adjust_next) { - next->vm_start += adjust_next << PAGE_SHIFT; - next->vm_pgoff += adjust_next; + WRITE_ONCE(next->vm_start, + next->vm_start + (adjust_next << PAGE_SHIFT)); + WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next); } if (root) { @@ -1793,13 +1794,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr, out: perf_event_mmap(vma); + vm_write_begin(vma); vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); if (vm_flags & VM_LOCKED) { if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))) mm->locked_vm += (len >> PAGE_SHIFT); else - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; + WRITE_ONCE(vma->vm_flags, + vma->vm_flags & VM_LOCKED_CLEAR_MASK); } if (file) @@ -1812,9 +1815,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * then new mapped in-place (which must be aimed as * a completely new data area). */ - vma->vm_flags |= VM_SOFTDIRTY; + WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY); vma_set_page_prot(vma); + vm_write_end(vma); return addr; @@ -2443,8 +2447,8 @@ int expand_downwards(struct vm_area_struct *vma, mm->locked_vm += grow; vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); - vma->vm_start = address; - vma->vm_pgoff -= grow; + WRITE_ONCE(vma->vm_start, address); + WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow); anon_vma_interval_tree_post_update_vma(vma); vma_gap_update(vma); spin_unlock(&mm->page_table_lock); diff --git a/mm/mprotect.c b/mm/mprotect.c index 625608bc8962..83594cc68062 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -375,12 +375,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, * vm_flags and vm_page_prot are protected by the mmap_sem * held in write mode. 
*/ - vma->vm_flags = newflags; + vm_write_begin(vma); + WRITE_ONCE(vma->vm_flags, newflags); dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot); vma_set_page_prot(vma); change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable, 0); + vm_write_end(vma); /* * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major diff --git a/mm/swap_state.c b/mm/swap_state.c index c6b3eab73fde..2ee7198df281 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -572,6 +572,10 @@ static unsigned long swapin_nr_pages(unsigned long offset) * the readahead. * * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL. + * This is needed to ensure the VMA will not be freed in our back. In the case + * of the speculative page fault handler, this cannot happen, even if we don't + * hold the mmap_sem. Callees are assumed to take care of reading VMA's fields + * using READ_ONCE() to read consistent values. */ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct vm_fault *vmf) @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, unsigned long *start, unsigned long *end) { - *start = max3(lpfn, PFN_DOWN(vma->vm_start), + *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)), PFN_DOWN(faddr & PMD_MASK)); - *end = min3(rpfn, PFN_DOWN(vma->vm_end), + *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)), PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); }
The VMA sequence count has been introduced to allow fast detection of VMA
modification when running a page fault handler without holding the mmap_sem.

This patch provides protection against the VMA modification done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault path
need to be written using WRITE_ONCE to prevent writes from being split and
intermediate values from being pushed to other CPUs.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  5 ++++-
 fs/userfaultfd.c   | 17 +++++++++++++----
 mm/khugepaged.c    |  3 +++
 mm/madvise.c       |  6 +++++-
 mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
 mm/mlock.c         | 13 ++++++++-----
 mm/mmap.c          | 22 +++++++++++++---------
 mm/mprotect.c      |  4 +++-
 mm/swap_state.c    |  8 ++++++--
 9 files changed, 89 insertions(+), 40 deletions(-)
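To summarize the locking pattern this patch introduces, here is a minimal sketch of the write and read sides. vm_write_begin()/vm_write_end() are the helpers used throughout the diff; the vm_sequence field name and the use of the generic seqcount read API on the reader side are assumptions about the rest of the series, not something shown in this patch.

/*
 * Write side, as used at every site touched by the patch: the field
 * update is bracketed by the VMA sequence count and done with
 * WRITE_ONCE() so that lockless readers never observe a torn value.
 */
static void example_update_flags(struct vm_area_struct *vma,
				 unsigned long newflags)
{
	vm_write_begin(vma);
	WRITE_ONCE(vma->vm_flags, newflags);
	vm_write_end(vma);
}

/*
 * Read side, as the speculative fault path is expected to use it
 * (vma->vm_sequence is an assumed field name for the series'
 * seqcount): sample the sequence count, read the fields with
 * READ_ONCE(), and fall back to a regular fault if a writer ran
 * concurrently.
 */
static bool example_vma_is_stable(struct vm_area_struct *vma,
				  unsigned long address)
{
	unsigned int seq;

	seq = read_seqcount_begin(&vma->vm_sequence);
	if (address < READ_ONCE(vma->vm_start) ||
	    address >= READ_ONCE(vma->vm_end))
		return false;		/* address no longer in this VMA */

	return !read_seqcount_retry(&vma->vm_sequence, seq);
}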