Message ID | 20210107200402.31095-3-aarcange@redhat.com (mailing list archive)
---|---
State | New, archived
Series | page_count can't be used to decide when wp_page_copy
On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> However there are two cases that could wrprotecting exclusive anon
> pages with only the mmap_read_lock:

I still think the real fix is "Don't do that then", and just take the
write lock.

The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
normal operation. Same goes for softdirty.

Why have those become _so_ magical that they can break the VM for
everybody else?

            Linus
On Thu, Jan 7, 2021 at 12:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I still think the real fix is "Don't do that then", and just take the
> write lock.

The alternative, of course, is to just make sure the page table flush
is done inside the page table lock (and then we make COW do the copy
inside of it).

But this whole "we know UFFD breaks all rules, we'll add even more
crap to it" approach is horrendous.

            Linus
Hi Linus,

On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > However there are two cases that could wrprotecting exclusive anon
> > pages with only the mmap_read_lock:
>
> I still think the real fix is "Don't do that then", and just take the
> write lock.
>
> The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
> normal operation. Same goes for softdirty.
>
> Why have those become _so_ magical that they can break the VM for
> everybody else?

I see what you mean above and I agree. Like said below:

==
In simple terms: the page_count check in do_wp_page makes it
impossible to wrprotect memory, if such memory is under a !FOLL_WRITE
GUP pin.
==

So to simplify, let's ignore UFFDIO_WRITEPROTECT here, which is new
and adds no dependency on top of clear_refs in this respect.

So yes, if we drop any code that has to wrprotect memory in place from
the kernel (since all userland memory can be under a GUP pin in read
mode) and we make such an operation illegal, then it's fine, but that
means clear_refs has to go, or it has to fail if there's a GUP pin
during the wrprotection.

The problem is that it's not even possible to detect reliably if
there's really a long term GUP pin, because of speculative pagecache
lookups.

We would need to declare that any secondary MMU which is supposed to
be VM-neutral using mmu notifier, like a GPU or an RDMA device, cannot
be used in combination with clear_refs, and that would need to be
enforced fully in userland. Most mmu notifier users drop the GUP pin
during the invalidate for extra safety in case an invalidate goes
missing: they would all need to drop FOLL_GET to be compliant and stop
causing memory corruption, if clear_refs shall still be allowed to
happen on mmu notifier capable secondary MMUs.

Even then, how does userland know which devices attach to the memory
with mmu notifier and never use FOLL_GET, and which don't?

It doesn't sound reliable to enforce this in userland. So I don't see
how clear_refs can be saved.

Now let's make another example that still shows at least some
fundamental inefficiency that has nothing to do with clear_refs.

Let's suppose a GUP pin is taken on subpageA by an RDMA device in
process A (the parent). Let's now assume subpageB is mapped in process
B (child of process A). Both subpageA and subpageB are exclusive
(mapcount == 1) and read-write, but they share the same page_count
atomic counter (only the page_mapcounts are subpage granular).

To still tame the zygote concern with the page_count in do_wp_page,
process B, when it forks a child (process C), would forever have to do
an extra superfluous COW, even after process C exits.

Is that what we want, on top of the fundamental unsafety added to
clear_refs?

Thanks,
Andrea
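Andrea's subpage example can be expressed as a toy userspace model (a sketch only: the struct, helper names, and counts here are illustrative and not kernel code; the 512 constant just mirrors the 2M-THP/4k-subpage geometry). The point it demonstrates: the refcount is compound-wide while mapcounts are per-subpage, so a pin on subpage A defeats a page_count-based reuse test for subpage B as well.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_NR_SUBPAGES 512   /* one 2M THP = 512 4k subpages */

/* Toy model: the whole compound page has ONE refcount, but each 4k
 * subpage has its own mapcount. */
struct compound_page {
    int refcount;                       /* models page_count() */
    int mapcount[HPAGE_NR_SUBPAGES];    /* models per-subpage page_mapcount() */
};

/* A do_wp_page-style reuse test driven only by the shared refcount:
 * the subpage index cannot help, because the counter is compound-wide. */
static bool can_reuse_subpage(const struct compound_page *page, int subpage)
{
    (void)subpage;
    return page->refcount == 1;
}
```

With subpageA and subpageB both exclusive (mapcount 1), a single extra reference taken on subpageA flips the reuse decision for subpageB too, forcing the spurious COW described above.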
On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> The problem is it's not even possible to detect reliably if there's
> really a long term GUP pin because of speculative pagecache lookups.

So none of the normal code _needs_ that any more these days, which is
what I think is so nice. Any pinning will do the COW, and then we have
the logic to make sure it stays writable, and that keeps everything
nicely coherent and is all fairly simple.

And yes, it does mean that if somebody then explicitly write-protects
a page, it may end up being COW'ed after all, but if you first pinned
it, and then started playing with the protections of that page, why
should you be surprised?

So to me, this sounds like a "don't do that then" situation.

Anybody who does page pinning and wants coherency should NOT TOUCH THE
MAPPING IT PINNED.

(And if you do touch it, it's your own fault, and you get to keep both
of the broken pieces)

Now, I do agree that from a QoI standpoint, it would be really lovely
if we actually enforced it. I'm not entirely sure we can, but maybe it
would be reasonable to use that

    mm->has_pinned && page_maybe_dma_pinned(page)

at least as the beginning of a heuristic.

In fact, I do think that "page_maybe_dma_pinned()" could possibly be
made stronger than it is. Because at *THAT* point, we might say "we
know a pinned page always must have a page_mapcount() of 1" - since as
part of pinning it and doing the GUP_PIN, we forced the COW, and then
subsequent fork() operations enforce it too.

So I do think that it might be possible to make that clear_refs code
notice "this page is pinned, I can't mark it WP without the pinning
coherency breaking".

It might not even be hard. But admittedly I'm somewhat handwaving
here, and I might not have thought of some situation.

            Linus
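The page_maybe_dma_pinned() heuristic referred to here can be modeled in a few lines of userspace C. This is a sketch only: the bias constant mirrors the scheme in mm/gup.c, but the struct and helpers are illustrative, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* FOLL_PIN adds a large bias to the refcount so that pins stand out
 * from ordinary transient references (constant mirrors mm/gup.c). */
#define GUP_PIN_COUNTING_BIAS 1024

struct page {
    int refcount;   /* models page_count() */
};

/* models pin_user_pages() taking a FOLL_PIN reference */
static void pin_page(struct page *page)
{
    page->refcount += GUP_PIN_COUNTING_BIAS;
}

/* models unpin_user_page() */
static void unpin_page(struct page *page)
{
    page->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* "maybe": ~1024 simultaneous transient references would also trip
 * this, so it is a heuristic, not a guarantee. */
static bool page_maybe_dma_pinned(const struct page *page)
{
    return page->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

The false-positive window is exactly why the speculative pagecache lookups discussed above make the count unstable: they add ordinary references that the heuristic cannot distinguish from a real pin with certainty.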
Hi Andrea,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master hnaz-linux-mm/master v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/page_count-can-t-be-used-to-decide-when-wp_page_copy/20210108-040616
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 09162bc32c880a791c6c0668ce0745cf7958f576
compiler: nds32le-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

cppcheck warnings: (new ones prefixed by >>)

>> fs/proc/task_mmu.c:1248:33: warning: Uninitialized variable: tmp [uninitvar]
    for (vma = mm->mmap; vma != tmp;
                                ^

vim +1248 fs/proc/task_mmu.c

  1183
  1184  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
  1185                                  size_t count, loff_t *ppos)
  1186  {
  1187          struct task_struct *task;
  1188          char buffer[PROC_NUMBUF];
  1189          struct mm_struct *mm;
  1190          struct vm_area_struct *vma;
  1191          enum clear_refs_types type;
  1192          int itype;
  1193          int rv;
  1194
  1195          memset(buffer, 0, sizeof(buffer));
  1196          if (count > sizeof(buffer) - 1)
  1197                  count = sizeof(buffer) - 1;
  1198          if (copy_from_user(buffer, buf, count))
  1199                  return -EFAULT;
  1200          rv = kstrtoint(strstrip(buffer), 10, &itype);
  1201          if (rv < 0)
  1202                  return rv;
  1203          type = (enum clear_refs_types)itype;
  1204          if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
  1205                  return -EINVAL;
  1206
  1207          task = get_proc_task(file_inode(file));
  1208          if (!task)
  1209                  return -ESRCH;
  1210          mm = get_task_mm(task);
  1211          if (mm) {
  1212                  struct mmu_notifier_range range;
  1213                  struct clear_refs_private cp = {
  1214                          .type = type,
  1215                  };
  1216
  1217                  if (type == CLEAR_REFS_MM_HIWATER_RSS) {
  1218                          if (mmap_write_lock_killable(mm)) {
  1219                                  count = -EINTR;
  1220                                  goto out_mm;
  1221                          }
  1222
  1223                          /*
  1224                           * Writing 5 to /proc/pid/clear_refs resets the peak
  1225                           * resident set size to this mm's current rss value.
  1226                           */
  1227                          reset_mm_hiwater_rss(mm);
  1228                          mmap_write_unlock(mm);
  1229                          goto out_mm;
  1230                  }
  1231
  1232                  if (mmap_read_lock_killable(mm)) {
  1233                          count = -EINTR;
  1234                          goto out_mm;
  1235                  }
  1236                  if (type == CLEAR_REFS_SOFT_DIRTY) {
  1237                          for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1238                                  struct vm_area_struct *tmp;
  1239                                  if (!(vma->vm_flags & VM_SOFTDIRTY)) {
  1240                                          inc_wrprotect_tlb_flush_pending(vma);
  1241                                          continue;
  1242                                  }
  1243
  1244                                  /*
  1245                                   * Rollback wrprotect_tlb_flush_pending before
  1246                                   * releasing the mmap_lock.
  1247                                   */
> 1248                                  for (vma = mm->mmap; vma != tmp;
  1249                                       vma = vma->vm_next)
  1250                                          dec_wrprotect_tlb_flush_pending(vma);
  1251
  1252                                  mmap_read_unlock(mm);
  1253                                  if (mmap_write_lock_killable(mm)) {
  1254                                          count = -EINTR;
  1255                                          goto out_mm;
  1256                                  }
  1257                                  for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1258                                          vma->vm_flags &= ~VM_SOFTDIRTY;
  1259                                          vma_set_page_prot(vma);
  1260                                          inc_wrprotect_tlb_flush_pending(vma);
  1261                                  }
  1262                                  mmap_write_downgrade(mm);
  1263                                  break;
  1264                          }
  1265
  1266                          inc_tlb_flush_pending(mm);
  1267                          mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
  1268                                                  0, NULL, mm, 0, -1UL);
  1269                          mmu_notifier_invalidate_range_start(&range);
  1270                  }
  1271                  walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
  1272                                  &cp);
  1273                  if (type == CLEAR_REFS_SOFT_DIRTY) {
  1274                          mmu_notifier_invalidate_range_end(&range);
  1275                          flush_tlb_mm(mm);
  1276                          for (vma = mm->mmap; vma; vma = vma->vm_next)
  1277                                  dec_wrprotect_tlb_flush_pending(vma);
  1278                          dec_tlb_flush_pending(mm);
  1279                  }
  1280                  mmap_read_unlock(mm);
  1281  out_mm:
  1282                  mmput(mm);
  1283          }
  1284          put_task_struct(task);
  1285
  1286          return count;
  1287  }
  1288

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On 1/7/21 1:29 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> The problem is it's not even possible to detect reliably if there's
>> really a long term GUP pin because of speculative pagecache lookups.
>
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
>
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
>
> So to me, this sounds like a "don't do that then" situation.
>
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
>
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
>
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
>
>     mm->has_pinned && page_maybe_dma_pinned(page)
>
> at least as the beginning of a heuristic.
>
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we

What exactly did you have in mind, to make it stronger? I think the
answer is in this email but I don't quite see it yet...

Also, now seems to be a good time to mention that I've been thinking
about a number of pup/gup pinning cases (Direct IO, GPU/NIC,
NVMe/storage peer to peer with GUP/NIC, and HMM support for atomic
operations from a device). And it seems like the following approach
would help:

* Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here
(thanks to Jason for this point) means "user space owns the lifetime".
We might even end up deleting either FOLL_PIN or FOLL_LONGTERM,
because this would make them mean the same thing. The idea is that
there are no "short term" pins of this kind of memory.

* Continue to use FOLL_GET (only) for Direct IO. That's a big change
of plans, because several of us had thought that Direct IO needs
FOLL_PIN. However, this recent conversation, plus my list of cases
above, seems to indicate otherwise. That's because we only have one
refcount approach for marking pages in this way, and we should spend
it on the long-term pinned pages. Those are both hard to identify
otherwise, and actionable once we identify them. Direct IO pins, on
the other hand, are more transient. We can probably live without
tagging Direct IO pages as FOLL_PIN. I think.

This is all assuming that we make progress in the area of "if it's not
a page_maybe_dma_pinned() page, then we can wait for it or otherwise
do reasonable things about the refcount". So we end up with a clear
(-ish) difference between pages that can be waited for, and pages that
should not be waited for in the kernel.

I hope this helps, but if it's too much of a side-track, please
disregard.

thanks,
On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> >
> > Now, I do agree that from a QoI standpoint, it would be really lovely
> > if we actually enforced it. I'm not entirely sure we can, but maybe it
> > would be reasonable to use that
> >
> >     mm->has_pinned && page_maybe_dma_pinned(page)
> >
> > at least as the beginning of a heuristic.
> >
> > In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> > made stronger than it is. Because at *THAT* point, we might say "we
>
> What exactly did you have in mind, to make it stronger? I think the
> answer is in this email but I don't quite see it yet...

Literally just adding a " && page_mapcount(page) == 1" in there
(probably best done inside page_maybe_dma_pinned() itself)

> Direct IO pins, on the other hand, are more transient. We can probably live
> without tagging Direct IO pages as FOLL_PIN. I think.

Yes. I think direct-IO writes should be able to just do a transient
GUP, and if it causes a COW fault that isn't coherent, that's the
correct semantics, I think (ie the direct-IO will see the original
data, the COW faulter will get its own private copy to make changes
to).

I think pinning should be primarily limited to things that _require_
coherency (ie you pin because you're going to do some active two-way
communication using that page)

Does that match your thinking?

            Linus
On 1/7/21 2:00 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>>>
>>> Now, I do agree that from a QoI standpoint, it would be really lovely
>>> if we actually enforced it. I'm not entirely sure we can, but maybe it
>>> would be reasonable to use that
>>>
>>>     mm->has_pinned && page_maybe_dma_pinned(page)
>>>
>>> at least as the beginning of a heuristic.
>>>
>>> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
>>> made stronger than it is. Because at *THAT* point, we might say "we
>>
>> What exactly did you have in mind, to make it stronger? I think the
>> answer is in this email but I don't quite see it yet...
>
> Literally just adding a " && page_mapcount(page) == 1" in there
> (probably best done inside page_maybe_dma_pinned() itself)

Well, that means that pages that are used for pinned DMA like this,
can not be shared with other processes. Is that an acceptable
limitation for the RDMA users? It seems a bit constraining, at first
glance anyway.

>> Direct IO pins, on the other hand, are more transient. We can probably live
>> without tagging Direct IO pages as FOLL_PIN. I think.
>
> Yes. I think direct-IO writes should be able to just do a transient
> GUP, and if it causes a COW fault that isn't coherent, that's the
> correct semantics, I think (ie the direct-IO will see the original
> data, the COW faulter will get its own private copy to make changes
> to).
>
> I think pinning should be primarily limited to things that _require_
> coherency (ie you pin because you're going to do some active two-way
> communication using that page)
>
> Does that match your thinking?

Yes, perfectly. I'm going to update
Documentation/core-api/pin_user_pages.rst accordingly, once the dust
settles on these discussions.

thanks,
On Thu, Jan 7, 2021 at 2:14 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> > Literally just adding a " && page_mapcount(page) == 1" in there
> > (probably best done inside page_maybe_dma_pinned() itself)
>
> Well, that means that pages that are used for pinned DMA like this, can
> not be shared with other processes. Is that an acceptable limitation
> for the RDMA users? It seems a bit constraining, at first glance anyway.

Hmm, add a check for the page being PageAnon(), perhaps?

If it's a shared vma, then the page can be pinned shared with multiple
mappings, I agree.

So yeah, I didn't think it through entirely.. And maybe I'm still
missing something else..

            Linus
On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm, add a check for the page being PageAnon(), perhaps?
>
> If it's a shared vma, then the page can be pinned shared with multiple
> mappings, I agree.

Or check the vma directly for whether it's a COW vma. That's probably
a more obvious test, but would have to be done outside of
page_maybe_dma_pinned().

For example, in copy_present_page(), we've already done that COW-vma
test, so if we want to strengthen just _that_ test, then it would be
sufficient to just add a

    /* This cannot be a pinned page if it has more than one mapping */
    if (page_mappings(page) != 1)
        return 1;

to copy_present_page() along with the existing
page_maybe_dma_pinned() test.

No?

            Linus
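Put together, the strengthened fork-time test being discussed would look something like the following userspace sketch. The helper name fork_needs_early_copy() and the exact combination of checks are assumptions for illustration, not the final kernel code; the bias constant mirrors mm/gup.c.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

struct page {
    int refcount;   /* models page_count() */
    int mapcount;   /* models page_mapcount() */
    bool anon;      /* models PageAnon() */
};

static bool page_maybe_dma_pinned(const struct page *page)
{
    return page->refcount >= GUP_PIN_COUNTING_BIAS;
}

/* Sketch of the copy_present_page() decision under discussion: fork()
 * needs the early copy only for a page that might be an exclusively
 * mapped, pinned anon page.  More than one mapping, or a non-anon
 * page, rules that out. */
static bool fork_needs_early_copy(const struct page *page, bool mm_has_pinned)
{
    if (!mm_has_pinned)
        return false;
    if (!page->anon)
        return false;   /* shared mappings can be pinned shared */
    if (page->mapcount != 1)
        return false;   /* cannot be a pinned page with >1 mapping */
    return page_maybe_dma_pinned(page);
}
```

The mapcount and PageAnon checks keep the constraint off the shared-vma RDMA cases John raises, while still catching the exclusive pinned anon page.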
On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > The problem is it's not even possible to detect reliably if there's
> > really a long term GUP pin because of speculative pagecache lookups.
>
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
>
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
>
> So to me, this sounds like a "don't do that then" situation.
>
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
>
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
>
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
>
>     mm->has_pinned && page_maybe_dma_pinned(page)
>
> at least as the beginning of a heuristic.
>
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we
> know a pinned page always must have a page_mapcount() of 1" - since as
> part of pinning it and doing the GUP_PIN, we forced the COW, and then
> subsequent fork() operations enforce it too.
>
> So I do think that it might be possible to make that clear_refs code
> notice "this page is pinned, I can't mark it WP without the pinning
> coherency breaking".
>
> It might not even be hard. But admittedly I'm somewhat handwaving
> here, and I might not have thought of some situation.

I suppose the objective would be to detect when it's a transient pin
(such as an O_DIRECT write) and fail clear_refs with an error for all
other cases of real long term pins that need to keep reading at full
PCI bandwidth, without extra GUP invocations after the wp_page_copy
run.

Because the speculative lookups make the count unstable, it's not even
enough to use mmu notifier and never use FOLL_GET in GUP to make it
safe again (unlike what I mentioned in a previous email). Random
memory corruption will still silently materialize as a result of the
speculative lookups in the above scenario.

My whole point in starting this new thread, to suggest the page_count
in do_wp_page is an untenable solution, is that such a commit silently
broke every single long term PIN user that may be used in combination
with clear_refs since 2013.

Silent memory corruption undetected, or a detectable error out of
clear_refs, are both different side effects of the same technical ABI
break that rendered long term GUP pins fundamentally incompatible with
clear_refs, so whether it is detected or not, an ABI break it still
is.

I felt obliged to report there's something much deeper and
fundamentally incompatible between the page_count in do_wp_page and
any wrprotection of exclusive memory in place, as if used in
combination with any RDMA for example. The TLB flushing and the
mmap_read/write_lock were just the tip of the iceberg and they're not
the main concern anymore.

In addition, the inefficiency caused by the fact that the page_count
effect is multiplied 512 times (or 262144 times) while the mapcount
remains 4k granular makes me think the page_count is unsuitable to be
used there, and this specific cure with page_count in do_wp_page looks
worse than the initial zygote disease.

Thanks,
Andrea
On 1/7/21 2:24 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm, add a check for the page being PageAnon(), perhaps?
>>
>> If it's a shared vma, then the page can be pinned shared with multiple
>> mappings, I agree.
>
> Or check the vma directly for whether it's a COW vma. That's probably
> a more obvious test, but would have to be done outside of
> page_maybe_dma_pinned().
>
> For example, in copy_present_page(), we've already done that COW-vma
> test, so if we want to strengthen just _that_ test, then it would be
> sufficient to just add a
>
>     /* This cannot be a pinned page if it has more than one mapping */
>     if (page_mappings(page) != 1)
>         return 1;
>
> to copy_present_page() along with the existing page_maybe_dma_pinned() test.
>
> No?
>
> Linus

That approach makes me a lot happier, yes. Because it doesn't add
constraints to the RDMA cases, but still does what we need here.

thanks,
On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Random memory corruption will still silently materialize as result of
> the speculative lookups in the above scenario.

Explain.

Yes, you'll get random memory corruption if you keep doing wrprotect()
without mmap_sem held for writing.

But I thought we agreed earlier that that is a bug. And I thought the
softdirty code already got it for writing.

            Linus
On Thu, Jan 7, 2021 at 2:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I thought we agreed earlier that that is a bug. And I thought the
> softdirty code already got it for writing.

Ho humm. I had obviously not looked very much at that code. I had done
a quick git grep, but now that I look closer, it *does* get the
mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
then it does a mmap_write_downgrade().

So that's why I had looked more at the UFFD code, because that one was
the one I was aware of doing this all with just the read lock. I
_thought_ the softdirty code already took the write lock and wouldn't
race with page faults.

But I had missed that write_downgrade. So yeah, this code has the same
issue.

Anyway, the fix is - I think - the same I outlined earlier when I was
talking about UFFD: take the thing for writing, so that you can't
race.

The alternate fix remains "make sure we always flush the TLB before
releasing the page table lock, and make COW do the copy under the page
table lock". But I really would prefer to just have this code follow
all the usual rules, and if it does a write protect, then it should
take the mmap_sem for writing.

Why is that very simple rule so bad?

(And see my unrelated but incidental note on it being a good idea to
try to minimize latency by making sure we don't do any IO under the
mmap lock - whether held for reading _or_ writing. Because I do think
we can improve in that area, if you have some good test-case).

            Linus
On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Random memory corruption will still silently materialize as result of
> > the speculative lookups in the above scenario.
>
> Explain.
>
> Yes, you'll get random memory corruption if you keep doing wrprotect()
> without mmap_sem held for writing.

I didn't mean that.

> But I thought we agreed earlier that that is a bug. And I thought the
> softdirty code already got it for writing.

softdirty used mmap_read_lock too, but this again isn't relevant here,
and for the sake of discussion we can safely assume mmap_read_lock
doesn't exist in the kernel and everything takes the mmap_write_lock
whenever a mmap_lock is taken at all.

I mean: something bad will happen if a write happens but soft dirty
cannot register it because we didn't wrprotect the pte? Some dirty
page won't be transferred to the destination and it will be assumed
there was no soft dirty event for such page? Otherwise it would mean
that wrprotecting is simply optional for all pages under clear_refs?

Not doing the final TLB flush in softdirty caused some issue even when
there was no COW and the deferred flush only would delay the wrprotect
fault:

https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=RDJNU6vGF3hLO+j-g@mail.gmail.com/
https://lore.kernel.org/linux-mm/20210105221628.GA12854@willie-the-truck/

Skipping the wrprotection of the pte because of a speculative
pagecache lookup elevating a random page_count would, from the
userland point of view, I guessed, behave like missing the final TLB
flush before clear_refs returns to userland, just worse.

Thanks,
Andrea
On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote:
> Ho humm. I had obviously not looked very much at that code. I had done
> a quick git grep, but now that I look closer, it *does* get the
> mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
> then it does a mmap_write_downgrade().
>
> So that's why I had looked more at the UFFD code, because that one was
> the one I was aware of doing this all with just the read lock. I
> _thought_ the softdirty code already took the write lock and wouldn't
> race with page faults.
>
> But I had missed that write_downgrade. So yeah, this code has the same
> issue.

I overlooked the same thing initially. It's only when I noticed it
also used mmap_read_lock that I figured the ad-hoc uffd-wp group lock
solution, despite being fully self contained thanks to the
handle_userfault() catcher for the uffd-wp bit in the pagetable,
wasn't worth it, since uffd-wp could always use whatever clear_refs
used to solve it.

> Anyway, the fix is - I think - the same I outlined earlier when I was
> talking about UFFD: take the thing for writing, so that you can't
> race.

Sure.

> The alternate fix remains "make sure we always flush the TLB before
> releasing the page table lock, and make COW do the copy under the page
> table lock". But I really would prefer to just have this code follow

The copy under PT lock isn't enough. Flush TLB before releasing is
enough of course.

Note also the 2/2 patch that I sent:

https://lkml.kernel.org/r/20210107200402.31095-3-aarcange@redhat.com

2/2 would have been my preferred solution for both, and it works fine.
All corruption that was trivially reproducible with a heavy selftest
program in the kernel is all gone.

If the TLB pending issue were the only regression the page_count in
do_wp_page introduced, I would have never suggested we should
re-evaluate it. It'd be a good tradeoff in such case, even if it'd
penalize the soft-dirty runtime, especially if we were allowed to
deploy 2/2 as a non-blocking solution.

Until yesterday I fully intended to just propose 1/2 and 2/2, with a
whole different cover letter, CC stable, and close this case.

> all the usual rules, and if it does a write protect, then it should
> take the mmap_sem for writing.

The problem isn't about performance anymore; the problem is a silent
ABI break to long term PIN users attached to an mm under clear_refs.

> Why is that very simple rule so bad?
>
> (And see my unrelated but incidental note on it being a good idea to
> try to minimize latency by making sure we don't do any IO under the
> mmap lock - whether held for reading _or_ writing. Because I do think
> we can improve in that area, if you have some good test-case).

That would be great indeed.

Thanks,
Andrea
On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> > The alternate fix remains "make sure we always flush the TLB before
> > releasing the page table lock, and make COW do the copy under the page
> > table lock". But I really would prefer to just have this code follow
>
> The copy under PT lock isn't enough.
>
> Flush TLB before releasing is enough of course.

Right. That's why I said "and". You need both, afaik.

But if we just do the mmap write lock, you need neither - then you
just need to flush before you release the write lock.

> Note also the 2/2 patch that I sent:

Yes, yes, and that's what I'm objecting to. All these stupid games
with "flush_pending" counts are complete garbage. They all come from
the fact that this code doesn't hold the right lock.

I don't understand you: in one breath you seem to say "yes, taking the
write lock is the right thing to do", and then in the next one you
point to this patch that adds all this garbage *because* it's not
holding the write lock.

All of those "tlb_flush_pending" things are wrong. They should not
exist. The code in clear_refs_write() should hold the mmap_sem for
writing, and do the TLB flush before it releases the mmap sem, and
then it *cannot* race with the page faults.

See what I'm saying?

I refuse to apply your patch 2/2, because it all seems entirely wrong.

What's doubly ludicrous about that is that the code already _took_ the
mmap_sem for writing, and spent extra cycles to downgrade it -
INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't
even do anything expensive inside that (now downgraded) region, so the
downgrading was (a) buggy, (b) slower than just keeping the lock the
way it had been, and (c) pointless, because it had already done the
expensive part (which was to get the lock in the first place).

Just as an example, the whole "Rollback wrprotect_tlb_flush_pending"
is all because it got the lock - again wrongly - as a read-lock
initially, then it says "oh, I need to get a write lock", releases it,
re-takes it as a write lock, does a tiny amount of work, and then -
again incorrectly - downgrades it to a read-lock.

To make it even worse (if that is possible) it actually had YET
ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap
sem for writing, did its thing, and then released it.

So there's like *four* different locking mistakes in that single
function.

And it's not even an important function to begin with. It should just
have done a single

    mmap_write_lock_killable(mm);
    ...
    mmap_write_unlock(mm);

around the whole thing, instead of _any_ of that crazy stuff.

That code is WRONG.

And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is
that magic fs/proc/ interface allowed to get VM internals so wrong,
and make things so much worse?

Can you not see why I'm arguing with you?

Please. Why is the correct patch not the attached one (apart from the
obvious fact that I haven't tested it and maybe just screwed up
completely - but you get the idea)?

            Linus
On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> Please. Why is the correct patch not the attached one (apart from the
> obvious fact that I haven't tested it and maybe just screwed up
> completely - but you get the idea)?

It certainly looks simple and correct to me, although it means we're
now taking the mmap sem for write in the case where we only want to
clear the access flag, which should be fine with the thing only held
for read, no?

Will
Hello everyone,

On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote:
> On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> > Please. Why is the correct patch not the attached one (apart from the
> > obvious fact that I haven't tested it and maybe just screwed up
> > completely - but you get the idea)?
>
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

I'm curious, would you also suggest that fixing just the TLB flushing
symptom is enough, so that we can forget about the ABI break coming
from page_count being used in do_wp_page? One random example:
clear_refs will still break all long term GUP pins, are you ok with
that too?

page_count in do_wp_page is a fix for the original security issue from
vmsplice (where the child fools the parent into taking the exclusive
page in do_wp_page) that appears worse than the bug itself.

page_count in do_wp_page, instead of isolating as malicious only the
case where the parent is reusing the page queued in the vmsplice pipe,
also treats as malicious all legit cases that had to reliably reuse
the page to avoid the secondary MMUs going out of sync.

page_count in do_wp_page is like a credit card provider blocking all
credit cards of all customers, because one credit card may have been
cloned (by vmsplice), but nobody can know which one it was. Of course
this technique will work perfectly as a security fix, because it
treats all credit card users as malicious and blocks them all ("block"
as in preventing re-use of the anon page).

The problem is those other credit card users that weren't malicious
and get their COW broken too. Those are the very long term GUP pins,
if any anon page can still be wrprotected anywhere in the VM.
At the same time the real low-hanging fruit (vmsplice) that, if taken
care of, would have rendered the bug purely theoretical in security
terms, hasn't been fixed yet, despite those unprivileged long term GUP
pins cause more reproducible security issues than just the COW, since
they can still DoS the OOM killer and they bypass at least the mlock
enforcement, even for non compound pages.

Of course just fixing vmsplice to require some privilege won't fix the
bug in full, so it's not a suitable long term solution, but it has to
happen orthogonally for other reasons, and it'd at least remove the
short term security concern.

In addition you're not experiencing the full fallout of the side
effects of page_count being used to decide whether to re-use all anon
COW pages, because the bug is still there (with enterprise default
config options at least). Not all credit cards are blocked yet with
only 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you
block them all will you experience all the side effects of replacing
the per-subpage finegrained mapcount with the compound-wide page
count.

The two statements above, combined, result in my recommendation at
this point to resolve this in userland, by rendering the security
issue theoretical: either by removing vmsplice from the OCI schema
allowlist, or by enforcing its fix in userland by always using execve
after drop privs (as crun always does when it starts the container of
course).

For the long term, I can't see how using page_count in do_wp_page is a
tenable proposition, unless we either drop all secondary MMUs from the
kernel, or VM features like clear_refs are dropped, or the page_count
is magically stabilized and the speculative pagecache lookups are also
dropped. If trying to manage the fallout by enforcing no anon page can
ever be wrprotected in place (i.e.
dropping the clear_refs feature, or rendering it unreliable by
skipping elevated counts caused by spurious pagecache lookups), it'd
still sound like a design too fragile and too prone to break to rely
on.

There's random arch stuff wrprotecting memory too; even vm86 does it
under the hood (vm86 is unlikely to have a long term GUP pin on its
memory of course, but still, who knows?). I mean the VM core cannot
make assumptions like: "this vm86 case can still wrprotect without
worry because probably vm86 isn't used anymore with any advanced
secondary MMU, so if there's a GUP pin it probably is a malicious
vmsplice and not a RDMA or GPU or Virtualization secondary MMU".

Then there's the secondary concern of the inefficiency it introduces,
with extra unnecessary copies when a single GUP pin will prevent reuse
of 512 or 262144 subpages, in the 512 case also potentially mapped in
different processes.

The TLB flushing discussion registers as the last concern in my view.

Thanks,
Andrea
On Fri, Jan 8, 2021 at 4:48 AM Will Deacon <will@kernel.org> wrote:
>
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

When I was looking at that code, I was thinking that the whole
function should be split up to get rid of some of the indentation and
the "goto out_mm".

And yes, it would probably be good to split it up even more than that
"initial mm lookup and error handling", and have an actual case
statement for the different clear_ref 'type' cases. And then it would
be fairly simple and clean to say "this case only needs the mmap_sem
for read, that case needs it for write".

So I don't disagree, but I think it should be a separate patch - if it
even matters. Is this strange /proc case something that is even
commonly done?

               Linus
On Fri, Jan 8, 2021 at 8:14 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> page_count in do_wp_page is a fix for the original security issue

Not just that.

page_count() is simply the right and efficient thing to do.

You talk about all these theoretical inefficiencies for cases like
zygote and page pinning, which have never ever been seen except as a
possible attack vector.

Stop talking about irrelevant things. Stop trying to "optimize" things
that never happen and don't matter.

Instead, what matters is the *NORMAL* VM flow.

Things like COW.

Things like "oh, now that we check just the page count, we don't even
need the page lock for the common case any more".

> For the long term, I can't see how using page_count in do_wp_page is a
> tenable proposition,

I think you should re-calibrate your expectations, and accept that
page_count() is the right thing to do, and live with it.

And instead of worrying about irrelevant special-case code, start
worrying about the code that gets triggered tens of thousands of times
a second, on regular loads, without anybody doing anything odd or
special at all, just running plain and normal shell scripts or any
other normal traditional load.

Those irrelevant special cases should be simple and work, not badly
optimized to the point where they are buggy. And they are MUCH LESS
IMPORTANT than the normal VM code, so if somebody does something odd,
and it's slow, then that is the problem for the _odd_ case, not for
the normal codepaths.

This is why I refuse to add crazy new special cases to core code. Make
the rules simple and straightforward, and make the VM code work well.

               Linus
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
>
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not?
Because in current upstream it's not fixed currently using the
enterprise default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
>
> Instead, what matters is the *NORMAL* VM flow.
>
> Things like COW.
>
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
>
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
>
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
>
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory?

Or irrelevant special case as in: causing secondary MMUs to hit silent
data loss if a pte is ever wrprotected (arch code, vm86, whatever, all
under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
>
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
>
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the VM code work well.
New special cases? Which new cases? There's nothing new here besides
the zygote, that wasn't fully fixed with
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 and is actually the only new
case I can imagine where page_count actually isn't a regression.

All the old cases that you seem to refer to as irrelevant are in
production in v4.18; I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with
COWs happening in different processes would forever hit wp_copy_page,
since the count is always > 1 despite the mapcount being 1 for all
subpages. A simple app doing fork/exec would forever copy all memory
in the parent even after the exec is finished.

Thanks,
Andrea
On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Do you intend to eventually fix the zygote vmsplice case or not?
> Because in current upstream it's not fixed currently using the
> enterprise default config.

Is this the hugepage case? Neither of your patches actually touched
that, so I've forgotten the details.

> Irrelevant special case as in: long term GUP pin on the memory?

Irrelevant special case in that

 (a) an extra COW shouldn't be a correctness issue unless somebody
does something horribly wrong (and obviously the code that hasn't
taken the mmap_lock for writing are then examples of that)

and

 (b) it's not a performance issue either unless you can find a real
load that does it.

Hmm?

               Linus
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
>
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and
uffd-wp, so I didn't actually try to fix the hugepage case by adding
the page_count checks there too. I could try to do that, at least it'd
be consistent, but I'd still try to find an alternate solution later.

> > Irrelevant special case as in: long term GUP pin on the memory?
>
> Irrelevant special case in that
>
>  (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
>
> and
>
>  (b) it's not a performance issue either unless you can find a real
> load that does it.
>
> Hmm?

For b) I don't have a hard time imagining `ps` hanging for seconds if
clear_refs is touched on a 4T mm, but b) is not the main concern.

Having to rely on a) is the main concern, and it's not about TLB
flushes but the long term GUP pins.

Thanks,
Andrea
On Thu 07-01-21 13:53:18, John Hubbard wrote:
> On 1/7/21 1:29 PM, Linus Torvalds wrote:
> > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> > >
> > > The problem is it's not even possible to detect reliably if there's
> > > really a long term GUP pin because of speculative pagecache lookups.
> >
> > So none of the normal code _needs_ that any more these days, which is
> > what I think is so nice. Any pinning will do the COW, and then we have
> > the logic to make sure it stays writable, and that keeps everything
> > nicely coherent and is all fairly simple.
> >
> > And yes, it does mean that if somebody then explicitly write-protects
> > a page, it may end up being COW'ed after all, but if you first pinned
> > it, and then started playing with the protections of that page, why
> > should you be surprised?
> >
> > So to me, this sounds like a "don't do that then" situation.
> >
> > Anybody who does page pinning and wants coherency should NOT TOUCH THE
> > MAPPING IT PINNED.
> >
> > (And if you do touch it, it's your own fault, and you get to keep both
> > of the broken pieces)
> >
> > Now, I do agree that from a QoI standpoint, it would be really lovely
> > if we actually enforced it. I'm not entirely sure we can, but maybe it
> > would be reasonable to use that
> >
> >    mm->has_pinned && page_maybe_dma_pinned(page)
> >
> > at least as the beginning of a heuristic.
> >
> > In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> > made stronger than it is. Because at *THAT* point, we might say "we
>
> What exactly did you have in mind, to make it stronger? I think the
> answer is in this email but I don't quite see it yet...
>
> Also, now seems to be a good time to mention that I've been thinking about
> a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer
> to peer with GUP/NIC, and HMM support for atomic operations from a device).
> And it seems like the following approach would help:
>
> * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks
>   to Jason for this point) means "user space owns the lifetime". We might
>   even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would
>   make them mean the same thing. The idea is that there are no "short term"
>   pins of this kind of memory.
>
> * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans,
>   because several of us had thought that Direct IO needs FOLL_PIN. However, this
>   recent conversation, plus my list of cases above, seems to indicate otherwise.
>   That's because we only have one refcount approach for marking pages in this way,
>   and we should spend it on the long-term pinned pages. Those are both hard to
>   identify otherwise, and actionable once we identify them.

Somewhat late to the game but I disagree here. I think direct IO still
needs FOLL_PIN so that page_may_be_dma_pinned() returns true for it. At
least for shared pages. Because filesystems/mm in the writeback path need
to detect whether the page is pinned and thus its contents can change
anytime without noticing, the page can be dirtied at random times etc. In
that case we need to bounce the page during writeback (to avoid checksum
failures), keep the page as dirty in internal filesystem bookkeeping (and
in MM as well) etc...

                                                               Honza
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a127262ba517..e75cb135db02 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1235,8 +1235,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	}
 	if (type == CLEAR_REFS_SOFT_DIRTY) {
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!(vma->vm_flags & VM_SOFTDIRTY))
+			struct vm_area_struct *tmp;
+			if (!(vma->vm_flags & VM_SOFTDIRTY)) {
+				inc_wrprotect_tlb_flush_pending(vma);
 				continue;
+			}
+
+			/*
+			 * Rollback wrprotect_tlb_flush_pending before
+			 * releasing the mmap_lock.
+			 */
+			for (tmp = mm->mmap; tmp != vma;
+			     tmp = tmp->vm_next)
+				dec_wrprotect_tlb_flush_pending(tmp);
+
 			mmap_read_unlock(mm);
 			if (mmap_write_lock_killable(mm)) {
 				count = -EINTR;
@@ -1245,6 +1257,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				vma->vm_flags &= ~VM_SOFTDIRTY;
 				vma_set_page_prot(vma);
+				inc_wrprotect_tlb_flush_pending(vma);
 			}
 			mmap_write_downgrade(mm);
 			break;
@@ -1260,6 +1273,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
+			for (vma = mm->mmap; vma; vma = vma->vm_next)
+				dec_wrprotect_tlb_flush_pending(vma);
 			dec_tlb_flush_pending(mm);
 		}
 		mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..caa1d9a71cb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,51 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+/*
+ * NOTE: the mmap_lock must be held and it cannot be released at any
+ * time in between inc_wrprotect_tlb_flush_pending() and
+ * dec_wrprotect_tlb_flush_pending().
+ *
+ * This counter has to be elevated before taking the PT-lock to
+ * wrprotect pagetables, if the TLB isn't flushed before the
+ * PT-unlock.
+ *
+ * The only reader is the page fault, so this has to be elevated (in
+ * addition to the mm->tlb_flush_pending) only when the mmap_read_lock
+ * is taken instead of the mmap_write_lock (otherwise the page fault
+ * couldn't run concurrently in the first place).
+ *
+ * This doesn't need to be elevated when clearing pagetables, even if
+ * only holding the mmap_read_lock (as in MADV_DONTNEED). The page
+ * fault doesn't risk accessing the data of the page that is still
+ * under tlb-gather deferred flushing, if the pagetable is none,
+ * because the pagetable doesn't point to it anymore.
+ *
+ * This counter is read more specifically by the page fault when it
+ * has to issue a COW that doesn't result in page re-use, because of
+ * the lack of stability of the page_count (vs speculative pagecache
+ * lookups) or because a GUP pin exists on an exclusive and writable
+ * anon page.
+ *
+ * If this counter is elevated and the pagetable is wrprotected (as
+ * in !pte/pmd_write) and present, it means the page may still be
+ * modified by userland through a stale TLB entry that was
+ * instantiated before the wrprotection. In such case the COW fault,
+ * if it decides not to re-use the page, will have to either wait for
+ * this counter to return to zero, or it needs to flush the TLB before
+ * proceeding to copy the page.
+ */
+static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_inc(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) <= 0);
+}
+
+static inline void dec_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_dec(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..e3f412c43c30 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -369,6 +369,11 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	/*
+	 * When elevated, it indicates that a deferred TLB flush may
+	 * be pending after a pagetable write protection on this vma.
+	 */
+	atomic_t wrprotect_tlb_flush_pending;
 } __randomize_layout;
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index 37720a6d04ea..7a658c608f3a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -365,6 +365,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
+		VM_WARN_ON_ONCE(atomic_read(&new->wrprotect_tlb_flush_pending));
 	}
 	return new;
 }
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..e8e407443119 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2809,6 +2809,73 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	count_vm_event(PGREUSE);
 }
 
+/*
+ * This synchronization helper, by the time it returns, has to enforce
+ * there cannot be stale writable TLB entries left, on any page mapped
+ * as wrprotected in the pagetables in this vma range.
+ *
+ * This is normally used only by the COW fault immediately before
+ * copying the page to make it proof against stale TLB entries (as the
+ * one pictured below in CPU 2).
+ *
+ *   CPU 0                 CPU 1                                   CPU 2
+ *   -----                 -----                                   -----
+ *                                                                 writable TLB instantiated
+ *                         mmap_lock_read
+ *                         inc_wrprotect_tlb_flush_pending()
+ *                         PT_lock
+ *                         wrprotect the pte
+ *                         PT unlock
+ *   mmap_lock_read
+ *   PT_lock
+ *   vmf->orig_pte = pte
+ *   do_wp_page()
+ *   PT_unlock
+ *   wp_page_copy()
+ *   sync_wrprotect_tlb_flush_pending()
+ *   found wrprotect_tlb_flush_pending elevated
+ *   flush_tlb_page()
+ *                                                                 writable TLB invalidated [1]
+ *   kret of sync_wrprotect_tlb_flush_pending()
+ *   cow_user_page() [2]
+ *
+ * The whole objective of the wrprotect_tlb_flush_pending atomic
+ * counter is to enforce [1] happens before [2] in the above sequence.
+ *
+ * Without having to alter the caller of this helper, it'd also be
+ * possible to replace the flush_tlb_page with a wait for the
+ * wrprotect_tlb_flush_pending counter to return to zero, using the
+ * same logic as above. In such case the point [1] would be replaced
+ * by dec_wrprotect_tlb_flush_pending() happening in CPU 1.
+ *
+ * In terms of memory ordering guarantees: all page payload reads of a
+ * page mapped by a wrprotected pagetable, executed after this
+ * function returns, must not be allowed to be reordered before the
+ * read of the wrprotect_tlb_flush_pending atomic counter at the start
+ * of the function. So this function has to provide acquire semantics
+ * to those page payload reads.
+ */
+static __always_inline
+void sync_wrprotect_tlb_flush_pending(struct vm_area_struct *vma,
+				      unsigned long address)
+{
+	int val = atomic_read(&vma->wrprotect_tlb_flush_pending);
+	if (val) {
+		/*
+		 * flush_tlb_page() needs to deliver acquire semantics
+		 * implicitly. Archs using IPIs to flush remote TLBs
+		 * provide those with csd_lock_wait().
+		 */
+		flush_tlb_page(vma, address);
+	} else {
+		/*
+		 * Prevent the read of the wrprotect page payload to be
+		 * reordered before the above atomic_read().
+		 */
+		smp_rmb();
+	}
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2849,6 +2916,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (!new_page)
 		goto oom;
 
+	sync_wrprotect_tlb_flush_pending(vma, vmf->address);
+
 	if (!cow_user_page(new_page, old_page, vmf)) {
 		/*
 		 * COW failed, if the fault was solved by other,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..6b7a52662de8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -335,6 +335,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		inc_wrprotect_tlb_flush_pending(vma);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -346,6 +348,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		dec_wrprotect_tlb_flush_pending(vma);
 	dec_tlb_flush_pending(mm);
 
 	return pages;
NOTE: the "Fixes" tag used here is to optimize the backporting, but
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is completely correct.
Despite being correct, it happened to uncover some implicit assumption
some other code made on a very specific behavior in do_wp_page that
had to be altered by such commit.

The page_mapcount is always guaranteed == 1 for an exclusive anon
page, so when it was used to decide if an exclusive page under
wrprotection could be reused (as in wp_page_reuse), the outcome would
always come true.

The page_mapcount had to be replaced with the page_count because it
couldn't account for GUP pins, so after that change, for the first
time, wp_page_copy can now be called also for exclusive anon pages
that are undergoing wrprotection.

Even then everything is still fine for all cases that wrprotect with
the mmap_write_lock, since the COW faults cannot run concurrently in
such case.

However there are two cases that could wrprotect exclusive anon pages
with only the mmap_read_lock: soft dirty clear_refs_write and
UFFDIO_WRITEPROTECT. Both of them would benefit from keeping their
wrprotection runtime scalable and from keeping using the
mmap_read_lock without having to switch to the mmap_write_lock.

To stick to the mmap_read_lock, for both UFFDIO_WRITEPROTECT and
clear_refs_write we need to handle the new reality that there can be
COWs (as in wp_page_copy) happening on exclusive anon pages that are
under wrprotection, but with the respective TLB flush still deferred.

An example of the problematic UFFDIO_WRITEPROTECT runtime is shown
below.
  CPU0                                            CPU 1                  CPU 2
  ------                                          --------               -------
  userfaultfd_wrprotect(mode_wp = true)
  PT lock
  atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
  PT unlock

                                                  do_page_fault FAULT_FLAG_WRITE
  userfaultfd_wrprotect(mode_wp = false)
  PT lock
  ATOMIC clear _PAGE_UFFD_WP <- problem
  /* _PAGE_WRITE not set */
  PT unlock
  XXXXXXXXXXXXXX BUG RACE window open here

                                                  PT lock
                                                  FAULT_FLAG_WRITE is set by CPU
                                                  _PAGE_WRITE is still clear in pte
                                                  PT unlock
                                                  wp_page_copy
                                                  cow_user_page runs with stale TLB

  deferred tlb flush <- too late
  XXXXXXXXXXXXXX BUG RACE window close here

userfaultfd_wrprotect(mode_wp = true) is never a problem because as
long as the uffd-wp flag is set in the pte/hugepmd, the page fault is
guaranteed to reach a dead end in handle_userfault().

The window for uffd-wp not to be set while the pte has been
wrprotected but the TLB flush is still pending, is opened when
userfaultfd_wrprotect(mode_wp = false) releases the PT-lock as shown
above, and it closes when the first deferred TLB flush runs. If
do_wp_page->wp_copy_page runs within such window, some userland writes
can get lost in the copy and they can end up in the original page that
gets discarded.

The soft dirty runtime is similar and would be like below:

  CPU0                      CPU 1                                 CPU 2
  ------                    --------                              -------
                                                                  instantiate writable TLB
  clear_refs_write
  PT lock
  pte_wrprotect
  PT unlock

                            do_page_fault FAULT_FLAG_WRITE
                            PT lock
                            FAULT_FLAG_WRITE is set by CPU
                            _PAGE_WRITE is still clear in pte
                            PT unlock
                            wp_page_copy
                            cow_user_page...
                                                                  writes through the TLB
                            ...cow_user_page

So to handle this race a wrprotect_tlb_flush_pending atomic counter is
added to the vma. This counter needs to be elevated while holding the
mmap_read_lock, before taking the PT lock to wrprotect the pagetable,
and it can only be decreased after the deferred TLB flush is complete.
This way the page fault can trivially serialize against pending TLB
flushes using a new helper: sync_wrprotect_tlb_flush_pending().
Testing with the userfaultfd selftest is showing 100% reproducible mm
corruption with writes getting lost, before this commit.

$ ./userfaultfd anon 100 100
nr_pages: 25600, nr_pages_per_cpu: 3200
bounces: 99, mode: rnd racing read, userfaults: 773 missing
(215+205+58+114+72+85+18+6), 9011 wp
(1714+1714+886+1227+1009+1278+646+537)
[..]
bounces: 72, mode: poll, userfaults: 720 missing
(187+148+102+49+92+103+24+15), 9885 wp
(1452+1175+1104+1667+1101+1365+913+1108)
bounces: 71, mode: rnd racing ver read, page_nr 25241 memory corruption 6 7

After the commit the userland memory corruption is gone as expected.

Cc: stable@kernel.org
Reported-by: Nadav Amit <namit@vmware.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/proc/task_mmu.c       | 17 +++++++++-
 include/linux/mm.h       | 46 +++++++++++++++++++++++++++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c            |  1 +
 mm/memory.c              | 69 ++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c            |  4 +++
 6 files changed, 141 insertions(+), 1 deletion(-)