Message ID | 6e8deda970b982e1e8ffd876e3cef342c292fbb5.1729715266.git.lorenzo.stoakes@oracle.com (mailing list archive) |
---|---|
State | New |
Series | fix error handling in mmap_region() and refactor |
On Wed, Oct 23, 2024 at 09:38:29PM +0100, Lorenzo Stoakes wrote:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.

Today's pending-fixes is showing a fairly large set of failures in the
arm64 MTE selftests on all the platforms that have MTE (currently just
the software ones). Bisection points at this change, which is
0967bf7fbd0e0 in -next, which seems plausible, but I didn't investigate
in any meaningful detail. There's nothing particularly instructive in
the test logs, just plain reports that the tests failed:

# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
# ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 19 Check initial tags with shared mapping, sync error mode and mmap memory
# ok 20 Check initial tags with shared mapping, sync error mode and mmap/mprotect memory
# # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 42 selftests: arm64: check_buffer_fill # exit=1

(and more, mainly on mmap related things). A full log for a sample run
on the FVP can be seen at:

   https://lava.sirena.org.uk/scheduler/job/901638#L3693

and one from qemu here:

   https://lava.sirena.org.uk/scheduler/job/901630#L3031

Both of these logs include links to filesystem/firmware images and
command lines to run the model.
Bisects converge cleanly (there's some random extra good commits logged
at the start as my tooling feeds test results it already has on hand
between the good and bad commits into the bisect):

# bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [ea1fda89f5b23734e10c62762990120d5ae23c43] Merge tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# good: [6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9] ASoC: Intel: sst: Support LPE0F28 ACPI HID
# good: [2feb023110843acce790e9089e72e9a9503d9fa5] regulator: rtq2208: Fix uninitialized use of regulator_config
# good: [0107f28f135231da22a9ad5756bb16bd5cada4d5] ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
# good: [25f00a13dccf8e45441265768de46c8bf58e08f6] spi: spi-fsl-dspi: Fix crash when not using GPIO chip select
# good: [032532f91a1d06d0750f16c49a9698ef5374a68f] ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work()
# good: [d48696b915527b5bcdd207a299aec03fb037eb17] ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec
# good: [d0ccf760a405d243a49485be0a43bd5b66ed17e2] spi: geni-qcom: Fix boot warning related to pm_runtime and devres
# good: [f2b5b8201b1545ef92e050735e9c768010d497aa] spi: mtk-snfi: fix kerneldoc for mtk_snand_is_page_ops()
# good: [b5a468199b995bd8ee3c26f169a416a181210c9e] spi: stm32: fix missing device mode capability in stm32mp25
git bisect start '6560005f01c3c14aab4c2ce35d97b75796d33d81' 'ea1fda89f5b23734e10c62762990120d5ae23c43' '6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9' '2feb023110843acce790e9089e72e9a9503d9fa5' '0107f28f135231da22a9ad5756bb16bd5cada4d5' '25f00a13dccf8e45441265768de46c8bf58e08f6' '032532f91a1d06d0750f16c49a9698ef5374a68f' 'd48696b915527b5bcdd207a299aec03fb037eb17' 'd0ccf760a405d243a49485be0a43bd5b66ed17e2' 'f2b5b8201b1545ef92e050735e9c768010d497aa' 'b5a468199b995bd8ee3c26f169a416a181210c9e'
# bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect bad 6560005f01c3c14aab4c2ce35d97b75796d33d81
# bad: [4a2901b5d394f58cdc60bc25e32c381bb2b83891] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git
git bisect bad 4a2901b5d394f58cdc60bc25e32c381bb2b83891
# bad: [4093d34d740447b23a1ea916dabcf902aa767812] Merge branch 'fs-current' of linux-next
git bisect bad 4093d34d740447b23a1ea916dabcf902aa767812
# bad: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
git bisect bad 0967bf7fbd0e03cee0525035762150a91ba1bb7c
# good: [633e7df6cfdf97f8acf2a59fbfead01e31d0e492] tools: testing: add expand-only mode VMA test
git bisect good 633e7df6cfdf97f8acf2a59fbfead01e31d0e492
# good: [315add1ace71306a7d8518fd417466d938041ff1] mseal: update mseal.rst
git bisect good 315add1ace71306a7d8518fd417466d938041ff1
# good: [bcbb8b25ab80347994e33c358481e65f95f665fd] mm: fix PSWPIN counter for large folios swap-in
git bisect good bcbb8b25ab80347994e33c358481e65f95f665fd
# good: [8438cf67b86bf8c966f32612a7e12b2eb910396b] mm: unconditionally close VMAs on error
git bisect good 8438cf67b86bf8c966f32612a7e12b2eb910396b
# good: [a220e219d89c2d574ad9ffda627575e11334fede] mm: refactor map_deny_write_exec()
git bisect good a220e219d89c2d574ad9ffda627575e11334fede
# first bad commit: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
On Mon, Oct 28, 2024 at 06:29:36PM +0000, Mark Brown wrote:
> On Wed, Oct 23, 2024 at 09:38:29PM +0100, Lorenzo Stoakes wrote:
> > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > control flow and numerous means by which issues can arise and incomplete
> > state, memory leaks and other unpleasantness can occur.
>
> Today's pending-fixes is showing a fairly large set of failures in the
> arm64 MTE selftests on all the platforms that have MTE (currently just
> the software ones). Bisection points at this change which is
> 0967bf7fbd0e0 in -next which seems plausible but I didn't investigate in
> any meaningful detail. There's nothing particularly instructive in the
> test logs, just plain reports that the tests failed:

Ugh yep ok. Thanks for the report; this is likely then because MTE relies
in some way on merge behaviour or the ->mmap() hook in an unfortunate way
that we haven't accounted for here.

Bad time for my arm64 qemu to be broken :)

Would it be possible for you to assist me with investigating this a
little, as you have things pretty well set up?

On these memory allocation failures, could you tell me what errno is?
Could you check dmesg for anything strange?

> # # FAIL: mmap allocation

Interesting that it MAP_FAIL's though. This could be
arch_validate_flags() being moved around.

Could you do me a further favour then and try a kernel at this commit
with:

	/* Allow architectures to sanity-check the vm_flags. */
	if (!arch_validate_flags(vm_flags))
		return -EINVAL;

in mmap_region() commented out?

That and the errno would be hugely useful information, thank you!

Wondering if somehow the driver hook changes flags in a way that makes
arch_validate_flags() pass, but not with the original flags.
OK, looking at the source I'm 99% certain it's the move of this check, as
arm64 in its hook for this does:

	/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
	return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);

So likely the hook on your mapping changes flags to set
VM_MTE | VM_MTE_ALLOWED and expects this to be checked after (ugh).

Thanks!

> # # FAIL: memory allocation
> # not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
> # ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
> # # FAIL: mmap allocation
> # # FAIL: memory allocation
> # not ok 19 Check initial tags with shared mapping, sync error mode and mmap memory
> # ok 20 Check initial tags with shared mapping, sync error mode and mmap/mprotect memory
> # # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
> not ok 42 selftests: arm64: check_buffer_fill # exit=1
>
> (and more, mainly on mmap related things). A full log for a sample run
> on the FVP can be seen at:
>
>    https://lava.sirena.org.uk/scheduler/job/901638#L3693
>
> and one from qemu here:
>
>    https://lava.sirena.org.uk/scheduler/job/901630#L3031
>
> Both of these logs include links to filesystem/firmware images and
> command lines to run the model.
> Bisects converge cleanly (there's some random extra good commits logged
> at the start as my tooling feeds test results it already has on hand
> between the good and bad commits into the bisect):

[...]

> # first bad commit: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> expects this to be checked after (ugh).

Gaah. Yes. mm/shmem.c: shmem_mmap() does

	/* arm64 - allow memory tagging on RAM-based files */
	vm_flags_set(vma, VM_MTE_ALLOWED);

and while I found the equivalent hack for the VM_SPARC_ADI case, I
hadn't noticed that MTE thing. How very annoying.

So the arch_validate_flags() case does need to be done after the
->mmap() call.

How about just finalizing everything, and then doing a regular munmap()
afterwards and returning an error (all still holding the mmap semaphore,
of course)?

That still avoids the whole "partially completed mmap" case.

          Linus
On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> > expects this to be checked after (ugh).
>
> Gaah. Yes. mm/shmem.c: shmem_mmap() does
>
>	/* arm64 - allow memory tagging on RAM-based files */
>	vm_flags_set(vma, VM_MTE_ALLOWED);
>
> and while I found the equivalent hack for the VM_SPARC_ADI case, I
> hadn't noticed that MTE thing. How very annoying.
>
> So the arch_validate_flags() case does need to be done after the ->mmap() call.
>
> How about just finalizing everything, and then doing a regular
> munmap() afterwards and returning an error (all still holding the mmap
> semaphore, of course).
>
> That still avoids the whole "partially completed mmap" case.
>
> Linus

Yeah, I was thinking the same... just bite the bullet, go through the
whole damn process and revert if arch_validate_flags() chokes. It also
removes the ugly #ifdef CONFIG_SPARC64 hack...

This will literally only be applicable for these two cases and
(hopefully) most of the time you'd not fail it.

I mean, by then it'll be added into the rmap and such, but nothing will
be populated yet, and we shouldn't be able to fault as vma_start_write()
should have incremented the vma lock seqnum.

Any issues from the RCU visibility stuff, Liam? Any security problems,
Jann...?

It'd suck to have to bring back a partial complete case. Though I do
note we handle errors from mmap_file() ok, so we could still potentially
handle that there, but it would sort of semi-undo some of the point of
the series.
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 15:14]:
> On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:

[...]

> > How about just finalizing everything, and then doing a regular
> > munmap() afterwards and returning an error (all still holding the mmap
> > semaphore, of course).
> >
> > That still avoids the whole "partially completed mmap" case.
>
> Yeah I was thinking the same... just bite the bullet, go through the whole damn
> process and revert if arch_validate_flags() chokes. It also removes the ugly
> #ifdef CONFIG_SPARC64 hack...
>
> This will literally only be applicable for these two cases and (hopefully) most
> of the time you'd not fail it.
>
> I mean by then it'll be added into the rmap and such but nothing will be
> populated yet and we shouldn't be able to fault as vma_start_write() should have
> incremented the vma lock seqnum.
>
> Any issues from the RCU visibility stuff Liam?

It is probably fine? We would see a mapping appear then disappear. We'd
have a (benign) race with rmap for truncating the PTEs (but it's safe).
Page faults would be stopped though.

Unfortunately, we'd have to write to the vma tree so that we could...
write to the vma tree. We'd have to somehow ensure munmap() is done with
a gfp flag to ensure no failures as well...

Maybe we should just call close on the vma again (and do whatever
call_mmap() needs to undo)?

> Any security problems Jann...?
>
> It'd suck to have to bring back a partial complete case. Though I do note we
> handle errors from mmap_file() ok so we could still potentially handle that
> there, but would sort of semi-undo some of the point of the series.
* Liam R. Howlett <Liam.Howlett@oracle.com> [241028 15:50]:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 15:14]:

[...]

> It is probably fine? We would see a mapping appear then disappear.
> We'd have a (benign) race with rmap for truncating the PTEs (but it's
> safe). Page faults would be stopped though.
>
> Unfortunately, we'd have to write to the vma tree so that we could...
> write to the vma tree. We'd have to somehow ensure munmap() is done
> with a gfp flag to ensure no failures as well...
>
> Maybe we should just call close on the vma again (and do whatever
> call_mmap() needs to undo)?

I take it back, that won't work.

> > Any security problems Jann...?
> >
> > It'd suck to have to bring back a partial complete case. Though I do note we
> > handle errors from mmap_file() ok so we could still potentially handle that
> > there, but would sort of semi-undo some of the point of the series.
On Mon, Oct 28, 2024 at 04:00:29PM -0400, Liam R. Howlett wrote:
> * Liam R. Howlett <Liam.Howlett@oracle.com> [241028 15:50]:

[...]

> > It is probably fine? We would see a mapping appear then disappear.
> > We'd have a (benign) race with rmap for truncating the PTEs (but it's
> > safe). Page faults would be stopped though.
> >
> > Unfortunately, we'd have to write to the vma tree so that we could...
> > write to the vma tree. We'd have to somehow ensure munmap() is done
> > with a gfp flag to ensure no failures as well...
> >
> > Maybe we should just call close on the vma again (and do whatever
> > call_mmap() needs to undo)?
>
> I take it back, that won't work.

I'm genuinely not opposed to a horrible, awful:

	#ifdef CONFIG_ARM64
	if (file && file->f_ops == shmem_file_operations)
		vm_flags |= VM_MTE_ALLOWED;
	#endif

early in the operation, prior to the arch_validate_flags() check. Just
to get this over the finish line (original idea credit to Vlastimil,
insane ugliness credit to me).

Could be in a __arch_workarounds() type function, heavily commented...

I mean, this is pretty gross, but...
On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> I'm genuinely not opposed to a horrible, awful:
>
> #ifdef CONFIG_ARM64
> if (file && file->f_ops == shmem_file_operations)
> vm_flags |= VM_MTE_ALLOWED;
> #endif
>
> Early in the operation prior to the arch_validate_flags() check.

I would just put it inside the arm64 code itself.

IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
arm64 arch_validate_flags() code do something like

	if (flags & VM_MTE) {
		if (file->f_ops != shmem_file_operations)
			return false;
	}

and be done with it.

Considering that we only have that horrendous arch_validate_flags() for
two architectures, and that they both just have magical special cases
for MTE-like behavior, I do think that just making it be a hack inside
those functions is the way to go.

          Linus
On Mon, Oct 28, 2024 at 10:22:32AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:

[...]

> Considering that we only have that horrendous arch_validate_flags()
> for two architectures, and that they both just have magical special
> cases for MTE-like behavior, I do think that just making it be a hack
> inside those functions is the way to go.
>
> Linus

Ah yeah, makes sense.

FWIW I just made a fix -for now- which implements it in the hideous way,
shown below.

We can maybe take that as a fix-patch for now and I can look at
replacing this tomorrow with something as you suggest properly.

My only concern is that arm people might not be happy and we get some
hold up here...

Thanks, Lorenzo

----8<----
From fb6c15c74ba0db57f18b08fc6d1e901676f25bf6 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 28 Oct 2024 20:36:49 +0000
Subject: [PATCH] mm: account for MTE in arm64 on mmap_region() operation

Correctly account for MTE on mmap_region(). We need to check this ahead
of the operation; the shmem mmap hook was doing it, but this is at a
point where a failure would mean we'd have to tear down a partially
installed VMA.

Avoid all this by adding a function to specifically handle this case.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c  | 20 ++++++++++++++++++++
 mm/shmem.c |  3 ---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8462de1ee583..83afa1ebfd75 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1575,6 +1575,24 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	return error;
 }

+/*
+ * We check VMA flag validity early in the mmap() process, however this can
+ * cause issues for arm64 when using MTE, which requires that it be used with
+ * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
+ * this operation.
+ *
+ * To avoid having to tear down a partially complete mapping we do this ahead of
+ * time.
+ */
+static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
+{
+	if (!IS_ENABLED(CONFIG_ARM64))
+		return vm_flags;
+
+	if (shmem_file(file))
+		return vm_flags | VM_MTE_ALLOWED;
+}
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 		struct list_head *uf)
@@ -1586,6 +1604,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (map_deny_write_exec(vm_flags, vm_flags))
 		return -EACCES;

+	vm_flags = arch_adjust_flags(file, vm_flags);
+
 	/* Allow architectures to sanity-check the vm_flags. */
 	if (!arch_validate_flags(vm_flags))
 		return -EINVAL;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4ba1d00fabda..e87f5d6799a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;

-	/* arm64 - allow memory tagging on RAM-based files */
-	vm_flags_set(vma, VM_MTE_ALLOWED);
-
 	file_accessed(file);
 	/* This is anonymous shared memory if it is unlinked at the time of mmap */
 	if (inode->i_nlink)
--
On Mon, Oct 28, 2024 at 06:57:25PM +0000, Lorenzo Stoakes wrote:
> On Mon, Oct 28, 2024 at 06:29:36PM +0000, Mark Brown wrote:
> > any meaningful detail. There's nothing particularly instructive in the
> > test logs, just plain reports that the tests failed:
>
> On these memory allocation failures, could you tell me what errno is? Could you
> check dmesg for anything strange?

Looks like this is mostly figured out already, but JFTR: as I said in
the report, there's nothing in the logs that I noticed; anything there
that I've missed should be in the linked logs. The errnos I'm seeing are
all:

   # mmap(): Invalid argument (22)

> > # # FAIL: mmap allocation
>
> Interesting that it MAP_FAIL's though. This could be arch_validate_flags() being
> moved around.
>
> Could you do me a further favour then and try a kernel at this commit with:
>
>	/* Allow architectures to sanity-check the vm_flags. */
>	if (!arch_validate_flags(vm_flags))
>		return -EINVAL;
>
> In mmap_region() commented out?

Unsurprisingly, given the above and the rest of the thread, commenting
out that check causes the affected tests to pass. I didn't check for any
additional impacts.
On 10/28/24 21:22, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> I'm genuinely not opposed to a horrible, awful:
>>
>> #ifdef CONFIG_ARM64
>> if (file && file->f_ops == shmem_file_operations)
>> vm_flags |= VM_MTE_ALLOWED;
>> #endif
>>
>> Early in the operation prior to the arch_validate_flags() check.
>
> I would just put it inside the arm64 code itself.
>
> IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
> arm64 arch_validate_flags() code do something like
>
> if (flags & VM_MTE) {
> if (file->f_ops != shmem_file_operations)
> return false;
> }
>
> and be done with it.

VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():

	if (system_supports_mte() && (flags & MAP_ANONYMOUS))
		return VM_MTE_ALLOWED;

And there's also this in arch/arm64/include/asm/page.h:

	#define VM_DATA_DEFAULT_FLAGS (VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)

So it would all need to be considered in the validation if we wanted to
replace VM_MTE_ALLOWED completely?

> Considering that we only have that horrendous arch_validate_flags()
> for two architectures, and that they both just have magical special
> cases for MTE-like behavior, I do think that just making it be a hack
> inside those functions is the way to go.
>
> Linus
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 16:43]:
> On Mon, Oct 28, 2024 at 10:22:32AM -1000, Linus Torvalds wrote:

[...]

> Ah yeah makes sense.
>
> FWIW I just made a fix -for now- which implements it in the hideous way,
> shown below.
>
> We can maybe take that as a fix-patch for now and I can look at replacing
> this tomorrow with something as you suggest properly.
>
> My only concern is that arm people might not be happy and we get some hold
> up here...

[...]

> +/*
> + * We check VMA flag validity early in the mmap() process, however this can
> + * cause issues for arm64 when using MTE, which requires that it be used with
> + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
> + * this operation.
> + *
> + * To avoid having to tear down a partially complete mapping we do this ahead of
> + * time.
> + */
> +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)

Is it worth adding an inline?

> +{
> +	if (!IS_ENABLED(CONFIG_ARM64))

CONFIG_ARM64_MTE.. otherwise VM_MTE_ALLOWED is 0, so it really doesn't
matter I guess.

> +		return vm_flags;
> +
> +	if (shmem_file(file))
> +		return vm_flags | VM_MTE_ALLOWED;

Would `if (VM_MTE_ALLOWED && shmem_file(file))` allow the pre-compiler
to remove some of this? Also probably doesn't matter much.

> +}
> +
>  unsigned long mmap_region(struct file *file, unsigned long addr,
>  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>  		struct list_head *uf)
> @@ -1586,6 +1604,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	if (map_deny_write_exec(vm_flags, vm_flags))
>  		return -EACCES;
>
> +	vm_flags = arch_adjust_flags(file, vm_flags);
> +
>  	/* Allow architectures to sanity-check the vm_flags. */
>  	if (!arch_validate_flags(vm_flags))
>  		return -EINVAL;

[...]
On Mon, Oct 28, 2024 at 08:43:08PM +0000, Lorenzo Stoakes wrote: > +/* > + * We check VMA flag validity early in the mmap() process, however this can > + * cause issues for arm64 when using MTE, which requires that it be used with > + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting > + * this operation. > + * > + * To avoid having to tear down a partially complete mapping we do this ahead of > + * time. > + */ > +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags) > +{ > + if (!IS_ENABLED(CONFIG_ARM64)) > + return vm_flags; > + > + if (shmem_file(file)) > + return vm_flags | VM_MTE_ALLOWED; > +} This doesn't build: mm/mmap.c:1595:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type] 1595 | } | ^ with that corrected: diff --git a/mm/mmap.c b/mm/mmap.c index d1ab4301c671..cea051c5fef3 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1587,11 +1587,10 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, */ static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags) { - if (!IS_ENABLED(CONFIG_ARM64)) - return vm_flags; + if (IS_ENABLED(CONFIG_ARM64) && shmem_file(file)) + vm_flags |= VM_MTE_ALLOWED; - if (shmem_file(file)) - return vm_flags | VM_MTE_ALLOWED; + return vm_flags; } unsigned long mmap_region(struct file *file, unsigned long addr, the relevant tests all pass for me. Tested-by: Mark Brown <broonie@kernel.org> I'd have expected arch_adjust_flags() to be something overridden by the arch headers (probably like arch_calc_vm_prot_bits() and friends), but if this is just a short-lived fix it's probably not worth the trouble.
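The per-arch override convention Mark alludes to (as used by arch_calc_vm_prot_bits() and friends) can be sketched in plain, compilable C. This is a userspace model only: the names and flag value below are illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>

/*
 * Model of the kernel's arch-override convention: the arch header
 * (e.g. asm/mman.h) defines its own static inline plus a same-named
 * function-like macro; the generic header (e.g. linux/mman.h) only
 * supplies a fallback when no arch macro is defined.
 */

/* "arch" header version: */
static inline unsigned long arch_adjust_flags(unsigned long vm_flags)
{
	/* 0x2 is a pretend VM_MTE_ALLOWED, for illustration only. */
	return vm_flags | 0x2UL;
}
/* Function-like macros do not expand recursively, so this is safe: */
#define arch_adjust_flags(vm_flags) arch_adjust_flags(vm_flags)

/* "generic" header fallback -- compiled out here because the arch
 * macro above is already defined: */
#ifndef arch_adjust_flags
static inline unsigned long arch_adjust_flags(unsigned long vm_flags)
{
	return vm_flags;
}
#endif
```

The same-named macro trick is how the kernel lets generic code call `arch_adjust_flags()` unconditionally while each arch decides whether to override it.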
On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote: > > VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits(): > > if (system_supports_mte() && (flags & MAP_ANONYMOUS)) > return VM_MTE_ALLOWED; Yeah, but that should just move into arch_validate_flags() too. There's no reason why that's done in a separate place. Linus
On Mon, Oct 28, 2024 at 09:05:33PM +0000, Mark Brown wrote: > On Mon, Oct 28, 2024 at 08:43:08PM +0000, Lorenzo Stoakes wrote: > > > +/* > > + * We check VMA flag validity early in the mmap() process, however this can > > + * cause issues for arm64 when using MTE, which requires that it be used with > > + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting > > + * this operation. > > + * > > + * To avoid having to tear down a partially complete mapping we do this ahead of > > + * time. > > + */ > > +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags) > > +{ > > + if (!IS_ENABLED(CONFIG_ARM64)) > > + return vm_flags; > > + > > + if (shmem_file(file)) > > + return vm_flags | VM_MTE_ALLOWED; > > +} > > This doesn't build: > > mm/mmap.c:1595:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type] > 1595 | } > | ^ Doh that'll teach me for rushing this... > > with that corrected: > > diff --git a/mm/mmap.c b/mm/mmap.c > index d1ab4301c671..cea051c5fef3 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1587,11 +1587,10 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, > */ > static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags) > { > - if (!IS_ENABLED(CONFIG_ARM64)) > - return vm_flags; > + if (IS_ENABLED(CONFIG_ARM64) && shmem_file(file)) > + vm_flags |= VM_MTE_ALLOWED; > > - if (shmem_file(file)) > - return vm_flags | VM_MTE_ALLOWED; > + return vm_flags; > } > > unsigned long mmap_region(struct file *file, unsigned long addr, > > the relevant tests all pass for me. > > Tested-by: Mark Brown <broonie@kernel.org> Thanks! > > I'd have expected arch_adjust_flags() to be something overridden by the > arch headers (probably like arch_calc_vm_prot_bits() and friends), but > if this is juat a short lived fix it's probably not worth the trouble. 
Yeah, this is just a sample solution that I had put together before Linus suggested a sensible alternative, which I'll code up... Good to confirm this is definitely the issue, thanks for testing!
On 10/28/24 22:19, Linus Torvalds wrote: > On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote: >> >> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits(): >> >> if (system_supports_mte() && (flags & MAP_ANONYMOUS)) >> return VM_MTE_ALLOWED; > > Yeah, but that should just move into arch_validate_flags() too. > There's no reason why that's done in a separate place. > > Linus Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway.
On Mon, Oct 28, 2024 at 10:28:47PM +0100, Vlastimil Babka wrote: > On 10/28/24 22:19, Linus Torvalds wrote: > > On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits(): > >> > >> if (system_supports_mte() && (flags & MAP_ANONYMOUS)) > >> return VM_MTE_ALLOWED; > > > > Yeah, but that should just move into arch_validate_flags() too. > > There's no reason why that's done in a separate place. > > > > Linus > > Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is > also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway. Unfortunately we can't do this and have everything work entirely consistently. This is because, while adding a file parameter to arch_validate_flags() lets us do this shmem check, we can't be sure MAP_ANON was set and we do not have access to this information at the point of the arch_validate_flags() check. We could check file == NULL, but this then excludes the MAP_ANON | MAP_HUGETLB case which is (probably accidentally) permitted by arch_calc_vm_flag_bits() and for the purposes of this fix we probably just want to keep behaviour identical. We could alternatively check the file is shmem _or_ hugetlb but this would amount to a change in behaviour and not sure we want to go there. Anyway I attach a patch that does what you suggest, I actually compiled this one (!) but don't have a system set up to test it (Mark?) If we bring this in it should probably be a separate commit... ----8<---- From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Date: Mon, 28 Oct 2024 22:05:44 +0000 Subject: [PATCH] mm: perform MTE check within arm64 hook entirely It doesn't make sense to have shmem explicitly check this one arch-specific case, it is arch-specific, so the arch should handle it. We know shmem is a case in which we want to permit MTE, so simply check for this directly. 
This also fixes the issue with checking arch_validate_flags() early, which would otherwise break mmap_region(). In order to implement this we must pass a file pointer, and additionally update the sparc code to accept this parameter too. We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but we risk inadvertently changing behaviour as we do not have mmap() flags available at the point of the arch_validate_flags() check and a MAP_ANON | MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED | MAP_HUGETLB would not. This is likely an oversight but we want to try to keep behaviour identical to before in this patch. So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if MAP_ANON. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- arch/arm64/include/asm/mman.h | 18 ++++++++++++++---- arch/sparc/include/asm/mman.h | 5 +++-- include/linux/mman.h | 2 +- mm/mmap.c | 4 ++-- mm/mprotect.c | 2 +- mm/shmem.c | 3 --- 6 files changed, 21 insertions(+), 13 deletions(-) diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h index 9e39217b4afb..44b7c8a1dd67 100644 --- a/arch/arm64/include/asm/mman.h +++ b/arch/arm64/include/asm/mman.h @@ -6,7 +6,9 @@ #ifndef BUILD_VDSO #include <linux/compiler.h> +#include <linux/fs.h> #include <linux/types.h> +#include <linux/shmem_fs.h> static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, unsigned long pkey) @@ -60,15 +62,23 @@ static inline bool arch_validate_prot(unsigned long prot, } #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr) -static inline bool arch_validate_flags(unsigned long vm_flags) +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags) { if (!system_supports_mte()) return true; - /* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */ - return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); + if (!(vm_flags & VM_MTE)) + return true; + + if (vm_flags & 
VM_MTE_ALLOWED) + return true; + + if (shmem_file(file)) + return true; + + return false; } -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags) +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags) #endif /* !BUILD_VDSO */ diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h index af9c10c83dc5..d426e1f7c2c1 100644 --- a/arch/sparc/include/asm/mman.h +++ b/arch/sparc/include/asm/mman.h @@ -10,6 +10,7 @@ int sparc_mmap_check(unsigned long addr, unsigned long len); #ifdef CONFIG_SPARC64 #include <asm/adi_64.h> +#include <linux/fs.h> static inline void ipi_set_tstate_mcde(void *arg) { @@ -54,11 +55,11 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr) return 1; } -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags) +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags) /* arch_validate_flags() - Ensure combination of flags is valid for a * VMA. */ -static inline bool arch_validate_flags(unsigned long vm_flags) +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags) { /* If ADI is being enabled on this VMA, check for ADI * capability on the platform and ensure VMA is suitable diff --git a/include/linux/mman.h b/include/linux/mman.h index 8ddca62d6460..82e6488026b7 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -117,7 +117,7 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr) * * Returns true if the VM_* flags are valid. */ -static inline bool arch_validate_flags(unsigned long flags) +static inline bool arch_validate_flags(struct file *file, unsigned long flags) { return true; } diff --git a/mm/mmap.c b/mm/mmap.c index 8462de1ee583..e06171d243ef 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1504,7 +1504,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, #ifdef CONFIG_SPARC64 /* TODO: Fix SPARC ADI! 
*/ - WARN_ON_ONCE(!arch_validate_flags(vm_flags)); + WARN_ON_ONCE(!arch_validate_flags(file, vm_flags)); #endif /* Lock the VMA since it is modified after insertion into VMA tree */ @@ -1587,7 +1587,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, return -EACCES; /* Allow architectures to sanity-check the vm_flags. */ - if (!arch_validate_flags(vm_flags)) + if (!arch_validate_flags(file, vm_flags)) return -EINVAL; /* Map writable and ensure this isn't a sealed memfd. */ diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f450af3252e..c6db98b893fc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -816,7 +816,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, } /* Allow architectures to sanity-check the new flags */ - if (!arch_validate_flags(newflags)) { + if (!arch_validate_flags(vma->vm_file, newflags)) { error = -EINVAL; break; } diff --git a/mm/shmem.c b/mm/shmem.c index 4ba1d00fabda..e87f5d6799a7 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; - /* arm64 - allow memory tagging on RAM-based files */ - vm_flags_set(vma, VM_MTE_ALLOWED); - file_accessed(file); /* This is anonymous shared memory if it is unlinked at the time of mmap */ if (inode->i_nlink) -- 2.47.0
On 10/28/24 23:14, Lorenzo Stoakes wrote: > On Mon, Oct 28, 2024 at 10:28:47PM +0100, Vlastimil Babka wrote: >> On 10/28/24 22:19, Linus Torvalds wrote: >> > On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote: >> >> >> >> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits(): >> >> >> >> if (system_supports_mte() && (flags & MAP_ANONYMOUS)) >> >> return VM_MTE_ALLOWED; >> > >> > Yeah, but that should just move into arch_validate_flags() too. >> > There's no reason why that's done in a separate place. >> > >> > Linus >> >> Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is >> also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway. > > Unfortunately we can't do this and have everything work entirely > consistently. Consistently with the current implementation, which seems inconsistent wrt hugetlbfs. > This is because, while adding a file parameter to arch_validate_flags() > lets us do this shmem check, we can't be sure MAP_ANON was set and we do > not have access to this information at the point of the > arch_validate_flags() check. > > We could check file == NULL, but this then excludes the MAP_ANON | > MAP_HUGETLB case which is (probably accidentally) permitted by > arch_calc_vm_flag_bits() and for the purposes of this fix we probably just > want to keep behaviour identical. > > We could alternatively check the file is shmem _or_ hugetlb but this would > amount to a change in behaviour and not sure we want to go there. > > Anyway I attach a patch that does what you suggest, I actually compiled > this one (!) but don't have a system set up to test it (Mark?) > > If we bring this in it should probably be a separate commit...
> > ----8<---- > From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001 > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Date: Mon, 28 Oct 2024 22:05:44 +0000 > Subject: [PATCH] mm: perform MTE check within arm64 hook entirely > > It doesn't make sense to have shmem explicitly check this one arch-specific > case, it is arch-specific, so the arch should handle it. We know shmem is a > case in which we want to permit MTE, so simply check for this directly. > > This also fixes the issue with checking arch_validate_flags() early, which > would otherwise break mmap_region(). > > In order to implement this we must pass a file pointer, and additionally > update the sparc code to accept this parameter too. > > We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but > we risk inadvertently changing behaviour as we do not have mmap() flags > available at the point of the arch_validate_flags() check and a MAP_ANON | > MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED | > MAP_HUGETLB would not. > > This is likely an oversight but we want to try to keep behaviour identical > to before in this patch. If it's confirmed to be oversight and MAP_SHARED hugetlbfs should be allowed, we could add another is_file_hugepages() condition > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > MAP_ANON. 
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > arch/arm64/include/asm/mman.h | 18 ++++++++++++++---- > arch/sparc/include/asm/mman.h | 5 +++-- > include/linux/mman.h | 2 +- > mm/mmap.c | 4 ++-- > mm/mprotect.c | 2 +- > mm/shmem.c | 3 --- > 6 files changed, 21 insertions(+), 13 deletions(-) > > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h > index 9e39217b4afb..44b7c8a1dd67 100644 > --- a/arch/arm64/include/asm/mman.h > +++ b/arch/arm64/include/asm/mman.h > @@ -6,7 +6,9 @@ > > #ifndef BUILD_VDSO > #include <linux/compiler.h> > +#include <linux/fs.h> > #include <linux/types.h> > +#include <linux/shmem_fs.h> > > static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, > unsigned long pkey) > @@ -60,15 +62,23 @@ static inline bool arch_validate_prot(unsigned long prot, > } > #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr) > > -static inline bool arch_validate_flags(unsigned long vm_flags) > +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags) > { > if (!system_supports_mte()) > return true; > > - /* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */ > - return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); > + if (!(vm_flags & VM_MTE)) > + return true; > + > + if (vm_flags & VM_MTE_ALLOWED) > + return true; > + > + if (shmem_file(file)) > + return true; > + > + return false; > } > -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags) > +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags) > > #endif /* !BUILD_VDSO */ > > diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h > index af9c10c83dc5..d426e1f7c2c1 100644 > --- a/arch/sparc/include/asm/mman.h > +++ b/arch/sparc/include/asm/mman.h > @@ -10,6 +10,7 @@ int sparc_mmap_check(unsigned long addr, unsigned long len); > > #ifdef CONFIG_SPARC64 > #include <asm/adi_64.h> > 
+#include <linux/fs.h> > > static inline void ipi_set_tstate_mcde(void *arg) > { > @@ -54,11 +55,11 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr) > return 1; > } > > -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags) > +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags) > /* arch_validate_flags() - Ensure combination of flags is valid for a > * VMA. > */ > -static inline bool arch_validate_flags(unsigned long vm_flags) > +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags) > { > /* If ADI is being enabled on this VMA, check for ADI > * capability on the platform and ensure VMA is suitable > diff --git a/include/linux/mman.h b/include/linux/mman.h > index 8ddca62d6460..82e6488026b7 100644 > --- a/include/linux/mman.h > +++ b/include/linux/mman.h > @@ -117,7 +117,7 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr) > * > * Returns true if the VM_* flags are valid. > */ > -static inline bool arch_validate_flags(unsigned long flags) > +static inline bool arch_validate_flags(struct file *file, unsigned long flags) > { > return true; > } > diff --git a/mm/mmap.c b/mm/mmap.c > index 8462de1ee583..e06171d243ef 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1504,7 +1504,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, > > #ifdef CONFIG_SPARC64 > /* TODO: Fix SPARC ADI! */ > - WARN_ON_ONCE(!arch_validate_flags(vm_flags)); > + WARN_ON_ONCE(!arch_validate_flags(file, vm_flags)); > #endif > > /* Lock the VMA since it is modified after insertion into VMA tree */ > @@ -1587,7 +1587,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > return -EACCES; > > /* Allow architectures to sanity-check the vm_flags. */ > - if (!arch_validate_flags(vm_flags)) > + if (!arch_validate_flags(file, vm_flags)) > return -EINVAL; > > /* Map writable and ensure this isn't a sealed memfd. 
*/ > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 6f450af3252e..c6db98b893fc 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -816,7 +816,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, > } > > /* Allow architectures to sanity-check the new flags */ > - if (!arch_validate_flags(newflags)) { > + if (!arch_validate_flags(vma->vm_file, newflags)) { > error = -EINVAL; > break; > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 4ba1d00fabda..e87f5d6799a7 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > if (ret) > return ret; > > - /* arm64 - allow memory tagging on RAM-based files */ > - vm_flags_set(vma, VM_MTE_ALLOWED); > - > file_accessed(file); > /* This is anonymous shared memory if it is unlinked at the time of mmap */ > if (inode->i_nlink) > -- > 2.47.0
On Tue, Oct 29, 2024 at 08:50:11AM +0100, Vlastimil Babka wrote: > On 10/28/24 23:14, Lorenzo Stoakes wrote: <snip> > > Unfortunately we can't do this and have everything work entirely > > consistently. > > Consistently with the current implementation, which seems inconsiscent wrt > hugetlbfs. Yeah 100% I think this is an oversight. <snip> > > This is likely an oversight but we want to try to keep behaviour identical > > to before in this patch. > > If it's confirmed to be oversight and MAP_SHARED hugetlbfs should be > allowed, we could add another is_file_hugepages() condition Yeah I mean I think this _is_ an oversight but I think best to delay a fix for that to a follow-up series and just do this for the hotfix. Might throw in a comment. Andrew requested a comment on the shmem_file() check anyway so can sling this in there! > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > > MAP_ANON. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Acked-by: Vlastimil Babka <vbabka@suse.cz> Thanks! <snip>
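The follow-up discussed here would amount to one extra condition in the arm64 check. The following is a userspace model of that extended check, with `struct file`, the flag values, and the file-type helpers stubbed out (these are not the kernel's definitions), and the `system_supports_mte()` guard omitted; the is_file_hugepages() branch is the hypothetical addition, which the posted hotfix deliberately does not include:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stubbed flag values and file representation -- illustrative only. */
#define VM_MTE         0x1UL
#define VM_MTE_ALLOWED 0x2UL

struct file { bool is_shmem, is_hugetlb; };

static bool shmem_file(const struct file *f)        { return f && f->is_shmem; }
static bool is_file_hugepages(const struct file *f) { return f && f->is_hugetlb; }

/*
 * Model of the arm64 arch_validate_flags() from the patch above,
 * extended with the is_file_hugepages() condition under discussion:
 * VM_MTE is only valid when the mapping was flagged as MTE-capable
 * or is backed by a RAM-based file.
 */
static bool arch_validate_flags(const struct file *file, unsigned long vm_flags)
{
	if (!(vm_flags & VM_MTE))
		return true;
	if (vm_flags & VM_MTE_ALLOWED)
		return true;
	return shmem_file(file) || is_file_hugepages(file);
}
```

The model makes the behaviour change concrete: a MAP_SHARED hugetlbfs mapping with PROT_MTE would start passing validation, which is exactly the ABI question deferred to a follow-up series.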
On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > It doesn't make sense to have shmem explicitly check this one arch-specific > case, it is arch-specific, so the arch should handle it. We know shmem is a > case in which we want to permit MTE, so simply check for this directly. > > This also fixes the issue with checking arch_validate_flags() early, which > would otherwise break mmap_region(). The relevant tests all pass with this on today's pending-fixes. Tested-by: Mark Brown <broonie@kernel.org>
On Tue, Oct 29, 2024 at 12:33:13PM +0000, Mark Brown wrote: > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > > > It doesn't make sense to have shmem explicitly check this one arch-specific > > case, it is arch-specific, so the arch should handle it. We know shmem is a > > case in which we want to permit MTE, so simply check for this directly. > > > > This also fixes the issue with checking arch_validate_flags() early, which > > would otherwise break mmap_region(). > > The relevant tests all pass with this on today's pending-fixes. Thanks, am respinning series now, will add your tag! > > Tested-by: Mark Brown <broonie@kernel.org>
Hi Lorenzo, Thanks for trying to fix this. On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001 > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Date: Mon, 28 Oct 2024 22:05:44 +0000 > Subject: [PATCH] mm: perform MTE check within arm64 hook entirely > > It doesn't make sense to have shmem explicitly check this one arch-specific > case, it is arch-specific, so the arch should handle it. We know shmem is a > case in which we want to permit MTE, so simply check for this directly. > > This also fixes the issue with checking arch_validate_flags() early, which > would otherwise break mmap_region(). > > In order to implement this we must pass a file pointer, and additionally > update the sparc code to accept this parameter too. > > We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but > we risk inadvertently changing behaviour as we do not have mmap() flags > available at the point of the arch_validate_flags() check and a MAP_ANON | > MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED | > MAP_HUGETLB would not. > > This is likely an oversight but we want to try to keep behaviour identical > to before in this patch. MAP_HUGETLB support for MTE is only in -next currently, so there wouldn't be any ABI change if we also allowed MAP_SHARED | MAP_HUGETLB. In 6.12, MAP_HUGETLB is not allowed to have PROT_MTE. > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > MAP_ANON. [...] > diff --git a/mm/shmem.c b/mm/shmem.c > index 4ba1d00fabda..e87f5d6799a7 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > if (ret) > return ret; > > - /* arm64 - allow memory tagging on RAM-based files */ > - vm_flags_set(vma, VM_MTE_ALLOWED); This breaks arm64 KVM if the VMM uses shared mappings for the memory slots (which is possible). 
We have kvm_vma_mte_allowed() that checks for the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly. I need to read this thread properly but why not pass the file argument to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote: > Hi Lorenzo, > > Thanks for trying to fix this. > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > > From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001 > > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Date: Mon, 28 Oct 2024 22:05:44 +0000 > > Subject: [PATCH] mm: perform MTE check within arm64 hook entirely > > > > It doesn't make sense to have shmem explicitly check this one arch-specific > > case, it is arch-specific, so the arch should handle it. We know shmem is a > > case in which we want to permit MTE, so simply check for this directly. > > > > This also fixes the issue with checking arch_validate_flags() early, which > > would otherwise break mmap_region(). > > > > In order to implement this we must pass a file pointer, and additionally > > update the sparc code to accept this parameter too. > > > > We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but > > we risk inadvertently changing behaviour as we do not have mmap() flags > > available at the point of the arch_validate_flags() check and a MAP_ANON | > > MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED | > > MAP_HUGETLB would not. > > > > This is likely an oversight but we want to try to keep behaviour identical > > to before in this patch. > > MAP_HUGETLB support for MTE is only in -next currently, so there > wouldn't be any ABI change if we also allowed MAP_SHARED | MAP_HUGETLB. > In 6.12, MAP_HUGETLB is not allowed to have PROT_MTE. > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > > MAP_ANON. > [...] 
> > diff --git a/mm/shmem.c > > index 4ba1d00fabda..e87f5d6799a7 100644 > > --- a/mm/shmem.c > > +++ b/mm/shmem.c > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > > if (ret) > > return ret; > > > > - /* arm64 - allow memory tagging on RAM-based files */ > > - vm_flags_set(vma, VM_MTE_ALLOWED); > > This breaks arm64 KVM if the VMM uses shared mappings for the memory > slots (which is possible). We have kvm_vma_mte_allowed() that checks for > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly. Ugh, yup, missed that, thanks. > > I need to read this thread properly but why not pass the file argument > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there? Can't really do that as it is entangled in a bunch of other stuff, e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch of places including arch code and... etc. etc. And definitely no good for a hotfix that has to be backported. I suggest that instead we don't drop the yucky shmem thing, which will set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting it for MAP_ANON, but the other changes will mean the arch_validate_flags() will be fixed too. So this just means not dropping the mm/shmem.c bit basically and everything should 'just work'? > > -- > Catalin But we definitely need to find a better way post-hotfix to sort all this out I'm sure you agree :)
On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote: > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote: > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > > > MAP_ANON. > > [...] > > > diff --git a/mm/shmem.c b/mm/shmem.c > > > index 4ba1d00fabda..e87f5d6799a7 100644 > > > --- a/mm/shmem.c > > > +++ b/mm/shmem.c > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > > > if (ret) > > > return ret; > > > > > > - /* arm64 - allow memory tagging on RAM-based files */ > > > - vm_flags_set(vma, VM_MTE_ALLOWED); > > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly. > > Ugh yup missed that thanks. > > > I need to read this thread properly but why not pass the file argument > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there? > > Can't really do that as it is entangled in a bunch of other stuff, > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch > of places including arch code and... etc. etc. Not calc_vm_prot_bits() but calc_vm_flag_bits(). arch_calc_vm_flag_bits() is only implemented by two architectures - arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap(). Basically we want to set VM_MTE_ALLOWED early during the mmap() call and, at the time, my thinking was to do it in calc_vm_flag_bits(). The calc_vm_prot_bits() OTOH is also called on the mprotect() path and is responsible for translating PROT_MTE into a VM_MTE flag without any checks. arch_validate_flags() would check if VM_MTE comes together with VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function checking VM_MTE_ALLOWED. 
Since calc_vm_flag_bits() did not take a file argument, the lazy approach was to add the flag explicitly for shmem (and hugetlbfs in -next). But I think it would be easier to just add the file argument to calc_vm_flag_bits() and do the check in the arch code to return VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and arch_validate_flags() (unless I missed something in the recent reworking). > I suggest instead we instead don't drop the yucky shmem thing, which will > set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting > it for MAP_ANON, but the other changes will mean the arch_validate_flags() > will be fixed too. > > So this just means not dropping the mm/shmem.c bit basically and everything > should 'just work'? If we can't get the calc_vm_flag_bits() approach to work, I'm fine with this as a fix and we'll look to do it properly from 6.13.
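Catalin's suggestion can likewise be modelled in plain C. The sketch below is a userspace approximation only (MAP_ANONYMOUS, the VM_MTE* values, and shmem_file() are stubbed, and the system_supports_mte() guard is omitted): a hypothetical file-aware arch_calc_vm_flag_bits() computes VM_MTE_ALLOWED once, early in do_mmap(), so arch_validate_flags() stays a pure flag test and nothing needs to be patched into the VMA by shmem_mmap() later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stubbed values -- illustrative only, not the kernel's definitions. */
#define MAP_ANONYMOUS  0x20UL
#define VM_MTE         0x1UL
#define VM_MTE_ALLOWED 0x2UL

struct file { bool is_shmem; };

static bool shmem_file(const struct file *f) { return f && f->is_shmem; }

/*
 * Hypothetical file-aware arch_calc_vm_flag_bits(): anonymous and
 * shmem-backed mappings may carry MTE tags, so VM_MTE_ALLOWED is
 * derived up front from the mmap() arguments rather than being set
 * by the shmem mmap hook after the VMA is partially installed.
 */
static unsigned long arch_calc_vm_flag_bits(const struct file *file,
					    unsigned long map_flags)
{
	if ((map_flags & MAP_ANONYMOUS) || shmem_file(file))
		return VM_MTE_ALLOWED;
	return 0;
}

/* arch_validate_flags() then remains a pure flag check. */
static bool arch_validate_flags(unsigned long vm_flags)
{
	return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
}
```

Because the flag lands in the VMA's vm_flags, later consumers such as kvm_vma_mte_allowed() would still see VM_MTE_ALLOWED on shmem-backed mappings, which is the breakage the shmem_mmap() removal caused.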
On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote: > On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote: > > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote: > > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote: > > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if > > > > MAP_ANON. > > > [...] > > > > diff --git a/mm/shmem.c b/mm/shmem.c > > > > index 4ba1d00fabda..e87f5d6799a7 100644 > > > > --- a/mm/shmem.c > > > > +++ b/mm/shmem.c > > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > > > > if (ret) > > > > return ret; > > > > > > > > - /* arm64 - allow memory tagging on RAM-based files */ > > > > - vm_flags_set(vma, VM_MTE_ALLOWED); > > > > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory > > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for > > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly. > > > > Ugh yup missed that thanks. > > > > > I need to read this thread properly but why not pass the file argument > > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there? > > > > Can't really do that as it is entangled in a bunch of other stuff, > > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch > > of places including arch code and... etc. etc. > > Not calc_vm_prot_bits() but calc_vm_flag_bits(). > arch_calc_vm_flag_bits() is only implemented by two architectures - > arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap(). > > Basically we want to set VM_MTE_ALLOWED early during the mmap() call > and, at the time, my thinking was to do it in calc_vm_flag_bits(). The > calc_vm_prot_bits() OTOH is also called on the mprotect() path and is > responsible for translating PROT_MTE into a VM_MTE flag without any > checks. arch_validate_flags() would check if VM_MTE comes together with > VM_MTE_ALLOWED. 
> But, as in the KVM case, that's not the only function
> checking VM_MTE_ALLOWED.
>
> Since calc_vm_flag_bits() did not take a file argument, the lazy
> approach was to add the flag explicitly for shmem (and hugetlbfs in
> -next). But I think it would be easier to just add the file argument to
> calc_vm_flag_bits() and do the check in the arch code to return
> VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> arch_validate_flags() (unless I missed something in the recent
> reworking).

I totally get why you're suggesting it - it's the right _place_ - but it
would require changes to a ton of code, which is no good for a backport,
and we don't _need_ to do it.

I'd rather do the smallest delta at this point. I'm not a huge fan of
sticking it in here (your point is wholly valid - it's a better place to
do it and we can change flags there - it's just not where you'd obviously
expect this to happen).

For instance, in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_ like
us not to touch here, by the way) we'd have to, what, pass NULL?

Passing file to arch_validate_flags() is icky, but it makes some sense
since we _always_ have it available and meaningful at the point of
invocation. If we added it to arch_calc_vm_flag_bits() there would be
places where it's not available, and then we're assuming we can just pass
NULL... it becomes a confusing mess, really. I also worry we might break
something somewhere this way - we're already exposed to subtle issues
here.
Alternatively, we can change my series by 2 lines (as I've already asked
Andrew to do): everything still works, the fix applies, the
VM_MTE_ALLOWED flag still works in an obvious way (it behaves exactly as
it did before), all is well with the world and we can frolic in the
fields freely and joyously :)

> > I suggest we instead don't drop the yucky shmem thing, which will
> > set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting
> > it for MAP_ANON, but the other changes will mean arch_validate_flags()
> > will be fixed too.
> >
> > So this just means not dropping the mm/shmem.c bit basically and everything
> > should 'just work'?
>
> If we can't get the calc_vm_flag_bits() approach to work, I'm fine with
> this as a fix and we'll look to do it properly from 6.13.

Since I'm going to be backporting this as a hotfix, I think it's
overwhelmingly better to just leave the shmem stuff in and keep the rest
the same.

I really would like us to figure out a better way overall from >=6.13
though, and replace all this with something saner :>) Happy to help and
collaborate on that!

> --
> Catalin

Cheers, and sorry to fiddle with arm64 stuff here - it sadly happens to
be where this issue becomes a thing with this hotfix!
On Tue, Oct 29, 2024 at 04:36:32PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote:
> > On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> > > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > > > MAP_ANON.
> > > > [...]
> > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > > > --- a/mm/shmem.c
> > > > > +++ b/mm/shmem.c
> > > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > > >  	if (ret)
> > > > >  		return ret;
> > > > >
> > > > > -	/* arm64 - allow memory tagging on RAM-based files */
> > > > > -	vm_flags_set(vma, VM_MTE_ALLOWED);
> > > >
> > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
> > >
> > > Ugh yup missed that thanks.
> > >
> > > > I need to read this thread properly but why not pass the file argument
> > > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
> > >
> > > Can't really do that as it is entangled in a bunch of other stuff,
> > > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> > > of places including arch code and... etc. etc.
> >
> > Not calc_vm_prot_bits() but calc_vm_flag_bits().
> > arch_calc_vm_flag_bits() is only implemented by two architectures -
> > arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
> >
> > Basically we want to set VM_MTE_ALLOWED early during the mmap() call
> > and, at the time, my thinking was to do it in calc_vm_flag_bits().
> > The calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
> > responsible for translating PROT_MTE into a VM_MTE flag without any
> > checks. arch_validate_flags() would check if VM_MTE comes together with
> > VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
> > checking VM_MTE_ALLOWED.
> >
> > Since calc_vm_flag_bits() did not take a file argument, the lazy
> > approach was to add the flag explicitly for shmem (and hugetlbfs in
> > -next). But I think it would be easier to just add the file argument to
> > calc_vm_flag_bits() and do the check in the arch code to return
> > VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> > arch_validate_flags() (unless I missed something in the recent
> > reworking).
>
> I mean I totally get why you're suggesting it

Not sure ;)

> - it's the right _place_ but...
> It would require changes to a ton of code which is no good for a backport
> and we don't _need_ to do it.
>
> I'd rather do the smallest delta at this point, as I am not a huge fan of
> sticking it in here (I mean your point is wholly valid - it's at a better
> place to do so and we can change flags here, it's just - it's not where you
> expect to do this obviously).
>
> I mean for instance in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_
> like us not to touch here by the way) we'd have to what pass NULL?

That's calc_vm_prot_bits(). I suggested calc_vm_flag_bits() (I know,
confusing names and a total lack of inspiration when we added MTE
support). The latter is only called in one place - do_mmap().

That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
check in there).
I don't think it's much worse than your proposal, assuming that it works:

diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 1dbfb56cb313..358bbaaafd41 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -6,6 +6,8 @@
 #ifndef BUILD_VDSO
 #include <linux/compiler.h>
+#include <linux/fs.h>
+#include <linux/shmem_fs.h>
 #include <linux/types.h>
 
 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
@@ -31,7 +33,7 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags)
 {
        /*
         * Only allow MTE on anonymous mappings as these are guaranteed to be
@@ -39,12 +41,12 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
         * filesystem supporting MTE (RAM-based).
         */
        if (system_supports_mte() &&
-           (flags & (MAP_ANONYMOUS | MAP_HUGETLB)))
+           (flags & (MAP_ANONYMOUS | MAP_HUGETLB) || shmem_file(file)))
                return VM_MTE_ALLOWED;
        return 0;
 }
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
 
 static inline bool arch_validate_prot(unsigned long prot,
        unsigned long addr __always_unused)
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
index 89b6beeda0b8..663f587dc789 100644
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_MMAN_H__
 #define __ASM_MMAN_H__
 
+#include <linux/fs.h>
 #include <uapi/asm/mman.h>
 
 /* PARISC cannot allow mdwe as it needs writable stacks */
@@ -11,7 +12,7 @@ static inline bool arch_memory_deny_write_exec_supported(void)
 }
 #define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported
 
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags)
 {
        /*
         * The stack on parisc grows upwards, so if userspace requests memory
@@ -23,6 +24,6 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
        return 0;
 }
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
 
 #endif /* __ASM_MMAN_H__ */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 8ddca62d6460..a842783ffa62 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_MMAN_H
 #define _LINUX_MMAN_H
 
+#include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/percpu_counter.h>
@@ -94,7 +95,7 @@ static inline void vm_unacct_memory(long pages)
 #endif
 
 #ifndef arch_calc_vm_flag_bits
-#define arch_calc_vm_flag_bits(flags) 0
+#define arch_calc_vm_flag_bits(file, flags) 0
 #endif
 
 #ifndef arch_validate_prot
@@ -151,13 +152,13 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
  * Combine the mmap "flags" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_flag_bits(unsigned long flags)
+calc_vm_flag_bits(struct file *file, unsigned long flags)
 {
        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
               _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
-              arch_calc_vm_flag_bits(flags);
+              arch_calc_vm_flag_bits(file, flags);
 }
 
 unsigned long vm_commit_limit(void);
diff --git a/mm/mmap.c b/mm/mmap.c
index f102314bb500..f904b3bba962 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -344,7 +344,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
         * to. we assume access permissions have been handled by the open
         * of the memory object, so we don't do any here.
         */
-       vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+       vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
                        mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
        /* Obtain the address to map to. we verify (or select) it and ensure
diff --git a/mm/nommu.c b/mm/nommu.c
index 635d028d647b..e9b5f527ab5b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -842,7 +842,7 @@ static unsigned long determine_vm_flags(struct file *file,
 {
        unsigned long vm_flags;
 
-       vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
+       vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(file, flags);
 
        if (!file) {
                /*
diff --git a/mm/shmem.c b/mm/shmem.c
index f24a0f34723e..ff194341fddb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2737,9 +2737,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
        if (ret)
                return ret;
 
-       /* arm64 - allow memory tagging on RAM-based files */
-       vm_flags_set(vma, VM_MTE_ALLOWED);
-
        file_accessed(file);
        /* This is anonymous shared memory if it is unlinked at the time of mmap */
        if (inode->i_nlink)
On Tue, Oct 29, 2024 at 05:02:23PM +0000, Catalin Marinas wrote:
> On Tue, Oct 29, 2024 at 04:36:32PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote:
> > > On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> > > > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > > > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > > > > MAP_ANON.
> > > > > [...]
> > > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > > > > --- a/mm/shmem.c
> > > > > > +++ b/mm/shmem.c
> > > > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > >  	if (ret)
> > > > > >  		return ret;
> > > > > >
> > > > > > -	/* arm64 - allow memory tagging on RAM-based files */
> > > > > > -	vm_flags_set(vma, VM_MTE_ALLOWED);
> > > > >
> > > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > > > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > > > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
> > > >
> > > > Ugh yup missed that thanks.
> > > >
> > > > > I need to read this thread properly but why not pass the file argument
> > > > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
> > > >
> > > > Can't really do that as it is entangled in a bunch of other stuff,
> > > > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> > > > of places including arch code and... etc. etc.
> > >
> > > Not calc_vm_prot_bits() but calc_vm_flag_bits().
> > > arch_calc_vm_flag_bits() is only implemented by two architectures -
> > > arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
> > >
> > > Basically we want to set VM_MTE_ALLOWED early during the mmap() call
> > > and, at the time, my thinking was to do it in calc_vm_flag_bits(). The
> > > calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
> > > responsible for translating PROT_MTE into a VM_MTE flag without any
> > > checks. arch_validate_flags() would check if VM_MTE comes together with
> > > VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
> > > checking VM_MTE_ALLOWED.
> > >
> > > Since calc_vm_flag_bits() did not take a file argument, the lazy
> > > approach was to add the flag explicitly for shmem (and hugetlbfs in
> > > -next). But I think it would be easier to just add the file argument to
> > > calc_vm_flag_bits() and do the check in the arch code to return
> > > VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> > > arch_validate_flags() (unless I missed something in the recent
> > > reworking).
> >
> > I mean I totally get why you're suggesting it
>
> Not sure ;)
>
> > - it's the right _place_ but...
> > It would require changes to a ton of code which is no good for a backport
> > and we don't _need_ to do it.
> >
> > I'd rather do the smallest delta at this point, as I am not a huge fan of
> > sticking it in here (I mean your point is wholly valid - it's at a better
> > place to do so and we can change flags here, it's just - it's not where you
> > expect to do this obviously).
> >
> > I mean for instance in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_
> > like us not to touch here by the way) we'd have to what pass NULL?
>
> That's calc_vm_prot_bits(). I suggested calc_vm_flag_bits() (I know,
> confusing names and total lack of inspiration when we added MTE
> support). The latter is only called in one place - do_mmap().
>
> That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
> check in there). I don't think it's much worse than your proposal,
> assuming that it works:

Right, sorry, misread.
Yeah, this is better - let me do a quick -v4 then!

Cheers, and sorry - pretty tired at this stage, I was looking at this all
last night...
On Tue, Oct 29, 2024 at 05:28:49PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 29, 2024 at 05:02:23PM +0000, Catalin Marinas wrote:
> > That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
> > check in there). I don't think it's much worse than your proposal,
> > assuming that it works:
>
> Right sorry misread. Yeah this is better, let me do a quick -v4 then!

Thanks!

> Cheers, sorry pretty tired at this stage, was looking at this all last
> night...

No worries. The way we handle MTE is pretty convoluted. Happy to
simplify it if we find a better way.
diff --git a/mm/mmap.c b/mm/mmap.c
index 66edf0ebba94..e686d57ed9f7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
        return do_vmi_munmap(&vmi, mm, start, len, uf, false);
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
+static unsigned long __mmap_region(struct file *file, unsigned long addr,
                unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
                struct list_head *uf)
 {
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma = NULL;
        pgoff_t pglen = PHYS_PFN(len);
-       struct vm_area_struct *merge;
        unsigned long charged = 0;
        struct vma_munmap_struct vms;
        struct ma_state mas_detach;
        struct maple_tree mt_detach;
        unsigned long end = addr + len;
-       bool writable_file_mapping = false;
        int error;
        VMA_ITERATOR(vmi, mm, addr);
        VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
        vm_flags_init(vma, vm_flags);
        vma->vm_page_prot = vm_get_page_prot(vm_flags);
 
+       if (vma_iter_prealloc(&vmi, vma)) {
+               error = -ENOMEM;
+               goto free_vma;
+       }
+
        if (file) {
                vma->vm_file = get_file(file);
                error = mmap_file(file, vma);
                if (error)
-                       goto unmap_and_free_vma;
-
-               if (vma_is_shared_maywrite(vma)) {
-                       error = mapping_map_writable(file->f_mapping);
-                       if (error)
-                               goto close_and_free_vma;
-
-                       writable_file_mapping = true;
-               }
+                       goto unmap_and_free_file_vma;
 
+               /* Drivers cannot alter the address of the VMA. */
+               WARN_ON_ONCE(addr != vma->vm_start);
                /*
-                * Expansion is handled above, merging is handled below.
-                * Drivers should not alter the address of the VMA.
+                * Drivers should not permit writability when previously it was
+                * disallowed.
                 */
-               if (WARN_ON((addr != vma->vm_start))) {
-                       error = -EINVAL;
-                       goto close_and_free_vma;
-               }
+               VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+                               !(vm_flags & VM_MAYWRITE) &&
+                               (vma->vm_flags & VM_MAYWRITE));
 
                vma_iter_config(&vmi, addr, end);
                /*
@@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
                 * vma again as we may succeed this time.
                 */
                if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+                       struct vm_area_struct *merge;
+
                        vmg.flags = vma->vm_flags;
                        /* If this fails, state is reset ready for a reattempt. */
                        merge = vma_merge_new_range(&vmg);
@@ -1494,7 +1492,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
                                vma = merge;
                                /* Update vm_flags to pick up the change. */
                                vm_flags = vma->vm_flags;
-                               goto unmap_writable;
+                               goto file_expanded;
                        }
                        vma_iter_config(&vmi, addr, end);
                }
@@ -1503,26 +1501,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
        } else if (vm_flags & VM_SHARED) {
                error = shmem_zero_setup(vma);
                if (error)
-                       goto free_vma;
+                       goto free_iter_vma;
        } else {
                vma_set_anonymous(vma);
        }
 
-       if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
-               error = -EACCES;
-               goto close_and_free_vma;
-       }
-
-       /* Allow architectures to sanity-check the vm_flags */
-       if (!arch_validate_flags(vma->vm_flags)) {
-               error = -EINVAL;
-               goto close_and_free_vma;
-       }
-
-       if (vma_iter_prealloc(&vmi, vma)) {
-               error = -ENOMEM;
-               goto close_and_free_vma;
-       }
+#ifdef CONFIG_SPARC64
+       /* TODO: Fix SPARC ADI! */
+       WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
 
        /* Lock the VMA since it is modified after insertion into VMA tree */
        vma_start_write(vma);
@@ -1536,10 +1523,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
         */
        khugepaged_enter_vma(vma, vma->vm_flags);
 
-       /* Once vma denies write, undo our temporary denial count */
-unmap_writable:
-       if (writable_file_mapping)
-               mapping_unmap_writable(file->f_mapping);
+file_expanded:
        file = vma->vm_file;
        ksm_add_vma(vma);
 expanded:
@@ -1572,23 +1556,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
        vma_set_page_prot(vma);
 
-       validate_mm(mm);
        return addr;
 
-close_and_free_vma:
-       vma_close(vma);
-
-       if (file || vma->vm_file) {
-unmap_and_free_vma:
-               fput(vma->vm_file);
-               vma->vm_file = NULL;
+unmap_and_free_file_vma:
+       fput(vma->vm_file);
+       vma->vm_file = NULL;
 
-               vma_iter_set(&vmi, vma->vm_end);
-               /* Undo any partial mapping done by a device driver. */
-               unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-       }
-       if (writable_file_mapping)
-               mapping_unmap_writable(file->f_mapping);
+       vma_iter_set(&vmi, vma->vm_end);
+       /* Undo any partial mapping done by a device driver. */
+       unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+       vma_iter_free(&vmi);
 free_vma:
        vm_area_free(vma);
 unacct_error:
@@ -1598,10 +1576,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 abort_munmap:
        vms_abort_munmap_vmas(&vms, &mas_detach);
 gather_failed:
-       validate_mm(mm);
        return error;
 }
 
+unsigned long mmap_region(struct file *file, unsigned long addr,
+       unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+       struct list_head *uf)
+{
+       unsigned long ret;
+       bool writable_file_mapping = false;
+
+       /* Check to see if MDWE is applicable. */
+       if (map_deny_write_exec(vm_flags, vm_flags))
+               return -EACCES;
+
+       /* Allow architectures to sanity-check the vm_flags. */
+       if (!arch_validate_flags(vm_flags))
+               return -EINVAL;
+
+       /* Map writable and ensure this isn't a sealed memfd. */
+       if (file && is_shared_maywrite(vm_flags)) {
+               int error = mapping_map_writable(file->f_mapping);
+
+               if (error)
+                       return error;
+               writable_file_mapping = true;
+       }
+
+       ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+
+       /* Clear our write mapping regardless of error. */
+       if (writable_file_mapping)
+               mapping_unmap_writable(file->f_mapping);
+
+       validate_mm(current->mm);
+       return ret;
+}
+
 static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 {
        int ret;