Message ID | 20211124192024.2408218-4-catalin.marinas@arm.com (mailing list archive)
---|---
State | New, archived
Series | Avoid live-lock in fault-in+uaccess loops with sub-page faults
On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > +++ b/fs/btrfs/ioctl.c > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > while (1) { > ret = -EFAULT; > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > + if (fault_in_exact_writeable(ubuf + sk_offset, > + *buf_size - sk_offset)) > break; > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); Couldn't we avoid all of this nastiness by doing ... @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, * problem. Otherwise we'll fault and then copy the buffer in * properly this next time through */ - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { - ret = 0; + ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh)); + if (ret) goto out; - } *sk_offset += sizeof(sh); @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode, int ret; int num_found = 0; unsigned long sk_offset = 0; + unsigned long next_offset = 0; if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) { *buf_size = sizeof(struct btrfs_ioctl_search_header); @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, while (1) { ret = -EFAULT; - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) + if (fault_in_writeable(ubuf + sk_offset + next_offset, + *buf_size - sk_offset - next_offset)) break; ret = btrfs_search_forward(root, &key, path, sk->min_transid); @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode, ret = copy_to_sk(path, &key, sk, buf_size, ubuf, &sk_offset, &num_found); btrfs_release_path(path); - if (ret) + if (ret > 0) + next_offset = ret; + else if (ret < 0) break; - } - if (ret > 0) + if (ret == -ENOSPC || ret > 0) ret = 0; err: sk->nr_items = num_found; (not shown: the tedious bits where the existing 'ret = 1' are converted to 'ret = -ENOSPC' in copy_to_sk()) (where __copy_to_user_nofault() is a new function that does exactly what copy_to_user_nofault() does, but returns the number of bytes copied) That way, the existing fault_in_writable() will get the fault, and we don't need to probe every 16 bytes.
On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote: > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > > +++ b/fs/btrfs/ioctl.c > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > while (1) { > > ret = -EFAULT; > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > + if (fault_in_exact_writeable(ubuf + sk_offset, > > + *buf_size - sk_offset)) > > break; > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > Couldn't we avoid all of this nastiness by doing ... I had a similar attempt initially but I concluded that it doesn't work: https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, > * problem. Otherwise we'll fault and then copy the buffer in > * properly this next time through > */ > - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { > - ret = 0; > + ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh)); > + if (ret) There is no requirement for the arch implementation to be exact and copy the maximum number of bytes possible. It can fail early while there are still some bytes left that would not fault. The only requirement is that if it is restarted from where it faulted, it makes some progress (on arm64 there is one extra byte). > goto out; > - } > > *sk_offset += sizeof(sh); > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode, > int ret; > int num_found = 0; > unsigned long sk_offset = 0; > + unsigned long next_offset = 0; > > if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) { > *buf_size = sizeof(struct btrfs_ioctl_search_header); > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > while (1) { > ret = -EFAULT; > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > + if (fault_in_writeable(ubuf + sk_offset + next_offset, > + *buf_size - sk_offset - next_offset)) > break; > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode, > ret = copy_to_sk(path, &key, sk, buf_size, ubuf, > &sk_offset, &num_found); > btrfs_release_path(path); > - if (ret) > + if (ret > 0) > + next_offset = ret; So after this point, ubuf+sk_offset+next_offset is writeable by fault_in_writable(). If copy_to_user() was attempted on ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts the copy from ubuf+sk_offset, so it returns exacting the same ret as in the previous iteration.
Catalin talked about the other change, but this part: On Wed, Nov 24, 2021 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote: > > (where __copy_to_user_nofault() is a new function that does exactly what > copy_to_user_nofault() does, but returns the number of bytes copied) If we want the "how many bytes" part, then we should just make copy_to_user_nofault() have the same semantics as a plain copy_to_user(). IOW, change it to return "number of bytes not copied". Looking at the current uses, such a change would be trivial. The only case that wants a 0/-EFAULT error is the bpf_probe_write_user(), everybody else already just wants "zero for success", so changing copy_to_user_nofault() would be trivial. And it really is odd and very non-intuitive that copy_to_user_nofault() has a completely different return value from copy_to_user(). So if _anybody_ wants a byte-count, that should just be fixed. Linus
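A minimal sketch of the change Linus suggests, assuming the current mm/maccess.c shape of copy_to_user_nofault() built around __copy_to_user_inatomic(): instead of collapsing any short copy into -EFAULT, the helper would return the "bytes not copied" count, mirroring copy_to_user(); the few existing callers (e.g. bpf_probe_write_user()) would then map a non-zero return to -EFAULT themselves.

long copy_to_user_nofault(void __user *dst, const void *src, size_t size)
{
	long ret = size;	/* nothing copied if the range fails access_ok() */

	if (access_ok(dst, size)) {
		pagefault_disable();
		/* __copy_to_user_inatomic() already returns bytes not copied */
		ret = __copy_to_user_inatomic(dst, src, size);
		pagefault_enable();
	}

	return ret;
}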
On Wed, Nov 24, 2021 at 03:00:00PM -0800, Linus Torvalds wrote: > On Wed, Nov 24, 2021 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote: > > (where __copy_to_user_nofault() is a new function that does exactly what > > copy_to_user_nofault() does, but returns the number of bytes copied) > > If we want the "how many bytes" part, then we should just make > copy_to_user_nofault() have the same semantics as a plain > copy_to_user(). > > IOW, change it to return "number of bytes not copied". > > Looking at the current uses, such a change would be trivial. The only > case that wants a 0/-EFAULT error is the bpf_probe_write_user(), > everybody else already just wants "zero for success", so changing > copy_to_user_nofault() would be trivial. I agree, if we want the number of bytes not copied, we should just change copy_{to,from}_user_nofault() and their callers (I can count three each). For this specific btrfs case, if we want to go with tuning the offset based on the fault address, we'd need copy_to_user_nofault() (or a new function) to be exact. IOW, fall back to byte-at-a-time copy until it hits the real faulting address.
On Thu, Nov 25, 2021 at 3:10 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > For this specific btrfs case, if we want to go with tuning the offset based > on the fault address, we'd need copy_to_user_nofault() (or a new > function) to be exact. I really don't see why you harp on the exactness. I really believe that the fix is to make the read/write probing just be more aggressive. Make the read/write probing require that AT LEAST <n> bytes be readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and ALIGN is whatever size that copy_from/to_user_xyz() might require just because it might do multi-byte accesses. In fact, make ALIGN be perhaps something reasonable like 512 bytes or whatever, and then you know you can handle the btrfs "copy a whole structure and reset if that fails" case too. Don't require that the fundamental copying routines (and whatever fixup the code might need) be some kind of byte-precise - it's the error case that should instead be made stricter. If the user gave you a range that triggered a pointer color mismatch, then returning an error is fine, rather than say "we'll do as much as we can and waste time and effort on being byte-exact too". Your earlier argument was that it was too expensive to probe things. That was based on looking at the whole range that might be MB (or GB) in size. So just make it check the first <n> bytes, and problem solved. Linus
On Thu, Nov 25, 2021 at 10:13:25AM -0800, Linus Torvalds wrote: > On Thu, Nov 25, 2021 at 3:10 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > For this specific btrfs case, if we want go with tuning the offset based > > on the fault address, we'd need copy_to_user_nofault() (or a new > > function) to be exact. > > I really don't see why you harp on the exactness. I guess because I always thought we either fix fault_in_writable() to probe the whole range (this series) or we change the loops to take the copy_to_user() returned value into account when re-faulting. > I really believe that the fix is to make the read/write probing just > be more aggressive. > > Make the read/write probing require that AT LEAST <n> bytes be > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and > ALIGN is whatever size that copy_from/to_user_xyz() might require just > because it might do multi-byte accesses. > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or > whatever, and then you know you can handle the btrfs "copy a whole > structure and reset if that fails" case too. IIUC what you are suggesting, we still need changes to the btrfs loop similar to willy's but that should work fine together with a slightly more aggressive fault_in_writable(). A probing of at least sizeof(struct btrfs_ioctl_search_key) should suffice without any loop changes and 512 would cover it but it doesn't look generic enough. We could pass a 'probe_prefix' argument to fault_in_exact_writeable() to only probe this and btrfs would just specify the above sizeof(). > Don't require that the fundamental copying routines (and whatever > fixup the code might need) be some kind of byte-precise - it's the > error case that should instead be made stricter. > > If the user gave you a range that triggered a pointer color mismatch, > then returning an error is fine, rather than say "we'll do as much as > we can and waste time and effort on being byte-exact too". Yes, I'm fine with not copying anything at all, I just want to avoid the infinite loop. I think we are down to three approaches: 1. Probe the whole range in fault_in_*() for sub-page faults, no need to worry about copy_*_user() failures. 2. Get fault_in_*() to probe a sufficient prefix to cover the uaccess inexactness. In addition, change the btrfs loop to fault-in from where the copy_to_user() failed (willy's suggestion combined with the more aggressive probing in fault_in_*()). 3. Implement fault_in_exact_writeable(uaddr, size, probe_prefix) where loops that do some rewind would pass an appropriate value. (1) is this series, (2) requires changing the loop logic, (3) looks pretty simple. I'm happy to attempt either (2) or (3) with a preference for the latter.
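A rough sketch of what option (3) could look like, to make the comparison concrete. SUBPAGE_PROBE_INTERVAL is assumed to be a per-arch constant (16 on arm64 with MTE, 0 where sub-page faults cannot happen; the same idea appears as SUBPAGE_PROBE_INTERVAL / SUBPAGE_FAULT_SIZE later in the thread), and a caller such as search_ioctl() would pass its rewind size, the sizeof() mentioned above, as probe_prefix:

size_t fault_in_exact_writeable(char __user *uaddr, size_t size,
				size_t probe_prefix)
{
	size_t prefix = min(size, probe_prefix);
	size_t offset;

	if (SUBPAGE_PROBE_INTERVAL && prefix) {
		/* probe every sub-page granule of the prefix ... */
		for (offset = 0; offset < prefix; offset += SUBPAGE_PROBE_INTERVAL)
			if (unlikely(__put_user(0, uaddr + offset) != 0))
				return size - offset;
		/* ... including the granule holding its last byte */
		if (unlikely(__put_user(0, uaddr + prefix - 1) != 0))
			return size - (prefix - 1);
	}
	/* beyond the prefix, page-granularity probing is sufficient */
	return fault_in_writeable(uaddr, size);
}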
On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote: > > I really believe that the fix is to make the read/write probing just > > be more aggressive. > > > > Make the read/write probing require that AT LEAST <n> bytes be > > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and > > ALIGN is whatever size that copy_from/to_user_xyz() might require just > > because it might do multi-byte accesses. > > > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or > > whatever, and then you know you can handle the btrfs "copy a whole > > structure and reset if that fails" case too. > > IIUC what you are suggesting, we still need changes to the btrfs loop > similar to willy's but that should work fine together with a slightly > more aggressive fault_in_writable(). > > A probing of at least sizeof(struct btrfs_ioctl_search_key) should > suffice without any loop changes and 512 would cover it but it doesn't > look generic enough. We could pass a 'probe_prefix' argument to > fault_in_exact_writeable() to only probe this and btrfs would just > specify the above sizeof(). How about something like this? +++ b/mm/gup.c @@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) if (unlikely(size == 0)) return 0; + if (SUBPAGE_PROBE_INTERVAL) { + while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) { + if (unlikely(__put_user(0, uaddr) != 0)) + goto out; + uaddr += SUBPAGE_PROBE_INTERVAL; + } + } if (!PAGE_ALIGNED(uaddr)) { if (unlikely(__put_user(0, uaddr) != 0)) return size; ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us leave it as 0. That way we probe all the way to the end of the current page and the start of the next page. Oh, that needs to be checked to not exceed size as well ... anyway, you get the idea.
On Thu, Nov 25, 2021 at 09:02:29PM +0000, Matthew Wilcox wrote: > On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote: > > > I really believe that the fix is to make the read/write probing just > > > be more aggressive. > > > > > > Make the read/write probing require that AT LEAST <n> bytes be > > > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and > > > ALIGN is whatever size that copy_from/to_user_xyz() might require just > > > because it might do multi-byte accesses. > > > > > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or > > > whatever, and then you know you can handle the btrfs "copy a whole > > > structure and reset if that fails" case too. > > > > IIUC what you are suggesting, we still need changes to the btrfs loop > > similar to willy's but that should work fine together with a slightly > > more aggressive fault_in_writable(). > > > > A probing of at least sizeof(struct btrfs_ioctl_search_key) should > > suffice without any loop changes and 512 would cover it but it doesn't > > look generic enough. We could pass a 'probe_prefix' argument to > > fault_in_exact_writeable() to only probe this and btrfs would just > > specify the above sizeof(). > > How about something like this? > > +++ b/mm/gup.c > @@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) > > if (unlikely(size == 0)) > return 0; > + if (SUBPAGE_PROBE_INTERVAL) { > + while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) { > + if (unlikely(__put_user(0, uaddr) != 0)) > + goto out; > + uaddr += SUBPAGE_PROBE_INTERVAL; > + } > + } > if (!PAGE_ALIGNED(uaddr)) { > if (unlikely(__put_user(0, uaddr) != 0)) > return size; > > ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us > leave it as 0. That way we probe all the way to the end of the current > page and the start of the next page. It doesn't help if the copy_to_user() fault happens 16 bytes into the second page for example. The fault_in() passes, copy_to_user() fails and the loop restarts from the same place. With sub-page faults, the page boundary doesn't have any relevance. We want to probe the beginning of the buffer that's at least as big as the loop rewind size even if it goes past a page boundary.
On Thu, Nov 25, 2021 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote: > On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote: > > > I really believe that the fix is to make the read/write probing just > > > be more aggressive. > > > > > > Make the read/write probing require that AT LEAST <n> bytes be > > > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and > > > ALIGN is whatever size that copy_from/to_user_xyz() might require just > > > because it might do multi-byte accesses. > > > > > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or > > > whatever, and then you know you can handle the btrfs "copy a whole > > > structure and reset if that fails" case too. > > > > IIUC what you are suggesting, we still need changes to the btrfs loop > > similar to willy's but that should work fine together with a slightly > > more aggressive fault_in_writable(). > > > > A probing of at least sizeof(struct btrfs_ioctl_search_key) should > > suffice without any loop changes and 512 would cover it but it doesn't > > look generic enough. We could pass a 'probe_prefix' argument to > > fault_in_exact_writeable() to only probe this and btrfs would just > > specify the above sizeof(). > > How about something like this? > > +++ b/mm/gup.c > @@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) > > if (unlikely(size == 0)) > return 0; > + if (SUBPAGE_PROBE_INTERVAL) { > + while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) { > + if (unlikely(__put_user(0, uaddr) != 0)) > + goto out; > + uaddr += SUBPAGE_PROBE_INTERVAL; > + } > + } > if (!PAGE_ALIGNED(uaddr)) { > if (unlikely(__put_user(0, uaddr) != 0)) > return size; > > ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us > leave it as 0. That way we probe all the way to the end of the current > page and the start of the next page. > > Oh, that needs to be checked to not exceed size as well ... anyway, > you get the idea. Note that we don't need this additional probing when accessing the buffer with byte granularity. That's probably the common case. Andreas
On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote: > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > > > +++ b/fs/btrfs/ioctl.c > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > > > while (1) { > > > ret = -EFAULT; > > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > > + if (fault_in_exact_writeable(ubuf + sk_offset, > > > + *buf_size - sk_offset)) > > > break; > > > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > > > Couldn't we avoid all of this nastiness by doing ... > > I had a similar attempt initially but I concluded that it doesn't work: > > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com > > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, > > * problem. Otherwise we'll fault and then copy the buffer in > > * properly this next time through > > */ > > - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { > > - ret = 0; > > + ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh)); > > + if (ret) > > There is no requirement for the arch implementation to be exact and copy > the maximum number of bytes possible. It can fail early while there are > still some bytes left that would not fault. The only requirement is that > if it is restarted from where it faulted, it makes some progress (on > arm64 there is one extra byte). > > > goto out; > > - } > > > > *sk_offset += sizeof(sh); > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode, > > int ret; > > int num_found = 0; > > unsigned long sk_offset = 0; > > + unsigned long next_offset = 0; > > > > if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) { > > *buf_size = sizeof(struct btrfs_ioctl_search_header); > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > while (1) { > > ret = -EFAULT; > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > + if (fault_in_writeable(ubuf + sk_offset + next_offset, > > + *buf_size - sk_offset - next_offset)) > > break; > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode, > > ret = copy_to_sk(path, &key, sk, buf_size, ubuf, > > &sk_offset, &num_found); > > btrfs_release_path(path); > > - if (ret) > > + if (ret > 0) > > + next_offset = ret; > > So after this point, ubuf+sk_offset+next_offset is writeable by > fault_in_writable(). If copy_to_user() was attempted on > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts > the copy from ubuf+sk_offset, so it returns exacting the same ret as in > the previous iteration. So this means that after a short copy_to_user_nofault(), copy_to_sk() needs to figure out the actual point of failure. We'll have the same problem elsewhere, so this should probably be a generic helper. The alignment hacks are arch specific, so maybe we can have a generic version that assumes no alignment restrictions, with arch-specific overrides. Once we know the exact point of failure, a fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if the failure is pertinent. Once we know that the failure isn't pertinent, we're safe to retry the original fault_in_writeable(). Thanks, Andreas
On Thu, Nov 25, 2021 at 11:25:54PM +0100, Andreas Gruenbacher wrote: > On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote: > > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > > > > +++ b/fs/btrfs/ioctl.c > > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > > > > > while (1) { > > > > ret = -EFAULT; > > > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > > > + if (fault_in_exact_writeable(ubuf + sk_offset, > > > > + *buf_size - sk_offset)) > > > > break; > > > > > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > > > > > Couldn't we avoid all of this nastiness by doing ... > > > > I had a similar attempt initially but I concluded that it doesn't work: > > > > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com > > > > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, > > > * problem. Otherwise we'll fault and then copy the buffer in > > > * properly this next time through > > > */ > > > - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { > > > - ret = 0; > > > + ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh)); > > > + if (ret) > > > > There is no requirement for the arch implementation to be exact and copy > > the maximum number of bytes possible. It can fail early while there are > > still some bytes left that would not fault. The only requirement is that > > if it is restarted from where it faulted, it makes some progress (on > > arm64 there is one extra byte). > > > > > goto out; > > > - } > > > > > > *sk_offset += sizeof(sh); > > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode, > > > int ret; > > > int num_found = 0; > > > unsigned long sk_offset = 0; > > > + unsigned long next_offset = 0; > > > > > > if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) { > > > *buf_size = sizeof(struct btrfs_ioctl_search_header); > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > > > while (1) { > > > ret = -EFAULT; > > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > > + if (fault_in_writeable(ubuf + sk_offset + next_offset, > > > + *buf_size - sk_offset - next_offset)) > > > break; > > > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode, > > > ret = copy_to_sk(path, &key, sk, buf_size, ubuf, > > > &sk_offset, &num_found); > > > btrfs_release_path(path); > > > - if (ret) > > > + if (ret > 0) > > > + next_offset = ret; > > > > So after this point, ubuf+sk_offset+next_offset is writeable by > > fault_in_writable(). If copy_to_user() was attempted on > > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts > > the copy from ubuf+sk_offset, so it returns exacting the same ret as in > > the previous iteration. > > So this means that after a short copy_to_user_nofault(), copy_to_sk() > needs to figure out the actual point of failure. We'll have the same > problem elsewhere, so this should probably be a generic helper. The > alignment hacks are arch specific, so maybe we can have a generic > version that assumes no alignment restrictions, with arch-specific > overrides. > > Once we know the exact point of failure, a > fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if > the failure is pertinent. 
Once we know that the failure isn't > pertinent, we're safe to retry the original fault_in_writeable(). The "exact point of failure" is problematic since copy_to_user() may fail a few bytes before the actual fault point (e.g. by doing an unaligned store). As per Linus' reply, we can work around this by doing a sub-page fault_in_writable(point_of_failure, align) where 'align' should cover the copy_to_user() impreciseness. (of course, fault_in_writable() takes the full size argument but behind the scene it probes the 'align' prefix at sub-page fault granularity)
On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search > ioctl") addressed a lockdep warning by pre-faulting the user pages and > attempting the copy_to_user_nofault() in an infinite loop. On > architectures like arm64 with MTE, an access may fault within a page at > a location different from what fault_in_writeable() probed. Since the > sk_offset is rewound to the previous struct btrfs_ioctl_search_header > boundary, there is no guaranteed forward progress and search_ioctl() may > live-lock. > > Use fault_in_exact_writeable() instead which probes the entire user > buffer for faults at sub-page granularity. > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > Reported-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David Sterba <dsterba@suse.com>
On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Thu, Nov 25, 2021 at 11:25:54PM +0100, Andreas Gruenbacher wrote: > > On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote: > > > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote: > > > > > +++ b/fs/btrfs/ioctl.c > > > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > > > > > > > while (1) { > > > > > ret = -EFAULT; > > > > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > > > > + if (fault_in_exact_writeable(ubuf + sk_offset, > > > > > + *buf_size - sk_offset)) > > > > > break; > > > > > > > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > > > > > > > Couldn't we avoid all of this nastiness by doing ... > > > > > > I had a similar attempt initially but I concluded that it doesn't work: > > > > > > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com > > > > > > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, > > > > * problem. Otherwise we'll fault and then copy the buffer in > > > > * properly this next time through > > > > */ > > > > - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { > > > > - ret = 0; > > > > + ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh)); > > > > + if (ret) > > > > > > There is no requirement for the arch implementation to be exact and copy > > > the maximum number of bytes possible. It can fail early while there are > > > still some bytes left that would not fault. The only requirement is that > > > if it is restarted from where it faulted, it makes some progress (on > > > arm64 there is one extra byte). > > > > > > > goto out; > > > > - } > > > > > > > > *sk_offset += sizeof(sh); > > > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode, > > > > int ret; > > > > int num_found = 0; > > > > unsigned long sk_offset = 0; > > > > + unsigned long next_offset = 0; > > > > > > > > if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) { > > > > *buf_size = sizeof(struct btrfs_ioctl_search_header); > > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode, > > > > > > > > while (1) { > > > > ret = -EFAULT; > > > > - if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) > > > > + if (fault_in_writeable(ubuf + sk_offset + next_offset, > > > > + *buf_size - sk_offset - next_offset)) > > > > break; > > > > > > > > ret = btrfs_search_forward(root, &key, path, sk->min_transid); > > > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode, > > > > ret = copy_to_sk(path, &key, sk, buf_size, ubuf, > > > > &sk_offset, &num_found); > > > > btrfs_release_path(path); > > > > - if (ret) > > > > + if (ret > 0) > > > > + next_offset = ret; > > > > > > So after this point, ubuf+sk_offset+next_offset is writeable by > > > fault_in_writable(). If copy_to_user() was attempted on > > > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts > > > the copy from ubuf+sk_offset, so it returns exacting the same ret as in > > > the previous iteration. > > > > So this means that after a short copy_to_user_nofault(), copy_to_sk() > > needs to figure out the actual point of failure. We'll have the same > > problem elsewhere, so this should probably be a generic helper. 
The > > alignment hacks are arch specific, so maybe we can have a generic > > version that assumes no alignment restrictions, with arch-specific > > overrides. > > > > Once we know the exact point of failure, a > > fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if > > the failure is pertinent. Once we know that the failure isn't > > pertinent, we're safe to retry the original fault_in_writeable(). > > The "exact point of failure" is problematic since copy_to_user() may > fail a few bytes before the actual fault point (e.g. by doing an > unaligned store). That's why after the initial failure, we must keep trying until we hit the actual point of failure independent of alignment. If there's even a single writable byte left, fault_in_writable() won't fail and we'll be stuck in a loop. On the other hand, once we've reached the actual point of failure, the existing version of fault_in_writeable() will work for sub-page faults as well. > As per Linus' reply, we can work around this by doing > a sub-page fault_in_writable(point_of_failure, align) where 'align' > should cover the copy_to_user() impreciseness. > > (of course, fault_in_writable() takes the full size argument but behind > the scene it probes the 'align' prefix at sub-page fault granularity) That doesn't make sense; we don't want fault_in_writable() to fail or succeed depending on the alignment of the address range passed to it. Have a look at the below code to see what I mean. Function copy_to_user_nofault_unaligned() should be further optimized, maybe as mm/maccess.c:copy_from_kernel_nofault() and/or per architecture depending on the actual alignment rules; I'm not sure. Thanks, Andreas diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4e03a6d3aa32..067408fd26f9 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -6764,7 +6764,8 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv, int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb, void __user *dstv, - unsigned long start, unsigned long len) + unsigned long start, unsigned long len, + void __user **copy_failure) { size_t cur; size_t offset; @@ -6773,6 +6774,7 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb, char __user *dst = (char __user *)dstv; unsigned long i = get_eb_page_index(start); int ret = 0; + size_t rest; WARN_ON(start > eb->len); WARN_ON(start + len > eb->start + eb->len); @@ -6784,7 +6786,9 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb, cur = min(len, (PAGE_SIZE - offset)); kaddr = page_address(page); - if (copy_to_user_nofault(dst, kaddr + offset, cur)) { + rest = copy_to_user_nofault_unaligned(dst, kaddr + offset, cur); + if (rest) { + *copy_failure = dst + cur - rest; ret = -EFAULT; break; } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 0399cf8e3c32..833ff597a27f 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -238,9 +238,11 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv, void read_extent_buffer(const struct extent_buffer *eb, void *dst, unsigned long start, unsigned long len); +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size); int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb, void __user *dst, unsigned long start, - unsigned long len); + unsigned long len, + void __user **copy_failure); void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *src); void write_extent_buffer_chunk_tree_uuid(const struct 
extent_buffer *eb, const void *src); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index fb8cc9642ac4..646f9c0251d9 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key, return 1; } +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size) +{ + size_t rest = copy_to_user_nofault(to, from, size); + + if (rest) { + size_t n; + + for (n = size - rest; n < size; n++) { + if (copy_to_user_nofault(to + n, from + n, 1)) + break; + } + rest = size - n; + } + return rest; +} + static noinline int copy_to_sk(struct btrfs_path *path, struct btrfs_key *key, struct btrfs_ioctl_search_key *sk, size_t *buf_size, char __user *ubuf, unsigned long *sk_offset, - int *num_found) + int *num_found, + void __user **copy_failure) { u64 found_transid; struct extent_buffer *leaf; @@ -2069,6 +2086,7 @@ static noinline int copy_to_sk(struct btrfs_path *path, int i; int slot; int ret = 0; + size_t rest; leaf = path->nodes[0]; slot = path->slots[0]; @@ -2121,7 +2139,9 @@ static noinline int copy_to_sk(struct btrfs_path *path, * problem. Otherwise we'll fault and then copy the buffer in * properly this next time through */ - if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) { + rest = copy_to_user_nofault_unaligned(ubuf + *sk_offset, &sh, sizeof(sh)); + if (rest) { + *copy_failure = ubuf + *sk_offset + sizeof(sh) - rest; ret = 0; goto out; } @@ -2135,7 +2155,8 @@ static noinline int copy_to_sk(struct btrfs_path *path, * * sk_offset so we copy the full thing again. */ if (read_extent_buffer_to_user_nofault(leaf, up, - item_off, item_len)) { + item_off, item_len, + copy_failure)) { ret = 0; *sk_offset -= sizeof(sh); goto out; @@ -2222,6 +2243,8 @@ static noinline int search_ioctl(struct inode *inode, key.offset = sk->min_offset; while (1) { + void __user *copy_failure = NULL; + ret = -EFAULT; if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) break; @@ -2233,11 +2256,13 @@ static noinline int search_ioctl(struct inode *inode, goto err; } ret = copy_to_sk(path, &key, sk, buf_size, ubuf, - &sk_offset, &num_found); + &sk_offset, &num_found, ©_failure); btrfs_release_path(path); if (ret) break; - + ret = -EFAULT; + if (copy_failure && fault_in_writeable(copy_failure, 1)) + break; } if (ret > 0) ret = 0;
On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote: > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > As per Linus' reply, we can work around this by doing > > a sub-page fault_in_writable(point_of_failure, align) where 'align' > > should cover the copy_to_user() impreciseness. > > > > (of course, fault_in_writable() takes the full size argument but behind > > the scene it probes the 'align' prefix at sub-page fault granularity) > > That doesn't make sense; we don't want fault_in_writable() to fail or > succeed depending on the alignment of the address range passed to it. If we know that the arch copy_to_user() has an error of say maximum 16 bytes (or 15 rather on arm64), we can instead get fault_in_writeable() to probe first 16 bytes rather than 1. > Have a look at the below code to see what I mean. Function > copy_to_user_nofault_unaligned() should be further optimized, maybe as > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture > depending on the actual alignment rules; I'm not sure. [...] > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key, > return 1; > } > > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size) > +{ > + size_t rest = copy_to_user_nofault(to, from, size); > + > + if (rest) { > + size_t n; > + > + for (n = size - rest; n < size; n++) { > + if (copy_to_user_nofault(to + n, from + n, 1)) > + break; > + } > + rest = size - n; > + } > + return rest; That's what I was trying to avoid. That's basically a fall-back to byte at a time copy (we do this in copy_mount_options(); at some point we even had a copy_from_user_exact() IIRC). Linus' idea (if I got it correctly) was instead to slightly extend the probing in fault_in_writeable() for the beginning of the buffer from 1 byte to some per-arch range. I attempted the above here and works ok: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix but too late to post it this evening, I'll do it in the next day or so as an alternative to this series.
On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote: > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > As per Linus' reply, we can work around this by doing > > > a sub-page fault_in_writable(point_of_failure, align) where 'align' > > > should cover the copy_to_user() impreciseness. > > > > > > (of course, fault_in_writable() takes the full size argument but behind > > > the scene it probes the 'align' prefix at sub-page fault granularity) > > > > That doesn't make sense; we don't want fault_in_writable() to fail or > > succeed depending on the alignment of the address range passed to it. > > If we know that the arch copy_to_user() has an error of say maximum 16 > bytes (or 15 rather on arm64), we can instead get fault_in_writeable() > to probe the first 16 bytes rather than 1. That isn't going to help one bit: [raw_]copy_to_user() is allowed to copy as little or as much as it wants as long as it follows the rules documented in include/linux/uaccess.h: [] If copying succeeds, the return value must be 0. If some data cannot be [] fetched, it is permitted to copy less than had been fetched; the only [] hard requirement is that not storing anything at all (i.e. returning size) [] should happen only when nothing could be copied. In other words, you don't [] have to squeeze as much as possible - it is allowed, but not necessary. When fault_in_writeable() tells us that an address range is accessible in principle, that doesn't mean that copy_to_user() will allow us to access it in arbitrary chunks. It's also not the case that fault_in_writeable(addr, size) is always followed by copy_to_user(addr, ..., size) for the exact same address range, not even in this case. These alignment restrictions have nothing to do with page or sub-page faults. I'm also fairly sure that passing in an unaligned buffer will send search_ioctl into an endless loop on architectures with copy_to_user() alignment restrictions; there don't seem to be any buffer alignment checks. > > Have a look at the below code to see what I mean. Function > > copy_to_user_nofault_unaligned() should be further optimized, maybe as > > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture > > depending on the actual alignment rules; I'm not sure. > [...] > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key, > > return 1; > > } > > > > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size) > > +{ > > + size_t rest = copy_to_user_nofault(to, from, size); > > + > > + if (rest) { > > + size_t n; > > + > > + for (n = size - rest; n < size; n++) { > > + if (copy_to_user_nofault(to + n, from + n, 1)) > > + break; > > + } > > + rest = size - n; > > + } > > + return rest; > > That's what I was trying to avoid. That's basically a fall-back to byte > at a time copy (we do this in copy_mount_options(); at some point we > even had a copy_from_user_exact() IIRC). We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned. It's just not clear that it's worth it. > Linus' idea (if I got it correctly) was instead to slightly extend the > probing in fault_in_writeable() for the beginning of the buffer from 1 > byte to some per-arch range. 
> > I attempted the above here and works ok: > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix > > but too late to post it this evening, I'll do it in the next day or so > as an alternative to this series. > > -- > Catalin > Thanks, Andreas
On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote: > > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > As per Linus' reply, we can work around this by doing > > > > a sub-page fault_in_writable(point_of_failure, align) where 'align' > > > > should cover the copy_to_user() impreciseness. > > > > > > > > (of course, fault_in_writable() takes the full size argument but behind > > > > the scene it probes the 'align' prefix at sub-page fault granularity) > > > > > > That doesn't make sense; we don't want fault_in_writable() to fail or > > > succeed depending on the alignment of the address range passed to it. > > > > If we know that the arch copy_to_user() has an error of say maximum 16 > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable() > > to probe the first 16 bytes rather than 1. > > That isn't going to help one bit: [raw_]copy_to_user() is allowed to > copy as little or as much as it wants as long as it follows the rules > documented in include/linux/uaccess.h: > > [] If copying succeeds, the return value must be 0. If some data cannot be > [] fetched, it is permitted to copy less than had been fetched; the only > [] hard requirement is that not storing anything at all (i.e. returning size) > [] should happen only when nothing could be copied. In other words, you don't > [] have to squeeze as much as possible - it is allowed, but not necessary. > > When fault_in_writeable() tells us that an address range is accessible > in principle, that doesn't mean that copy_to_user() will allow us to > access it in arbitrary chunks. It's also not the case that > fault_in_writeable(addr, size) is always followed by > copy_to_user(addr, ..., size) for the exact same address range, not > even in this case. > > These alignment restrictions have nothing to do with page or sub-page faults. > > I'm also fairly sure that passing in an unaligned buffer will send > search_ioctl into an endless loop on architectures with copy_to_user() > alignment restrictions; there don't seem to be any buffer alignment > checks. Let me retract that ... The description in include/linux/uaccess.h leaves out permissible reasons for fetching/storing less than requested. Thinking about it, if the address range passed to one of the copy functions includes an address that faults, it kind of makes sense to allow the copy function to stop short instead of copying every last byte right up to the address that fails. If that's the only reason, then it would be great to have that included in the description. And then we can indeed deal with the alignment effects in fault_in_writeable(). > > > copy_to_user_nofault_unaligned() should be further optimized, maybe as > > > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture > > > depending on the actual alignment rules; I'm not sure. > > [...] 
> > > --- a/fs/btrfs/ioctl.c > > > +++ b/fs/btrfs/ioctl.c > > > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key, > > > return 1; > > > } > > > > > > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size) > > > +{ > > > + size_t rest = copy_to_user_nofault(to, from, size); > > > + > > > + if (rest) { > > > + size_t n; > > > + > > > + for (n = size - rest; n < size; n++) { > > > + if (copy_to_user_nofault(to + n, from + n, 1)) > > > + break; > > > + } > > > + rest = size - n; > > > + } > > > + return rest; > > > > That's what I was trying to avoid. That's basically a fall-back to byte > > at a time copy (we do this in copy_mount_options(); at some point we > > even had a copy_from_user_exact() IIRC). > > We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned. > It's just not clear that it's worth it. > > > Linus' idea (if I got it correctly) was instead to slightly extend the > > probing in fault_in_writeable() for the beginning of the buffer from 1 > > byte to some per-arch range. > > > > I attempted the above here and works ok: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix > > > > but too late to post it this evening, I'll do it in the next day or so > > as an alternative to this series. I've taken a quick look. Under the assumption that alignment effects are tied to page / sub-page faults, I think we can really solve this generically as Willy has proposed. Maybe as shown below; no need for arch-specific code. Thanks, Andreas diff --git a/mm/gup.c b/mm/gup.c index 2c51e9748a6a..a9b3d916b625 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, } #endif /* !CONFIG_MMU */ +#define SUBPAGE_FAULT_SIZE 16 + /** * fault_in_writeable - fault in userspace address range for writing * @uaddr: start of address range @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) if (unlikely(size == 0)) return 0; if (!PAGE_ALIGNED(uaddr)) { + if (SUBPAGE_FAULT_SIZE && + !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) { + end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE); + if (end - uaddr < size) { + if (unlikely(__put_user(0, uaddr) != 0)) + return size; + uaddr = end; + if (unlikely(!end)) + goto out; + } + } if (unlikely(__put_user(0, uaddr) != 0)) - return size; + goto out; uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); } end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
On Sat, Nov 27, 2021 at 04:52:16AM +0100, Andreas Gruenbacher wrote: > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote: > > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > As per Linus' reply, we can work around this by doing > > > > a sub-page fault_in_writable(point_of_failure, align) where 'align' > > > > should cover the copy_to_user() impreciseness. > > > > > > > > (of course, fault_in_writable() takes the full size argument but behind > > > > the scene it probes the 'align' prefix at sub-page fault granularity) > > > > > > That doesn't make sense; we don't want fault_in_writable() to fail or > > > succeed depending on the alignment of the address range passed to it. > > > > If we know that the arch copy_to_user() has an error of say maximum 16 > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable() > > to probe the first 16 bytes rather than 1. > > That isn't going to help one bit: Not on its own but it does allow the restarted loop to use fault_in_writeable() on the address where copy_to_user() stopped, without the need to be byte-precise in the latter. > [raw_]copy_to_user() is allowed to > copy as little or as much as it wants as long as it follows the rules > documented in include/linux/uaccess.h: > > [] If copying succeeds, the return value must be 0. If some data cannot be > [] fetched, it is permitted to copy less than had been fetched; the only > [] hard requirement is that not storing anything at all (i.e. returning size) > [] should happen only when nothing could be copied. In other words, you don't > [] have to squeeze as much as possible - it is allowed, but not necessary. > > When fault_in_writeable() tells us that an address range is accessible > in principle, that doesn't mean that copy_to_user() will allow us to > access it in arbitrary chunks. Ignoring sub-page faults, my interpretation of the fault_in_writeable() semantics is that an arbitrary copy_to_user() within the faulted in range will *eventually* either succeed or the fault_in() fails. There are some theoretical live-lock conditions like a concurrent thread changing the permission (mprotect) in a way that fault_in() always succeeds and copy_to_user() always fails. Fortunately that's just theoretical. The above interpretation doesn't hold with sub-page faults because of the way fault_in_writeable() is probing - one byte per page. This series takes the big hammer approach of making the liveness assumption above work in the presence of sub-page faults. I'm fine with this since, from my tests so far, only the btrfs search_ioctl() is affected and fault_in_writeable() is not used anywhere else that matters (some highmem stuff we don't have on arm64). > It's also not the case that fault_in_writeable(addr, size) is always > followed by copy_to_user(addr, ..., size) for the exact same address > range, not even in this case. I agree, that's not a requirement. But there are some expectations of how the fault_in_writeable()/copy_to_user() pair is used, typically: a) pre-fault before the uaccess with the copy_to_user() within the range faulted in or b) copy_to_user() attempted with a subsequent fault_in_writeable() on the next address that the uaccess failed to write to. You can have a combination of the above but not completely disjoint ranges. 
For liveness properties, in addition, fault_in_writeable() needs to reproduce the fault conditions of the copy_to_user(). If your algorithm uses something like (a), you'd need to probe the whole range at sub-page granularity (this series). If you go for something like (b), either copy_to_user() is exact or fault_in_writeable() compensates for the uaccess inexactness. > These alignment restrictions have nothing to do with page or sub-page > faults. My point wasn't alignment faults (different set of problems, though on arm64 one needs a device memory type in user space). Let's say we have a user buffer: char mem[32]; and mem[0..15] has MTE tag 0, mem[16..31] has tag 1. On arm64 a copy_to_user(mem, kbuf, 32) succeeds in writing 16 bytes. However, a copy_to_user(mem + 8, kbuf, 24) only writes 1 byte even if 8 could have been written (that's in line with the uaccess requirements you quoted above). If we know for an arch the maximum delta between the reported copy_to_user() fault address and the real one (i.e. where a byte-at-a-time copy would fault), we can tweak fault_in_writeable() slightly to probe this prefix at sub-page granularity and bail out. No need for an exact copy_to_user(). > I'm also fairly sure that passing in an unaligned buffer will send > search_ioctl into an endless loop on architectures with copy_to_user() > alignment restrictions; there don't seem to be any buffer alignment > checks. On such architectures, copy_to_user() should take care of doing aligned writes. I don't think it's for the caller to guarantee anything here as it doesn't know what the underlying uaccess implementation does. On arm64, since the architecture can do unaligned writes to Normal memory, the uaccess optimises the read to be aligned and the write may be unaligned (write-combining in the hardware buffers sorts this out). Thanks.
On Sat, Nov 27, 2021 at 01:39:58PM +0100, Andreas Gruenbacher wrote: > On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > If we know that the arch copy_to_user() has an error of say maximum 16 > > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable() > > > to probe the first 16 bytes rather than 1. > > > > That isn't going to help one bit: [raw_]copy_to_user() is allowed to > > copy as little or as much as it wants as long as it follows the rules > > documented in include/linux/uaccess.h: > > > > [] If copying succeeds, the return value must be 0. If some data cannot be > > [] fetched, it is permitted to copy less than had been fetched; the only > > [] hard requirement is that not storing anything at all (i.e. returning size) > > [] should happen only when nothing could be copied. In other words, you don't > > [] have to squeeze as much as possible - it is allowed, but not necessary. > > > > When fault_in_writeable() tells us that an address range is accessible > > in principle, that doesn't mean that copy_to_user() will allow us to > > access it in arbitrary chunks. It's also not the case that > > fault_in_writeable(addr, size) is always followed by > > copy_to_user(addr, ..., size) for the exact same address range, not > > even in this case. > > > > These alignment restrictions have nothing to do with page or sub-page faults. > > > > I'm also fairly sure that passing in an unaligned buffer will send > > search_ioctl into an endless loop on architectures with copy_to_user() > > alignment restrictions; there don't seem to be any buffer alignment > > checks. > > Let me retract that ... > > The description in include/linux/uaccess.h leaves out permissible > reasons for fetching/storing less than requested. Thinking about it, if > the address range passed to one of the copy functions includes an > address that faults, it kind of makes sense to allow the copy function > to stop short instead of copying every last byte right up to the address > that fails. > > If that's the only reason, then it would be great to have that included > in the description. And then we can indeed deal with the alignment > effects in fault_in_writeable(). Ah, I started replying last night, sent it today without seeing your follow-up. > > > I attempted the above here and works ok: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix > > > > > > but too late to post it this evening, I'll do it in the next day or so > > > as an alternative to this series. > > I've taken a quick look. Under the assumption that alignment effects > are tied to page / sub-page faults, I think we can really solve this > generically as Willy has proposed. I think Willy's proposal stopped at the page boundary, it should go beyond. > Maybe as shown below; no need for arch-specific code. 
> > diff --git a/mm/gup.c b/mm/gup.c > index 2c51e9748a6a..a9b3d916b625 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, > } > #endif /* !CONFIG_MMU */ > > +#define SUBPAGE_FAULT_SIZE 16 > + > /** > * fault_in_writeable - fault in userspace address range for writing > * @uaddr: start of address range > @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) > if (unlikely(size == 0)) > return 0; > if (!PAGE_ALIGNED(uaddr)) { > + if (SUBPAGE_FAULT_SIZE && > + !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) { > + end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE); > + if (end - uaddr < size) { > + if (unlikely(__put_user(0, uaddr) != 0)) > + return size; > + uaddr = end; > + if (unlikely(!end)) > + goto out; > + } > + } > if (unlikely(__put_user(0, uaddr) != 0)) > - return size; > + goto out; > uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); > } > end = (char __user *)PAGE_ALIGN((unsigned long)start + size); That's similar, somehow, to the arch-specific probing in one of my patches: [1]. We could do the above if we can guarantee that the maximum error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes even if aligned: reads of large blocks are done in 4 * 16 loads, and if one of them fails e.g. because of the 16-byte sub-page fault, no write is done, hence such larger than 16 delta). If you want something in the generic fault_in_writeable(), we probably need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE increments. But I thought I'd rather keep this in the arch-specific code. Of course, the above fault_in_writeable() still needs the btrfs search_ioctl() counterpart to change the probing on the actual fault address or offset. In the general case (uaccess error margin larger), I'm not entirely convinced we can skip the check if PAGE_ALIGNED(uaddr). I should probably get this logic through CBMC (or TLA+), I can't think it through. Thanks. [1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=devel/btrfs-live-lock-fix&id=af7e96d9e9537d9f9cc014f388b7b2bb4a5bc343
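For reference, a sketch of the generic variant mentioned above — probing the first min(size, UACCESS_MAX_WRITE_ERROR) bytes in SUBPAGE_FAULT_SIZE steps before the existing page-granular loop. Both constants are assumed to be hypothetical per-arch values (e.g. 16 bytes on arm64 with MTE, 0 elsewhere), and the btrfs loop would still need the counterpart change to restart the fault-in at the reported failure point:

size_t fault_in_writeable(char __user *uaddr, size_t size)
{
	char __user *start = uaddr, *end;

	if (unlikely(size == 0))
		return 0;

	if (SUBPAGE_FAULT_SIZE) {
		/* cover the whole uaccess error margin at sub-page granularity */
		size_t prefix = min_t(size_t, size, UACCESS_MAX_WRITE_ERROR);
		size_t offset;

		for (offset = 0; offset < prefix; offset += SUBPAGE_FAULT_SIZE)
			if (unlikely(__put_user(0, uaddr + offset) != 0))
				return size - offset;
	}

	/* existing page-granular probing, unchanged */
	if (!PAGE_ALIGNED(uaddr)) {
		if (unlikely(__put_user(0, uaddr) != 0))
			return size;
		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
	}
	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
	if (unlikely(end < start))
		end = NULL;
	while (uaddr != end) {
		if (unlikely(__put_user(0, uaddr) != 0))
			goto out;
		uaddr += PAGE_SIZE;
	}

out:
	if (size > uaddr - start)
		return size - (uaddr - start);
	return 0;
}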
On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Sat, Nov 27, 2021 at 01:39:58PM +0100, Andreas Gruenbacher wrote: > > On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > > > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > If we know that the arch copy_to_user() has an error of say maximum 16 > > > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable() > > > > to probe the first 16 bytes rather than 1. > > > > > > That isn't going to help one bit: [raw_]copy_to_user() is allowed to > > > copy as little or as much as it wants as long as it follows the rules > > > documented in include/linux/uaccess.h: > > > > > > [] If copying succeeds, the return value must be 0. If some data cannot be > > > [] fetched, it is permitted to copy less than had been fetched; the only > > > [] hard requirement is that not storing anything at all (i.e. returning size) > > > [] should happen only when nothing could be copied. In other words, you don't > > > [] have to squeeze as much as possible - it is allowed, but not necessary. > > > > > > When fault_in_writeable() tells us that an address range is accessible > > > in principle, that doesn't mean that copy_to_user() will allow us to > > > access it in arbitrary chunks. It's also not the case that > > > fault_in_writeable(addr, size) is always followed by > > > copy_to_user(addr, ..., size) for the exact same address range, not > > > even in this case. > > > > > > These alignment restrictions have nothing to do with page or sub-page faults. > > > > > > I'm also fairly sure that passing in an unaligned buffer will send > > > search_ioctl into an endless loop on architectures with copy_to_user() > > > alignment restrictions; there don't seem to be any buffer alignment > > > checks. > > > > Let me retract that ... > > > > The description in include/linux/uaccess.h leaves out permissible > > reasons for fetching/storing less than requested. Thinking about it, if > > the address range passed to one of the copy functions includes an > > address that faults, it kind of makes sense to allow the copy function > > to stop short instead of copying every last byte right up to the address > > that fails. > > > > If that's the only reason, then it would be great to have that included > > in the description. And then we can indeed deal with the alignment > > effects in fault_in_writeable(). > > Ah, I started replying last night, sent it today without seeing your > follow-up. > > > > > I attempted the above here and works ok: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix > > > > > > > > but too late to post it this evening, I'll do it in the next day or so > > > > as an alternative to this series. > > > > I've taken a quick look. Under the assumption that alignment effects > > are tied to page / sub-page faults, I think we can really solve this > > generically as Willy has proposed. > > I think Willy's proposal stopped at the page boundary, it should go > beyond. > > > Maybe as shown below; no need for arch-specific code. 
> > > > diff --git a/mm/gup.c b/mm/gup.c > > index 2c51e9748a6a..a9b3d916b625 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, > > } > > #endif /* !CONFIG_MMU */ > > > > +#define SUBPAGE_FAULT_SIZE 16 > > + > > /** > > * fault_in_writeable - fault in userspace address range for writing > > * @uaddr: start of address range > > @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size) > > if (unlikely(size == 0)) > > return 0; > > if (!PAGE_ALIGNED(uaddr)) { > > + if (SUBPAGE_FAULT_SIZE && > > + !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) { > > + end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE); > > + if (end - uaddr < size) { > > + if (unlikely(__put_user(0, uaddr) != 0)) > > + return size; > > + uaddr = end; > > + if (unlikely(!end)) > > + goto out; > > + } > > + } > > if (unlikely(__put_user(0, uaddr) != 0)) > > - return size; > > + goto out; > > uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); > > } > > end = (char __user *)PAGE_ALIGN((unsigned long)start + size); > > That's similar, somehow, to the arch-specific probing in one of my > patches: [1]. We could do the above if we can guarantee that the maximum > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes > even if aligned: reads of large blocks are done in 4 * 16 loads, and if > one of them fails e.g. because of the 16-byte sub-page fault, no write > is done, hence such larger than 16 delta). > > If you want something in the generic fault_in_writeable(), we probably > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE > increments. But I thought I'd rather keep this in the arch-specific code. I see, that's even crazier than I'd thought. The looping / probing is still pretty generic, so I'd still consider putting it in the generic code. We also still have fault_in_safe_writeable which is more difficult to fix, and fault_in_readable which we don't want to leave behind broken, either. > Of course, the above fault_in_writeable() still needs the btrfs > search_ioctl() counterpart to change the probing on the actual fault > address or offset. Yes, but that change is relatively simple and it eliminates the need for probing the entire buffer, so it's a good thing. Maybe you want to add this though: --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode, unsigned long sk_offset = 0; - char __user *fault_in_addr; + char __user *fault_in_addr, *end; @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode, fault_in_addr = ubuf; + end = ubuf + *buf_size; while (1) { ret = -EFAULT; - if (fault_in_writeable(fault_in_addr, - *buf_size - (fault_in_addr - ubuf))) + if (fault_in_writeable(fault_in_addr, end - fault_in_addr)) break; > In the general case (uaccess error margin larger), I'm not entirely > convinced we can skip the check if PAGE_ALIGNED(uaddr). Yes, the loop can span multiple sub-page error domains, at least in the read case, so it needs to happen even for page-aligned addresses. > I should probably get this logic through CBMC (or TLA+), I can't think it > through. > > Thanks. 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=devel/btrfs-live-lock-fix&id=af7e96d9e9537d9f9cc014f388b7b2bb4a5bc343 > > -- > Catalin > Thanks, Andreas
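The retry pattern Andreas is relying on, restarting both the fault-in and the copy from the exact offset where the previous attempt stopped, can be written as a small self-contained helper. copy_struct_with_faultin() is a made-up name, and the sketch leans on the progress guarantee quoted earlier in the thread (a copy restarted from the faulting address copies at least a little more):

/*
 * Hypothetical helper illustrating the fault-in + nofault-copy retry
 * loop.  Both the fault-in and the copy restart at "done", the offset
 * where the previous attempt stopped, so each iteration moves forward as
 * long as the architecture makes progress when restarted from the
 * faulting address.  search_ioctl() cannot do this directly because it
 * rewinds sk_offset to a record boundary, which is the crux of the bug.
 */
#include <linux/pagemap.h>
#include <linux/uaccess.h>

static int copy_struct_with_faultin(char __user *ubuf, const char *kbuf,
				    size_t size)
{
	size_t done = 0;

	while (done < size) {
		size_t left;

		if (fault_in_writeable(ubuf + done, size - done))
			return -EFAULT;

		pagefault_disable();
		left = copy_to_user(ubuf + done, kbuf + done, size - done);
		pagefault_enable();

		/* progress is measured from where the copy stopped */
		done += (size - done) - left;
	}
	return 0;
}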
On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote: > On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > That's similar, somehow, to the arch-specific probing in one of my > > patches: [1]. We could do the above if we can guarantee that the maximum > > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For > > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever > > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes > > even if aligned: reads of large blocks are done in 4 * 16 loads, and if > > one of them fails e.g. because of the 16-byte sub-page fault, no write > > is done, hence such larger than 16 delta). > > > > If you want something in the generic fault_in_writeable(), we probably > > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE > > increments. But I thought I'd rather keep this in the arch-specific code. > > I see, that's even crazier than I'd thought. The looping / probing is > still pretty generic, so I'd still consider putting it in the generic > code. In the arm64 probe_subpage_user_writeable(), the loop is skipped if !system_supports_mte() (static label). It doesn't make much difference for search_ioctl() in terms of performance but I'd like the arch code to dynamically decide when to probe. An arch_has_subpage_faults() static inline function would solve this. However, the above assumes that the only way of probing is by doing a get_user/put_user(). A non-destructive probe with MTE would be to read the actual tags in memory and compare them with the top byte of the pointer. There's the CHERI architecture as well. Although very early days for arm64, we do have an incipient port (https://www.morello-project.org/). The void __user * pointers are propagated inside the kernel as 128-bit capabilities. A fault_in() would check whether the address (bottom 64-bit) is within the range and permissions specified in the upper 64-bit of the capability. There is no notion of sub-page fault granularity here and no need to do a put_user() as the check is just done on the pointer/capability. Given the above, my preference is to keep the probing arch-specific. > We also still have fault_in_safe_writeable which is more difficult to > fix, and fault_in_readable which we don't want to leave behind broken, > either. fault_in_safe_writeable() can be done by using get_user() instead of put_user() for arm64 MTE and probably SPARC ADI (an alternative is to read the in-memory tags and compare them with the pointer). For CHERI, that's different again since the capability encodes the read/write permissions independently. However, do we actually want to change the fault_in_safe_writeable() and fault_in_readable() functions at this stage? I could not get any of them to live-lock, though I only tried btrfs, ext4 and gfs2. As per the earlier discussion, normal files accesses are guaranteed to make progress. The only problematic one was O_DIRECT which seems to be alright for the above filesystems (the fs either bails out after several attempts or uses GUP to read which skips the uaccess altogether). Happy to address them if there is a real concern, I just couldn't trigger it. > > Of course, the above fault_in_writeable() still needs the btrfs > > search_ioctl() counterpart to change the probing on the actual fault > > address or offset. > > Yes, but that change is relatively simple and it eliminates the need > for probing the entire buffer, so it's a good thing. 
Maybe you want to > add this though: > > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode, > unsigned long sk_offset = 0; > - char __user *fault_in_addr; > + char __user *fault_in_addr, *end; > > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode, > fault_in_addr = ubuf; > + end = ubuf + *buf_size; > while (1) { > ret = -EFAULT; > - if (fault_in_writeable(fault_in_addr, > - *buf_size - (fault_in_addr - ubuf))) > + if (fault_in_writeable(fault_in_addr, end - fault_in_addr)) > break; Thanks, I'll add it.
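One possible shape for keeping the probing arch-specific, along the lines of the arch_has_subpage_faults() idea above: a no-op generic fallback plus a page-granular fault-in followed by an arch probe. The names mirror the thread; the return convention (bytes at the end of the range that failed the sub-page check) and the code itself are assumptions for illustration.

/*
 * Illustrative sketch of an arch opt-in for sub-page fault probing.
 * Architectures without sub-page faults (or with CHERI-style checks done
 * purely on the pointer/capability) keep the no-op fallbacks.
 */
#include <linux/pagemap.h>
#include <linux/uaccess.h>

#ifndef arch_has_subpage_faults
static inline bool arch_has_subpage_faults(void)
{
	return false;
}
#endif

#ifndef probe_subpage_writeable
/* returns how many trailing bytes of the range failed the sub-page check */
static inline size_t probe_subpage_writeable(char __user *uaddr, size_t size)
{
	return 0;
}
#endif

size_t fault_in_subpage_writeable(char __user *uaddr, size_t size)
{
	size_t faulted_in;

	/* page-granular fault-in first, exactly as today */
	faulted_in = size - fault_in_writeable(uaddr, size);

	/* then let the arch check the faulted-in range for sub-page faults */
	if (faulted_in && arch_has_subpage_faults())
		faulted_in -= probe_subpage_writeable(uaddr, faulted_in);

	return size - faulted_in;
}

On arm64 the override could simply gate its probing loop on system_supports_mte(), the static key mentioned above.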
On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote: > > On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > That's similar, somehow, to the arch-specific probing in one of my > > > patches: [1]. We could do the above if we can guarantee that the maximum > > > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For > > > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever > > > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes > > > even if aligned: reads of large blocks are done in 4 * 16 loads, and if > > > one of them fails e.g. because of the 16-byte sub-page fault, no write > > > is done, hence such larger than 16 delta). > > > > > > If you want something in the generic fault_in_writeable(), we probably > > > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE > > > increments. But I thought I'd rather keep this in the arch-specific code. > > > > I see, that's even crazier than I'd thought. The looping / probing is > > still pretty generic, so I'd still consider putting it in the generic > > code. > > In the arm64 probe_subpage_user_writeable(), the loop is skipped if > !system_supports_mte() (static label). It doesn't make much difference > for search_ioctl() in terms of performance but I'd like the arch code to > dynamically decide when to probe. An arch_has_subpage_faults() static > inline function would solve this. > > However, the above assumes that the only way of probing is by doing a > get_user/put_user(). A non-destructive probe with MTE would be to read > the actual tags in memory and compare them with the top byte of the > pointer. > > There's the CHERI architecture as well. Although very early days for > arm64, we do have an incipient port (https://www.morello-project.org/). > The void __user * pointers are propagated inside the kernel as 128-bit > capabilities. A fault_in() would check whether the address (bottom > 64-bit) is within the range and permissions specified in the upper > 64-bit of the capability. There is no notion of sub-page fault > granularity here and no need to do a put_user() as the check is just > done on the pointer/capability. > > Given the above, my preference is to keep the probing arch-specific. > > > We also still have fault_in_safe_writeable which is more difficult to > > fix, and fault_in_readable which we don't want to leave behind broken, > > either. > > fault_in_safe_writeable() can be done by using get_user() instead of > put_user() for arm64 MTE and probably SPARC ADI (an alternative is to > read the in-memory tags and compare them with the pointer). So we'd keep the existing fault_in_safe_writeable() logic for the actual fault-in and use get_user() to check for sub-page faults? If so, then that should probably also be hidden in arch code. > For CHERI, that's different again since the capability > encodes the read/write permissions independently. > > However, do we actually want to change the fault_in_safe_writeable() and > fault_in_readable() functions at this stage? I could not get any of them > to live-lock, though I only tried btrfs, ext4 and gfs2. As per the > earlier discussion, normal files accesses are guaranteed to make > progress.
The only problematic one was O_DIRECT which seems to be > alright for the above filesystems (the fs either bails out after several > attempts or uses GUP to read which skips the uaccess altogether). Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress is guaranteed because failures are at a byte granularity. O_DIRECT reads and writes happen in device block size granularity, but the pages are grabbed with get_user_pages() before the copying happens. So by the time the copying happens, the pages are guaranteed to be resident, and we don't need to loop around fault_in_*(). You've mentioned before that copying to/from struct page bypasses sub-page fault checking. If that is the case, then the checking probably needs to happen in iomap_dio_bio_iter and dio_refill_pages instead. > Happy to address them if there is a real concern, I just couldn't trigger it. Hopefully it should now be clear why you couldn't. One way of reproducing with fault_in_safe_writeable() would be to use that in btrfs instead of fault_in_writeable(), of course. We're not doing any chunked reads from user space with page faults disabled as far as I'm aware right now, so we probably don't have a reproducer for fault_in_readable(). It would still be worth fixing fault_in_readable() to prevent things from blowing up very unexpectedly later, though. Thanks, Andreas > > > Of course, the above fault_in_writeable() still needs the btrfs > > > search_ioctl() counterpart to change the probing on the actual fault > > > address or offset. > > > > Yes, but that change is relatively simple and it eliminates the need > > for probing the entire buffer, so it's a good thing. Maybe you want to > > add this though: > > > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode, > > unsigned long sk_offset = 0; > > - char __user *fault_in_addr; > > + char __user *fault_in_addr, *end; > > > > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode, > > fault_in_addr = ubuf; > > + end = ubuf + *buf_size; > > while (1) { > > ret = -EFAULT; > > - if (fault_in_writeable(fault_in_addr, > > - *buf_size - (fault_in_addr - ubuf))) > > + if (fault_in_writeable(fault_in_addr, end - fault_in_addr)) > > break; > > Thanks, I'll add it. > > -- > Catalin >
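For the fault_in_safe_writeable() case discussed above, the read-based check could live entirely in an arch helper along these lines. The granule size, the helper name and the system_supports_mte() gating are taken from the discussion, but the code is only a sketch, not the actual arm64 patch:

/*
 * Illustrative arm64-flavoured probe: one load per sub-page granule.  A
 * load faults on an MTE tag mismatch just like a store would, so this
 * checks the buffer at tag granularity without writing to it, which is
 * what the "safe" variant requires.
 */
#include <linux/uaccess.h>

#define SUBPAGE_FAULT_SIZE	16

static size_t probe_subpage_safe_writeable(const char __user *uaddr,
					    size_t size)
{
	const char __user *p = uaddr, *end = uaddr + size;
	volatile char c;

	if (!system_supports_mte())	/* skip the loop via the static key */
		return 0;

	while (p < end) {
		if (unlikely(__get_user(c, p)))
			return end - p;
		p = PTR_ALIGN(p + 1, SUBPAGE_FAULT_SIZE);
	}
	return 0;
}

A generic fault_in_safe_writeable() wrapper could then subtract whatever this reports from the range faulted in by the existing GUP-based walk, along the lines of the earlier sketch.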
On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote: > Maybe you want to add this though: > > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode, > unsigned long sk_offset = 0; > - char __user *fault_in_addr; > + char __user *fault_in_addr, *end; > > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode, > fault_in_addr = ubuf; > + end = ubuf + *buf_size; > while (1) { > ret = -EFAULT; > - if (fault_in_writeable(fault_in_addr, > - *buf_size - (fault_in_addr - ubuf))) > + if (fault_in_writeable(fault_in_addr, end - fault_in_addr)) > break; It looks like *buf_size is updated inside copy_to_sk(), so I'll move the end update inside the loop.
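Putting the pieces from the last few messages together, the search_ioctl() loop would presumably end up looking something like this simplified fragment, with the end pointer recomputed every iteration because copy_to_sk() can shrink *buf_size, and the fault-in restarted at the offset where the next copy attempt will resume. btrfs error handling is elided, so treat it as a sketch rather than the eventual patch:

	fault_in_addr = ubuf;
	while (1) {
		char __user *end = ubuf + *buf_size;

		ret = -EFAULT;
		if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
			break;

		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
		if (ret)
			break;

		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
				 &sk_offset, &num_found);
		btrfs_release_path(path);
		if (ret != -EFAULT)
			break;

		/* probe again from where the next copy attempt will restart */
		fault_in_addr = ubuf + sk_offset;
	}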
On Mon, Nov 29, 2021 at 02:33:42PM +0100, Andreas Gruenbacher wrote: > On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote: > > > We also still have fault_in_safe_writeable which is more difficult to > > > fix, and fault_in_readable which we don't want to leave behind broken, > > > either. > > > > fault_in_safe_writeable() can be done by using get_user() instead of > > put_user() for arm64 MTE and probably SPARC ADI (an alternative is to > > read the in-memory tags and compare them with the pointer). > > So we'd keep the existing fault_in_safe_writeable() logic for the > actual fault-in and use get_user() to check for sub-page faults? If > so, then that should probably also be hidden in arch code. That's what this series does when it probes the whole range in fault_in_writeable(). The main reason was that it's more efficient to do a read than a write on a large range (the latter dirtying the cache lines). > > For CHERI, that's different again since the capability > > encodes the read/write permissions independently. > > > > However, do we actually want to change the fault_in_safe_writeable() and > > fault_in_readable() functions at this stage? I could not get any of them > > to live-lock, though I only tried btrfs, ext4 and gfs2. As per the > > earlier discussion, normal files accesses are guaranteed to make > > progress. The only problematic one was O_DIRECT which seems to be > > alright for the above filesystems (the fs either bails out after several > > attempts or uses GUP to read which skips the uaccess altogether). > > Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress > is guaranteed because failures are at a byte granularity. > > O_DIRECT reads and writes happen in device block size granularity, but > the pages are grabbed with get_user_pages() before the copying > happens. So by the time the copying happens, the pages are guaranteed > to be resident, and we don't need to loop around fault_in_*(). For file reads, I couldn't trigger any mismatched tag faults with gfs2 and O_DIRECT, so it matches your description above. For file writes it does trigger such faults, so I suspect it doesn't always use get_user_pages() for writes. No live-lock though with the vanilla kernel. My test uses a page with some mismatched tags in the middle. ext4: no tag faults with O_DIRECT read/write irrespective of whether the user buffer is page aligned or not. btrfs: O_DIRECT file writes - no faults on page-aligned buffers, faults on unaligned; file reads - tag faults on both aligned/unaligned buffers. No live-lock. So, some tag faults still happen even with O_DIRECT|O_SYNC but the filesystems take care of continuous faulting. > You've mentioned before that copying to/from struct page bypasses > sub-page fault checking. If that is the case, then the checking > probably needs to happen in iomap_dio_bio_iter and dio_refill_pages > instead. It's too expensive and not really worth it. With a buffered access, the uaccess takes care of checking at the time of load/store (the hardware does this for us). With a GUP, the access is done via the kernel mapping with a match-all tag to avoid faults (kernel panic). We set the ABI expectation some time ago that kernel accesses to user memory may not always be tag-checked if the access is done via a GUP'ed page. > > Happy to address them if there is a real concern, I just couldn't trigger it.
> > Hopefully it should now be clear why you couldn't. One way of > reproducing with fault_in_safe_writeable() would be to use that in > btrfs instead of fault_in_writeable(), of course. Yes, that would trigger it again. I guess if we want to make this API safer in general, we can add the checks to the other functions. Only probing a few bytes at the start shouldn't cause a performance issue. Thanks.
On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > That's what this series does when it probes the whole range in > fault_in_writeable(). The main reason was that it's more efficient to do > a read than a write on a large range (the latter dirtying the cache > lines). The more this thread goes on, the more I'm starting to think that we should just make "fault_in_writable()" (and readable, of course) only really work on the beginning of the area. Not just for the finer-granularity pointer color probing, but for the page probing too. I'm looking at our current fault_in_writeable(), and I'm going (a) it uses __put_user() without range checks, which is really not great (b) it looks like a disaster from another standpoint: essentially user-controlled loop size with no limit checking, no preemption, and no check for fatal signals. Now, (a) should be fixed with an access_ok() or similar. And (b) can easily be fixed multiple ways, with one option simply just being adding a cond_resched() call and checking for fatal signals. But faulting in the whole region is actually fundamentally wrong in low-memory situations - the beginning of the region might be swapped out by the time we get to the end. That's unlikely to be a problem in real life, but it's an example of how it's simply not conceptually sensible. So I do wonder why we don't just say "fault_in_writable will fault in _at_most_ X bytes", and simply limit the actual fault-in size to something reasonable. That solves _all_ the problems. It solves the lack of preemption and fatal signals (by virtue of just limiting the amount of work we do). It solves the low memory situation. And it solves the "excessive dirty cachelines" case too. Of course, we want to have some minimum bytes we fault in too, but that minimum range might well be "we guarantee at least a full page worth of data" (and in practice make it a couple of pages). It's not like fault_in_writeable() avoids page faults or anything like that - it just moves them around. So there's really very little reason to fault in a large range, and there are multiple reasons _not_ to do it. Hmm? Linus
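A rough sketch of the bounded version being suggested here, with the access_ok() check, a fatal-signal bailout and cond_resched() folded in; the limit constant and the helper name are invented for illustration and the return value is relative to the clamped range:

/*
 * Illustrative only: fault in at most FAULT_IN_MAX bytes per call and
 * let callers loop, retrying from wherever their copy stopped.
 */
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/uaccess.h>

#define FAULT_IN_MAX	(2 * PAGE_SIZE)

size_t fault_in_writeable_bounded(char __user *uaddr, size_t size)
{
	char __user *start = uaddr, *end;

	size = min_t(size_t, size, FAULT_IN_MAX);
	if (unlikely(size == 0))
		return 0;
	if (!access_ok(uaddr, size))
		return size;

	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
	while (uaddr < end) {
		if (fatal_signal_pending(current))
			break;
		if (unlikely(__put_user(0, uaddr) != 0))
			break;
		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr + 1);
		cond_resched();
	}

	if (size > (size_t)(uaddr - start))
		return size - (uaddr - start);
	return 0;
}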
On Mon, Nov 29, 2021 at 7:41 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > That's what this series does when it probes the whole range in > > fault_in_writeable(). The main reason was that it's more efficient to do > > a read than a write on a large range (the latter dirtying the cache > > lines). > > The more this thread goes on, the more I'm starting to think that we > should just make "fault_in_writable()" (and readable, of course) only > really work on the beginning of the area. > > Not just for the finer-granularity pointer color probing, but for the > page probing too. > > I'm looking at our current fault_in_writeable(), and I'm going > > (a) it uses __put_user() without range checks, which is really not great > > (b) it looks like a disaster from another standpoint: essentially > user-controlled loop size with no limit checking, no preemption, and > no check for fatal signals. > > Now, (a) should be fixed with a access_ok() or similar. > > And (b) can easily be fixed multiple ways, with one option simply just > being adding a can_resched() call and checking for fatal signals. > > But faulting in the whole region is actually fundamentally wrong in > low-memory situations - the beginning of the region might be swapped > out by the time we get to the end. That's unlikely to be a problem in > real life, but it's an example of how it's simply not conceptually > sensible. > > So I do wonder why we don't just say "fault_in_writable will fault in > _at_most_ X bytes", and simply limit the actual fault-in size to > something reasonable. > > That solves _all_ the problems. It solves the lack of preemption and > fatal signals (by virtue of just limiting the amount of work we do). > It solves the low memory situation. And it solves the "excessive dirty > cachelines" case too. > > Of course, we want to have some minimum bytes we fault in too, but > that minimum range might well be "we guarantee at least a full page > worth of data" (and in practice make it a couple of pages). > > It's not like fault_in_writeable() avoids page faults or anything like > that - it just moves them around. So there's really very little reason > to fault in a large range, and there are multiple reasons _not_ to do > it. > > Hmm? This would mean that we could get rid of gfs2's should_fault_in_pages() logic, which is based on what's in btrfs_buffered_write(). Andreas > > Linus >
On Mon, Nov 29, 2021 at 10:40:38AM -0800, Linus Torvalds wrote: > On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > That's what this series does when it probes the whole range in > > fault_in_writeable(). The main reason was that it's more efficient to do > > a read than a write on a large range (the latter dirtying the cache > > lines). > > The more this thread goes on, the more I'm starting to think that we > should just make "fault_in_writable()" (and readable, of course) only > really work on the beginning of the area. > > Not just for the finer-granularity pointer color probing, but for the > page probing too. I have patches for the finer-granularity checking of the beginning of the buffer. They need a bit of testing, so probably posting them tomorrow. > I'm looking at our current fault_in_writeable(), and I'm going > > (a) it uses __put_user() without range checks, which is really not great For arm64 at least __put_user() does the access_ok() check. I thought only unsafe_put_user() should skip the checks. If __put_user() can write arbitrary memory, we may have a bigger problem. > (b) it looks like a disaster from another standpoint: essentially > user-controlled loop size with no limit checking, no preemption, and > no check for fatal signals. Indeed, the fault_in_*() loop can get pretty long, bounded by how much memory can be faulted in the user process. My patches for now only address the outer loop doing the copy_to_user() as that can be unbounded. > Now, (a) should be fixed with a access_ok() or similar. > > And (b) can easily be fixed multiple ways, with one option simply just > being adding a can_resched() call and checking for fatal signals. > > But faulting in the whole region is actually fundamentally wrong in > low-memory situations - the beginning of the region might be swapped > out by the time we get to the end. That's unlikely to be a problem in > real life, but it's an example of how it's simply not conceptually > sensible. > > So I do wonder why we don't just say "fault_in_writable will fault in > _at_most_ X bytes", and simply limit the actual fault-in size to > something reasonable. > > That solves _all_ the problems. It solves the lack of preemption and > fatal signals (by virtue of just limiting the amount of work we do). > It solves the low memory situation. And it solves the "excessive dirty > cachelines" case too. I think that would be useful, though it doesn't solve the potential livelock with sub-page faults. We still need the outer loop to handle the copy_to_user() for the whole user buffer and the sub-page probing of the beginning of such buffer (or whenever copy_to_user() failed). IOW, you still fault in the whole buffer eventually. Anyway, I think the sub-page probing and limiting the fault-in are complementary improvements.
On Mon, Nov 29, 2021 at 12:56 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > For arm64 at least __put_user() does the access_ok() check. I thought > only unsafe_put_user() should skip the checks. If __put_user() can write > arbitrary memory, we may have a bigger problem. That's literally the historical difference between __put_user() and put_user() - the access check. > I think that would be useful, though it doesn't solve the potential > livelock with sub-page faults. I was assuming we'd just do the sub-page faults. In fact, I was assuming we'd basically just replace all the PAGE_ALIGN and PAGE_SIZE with SUBPAGE_{ALIGN,SIZE}, together with something like if (size > PAGE_SIZE) size = PAGE_SIZE; to limit that size thing (or possibly make that "min size" be a parameter, so that people who have things like that "I need at least this initial structure to be copied" issue can document their minimum size needs). Linus
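Spelled out, the SUBPAGE_{ALIGN,SIZE} version with a caller-supplied minimum might look like the sketch below. SUBPAGE_SIZE and SUBPAGE_ALIGN() stand in for arch-provided definitions and the extra min_size parameter is only what the parenthetical above hints at, so treat this as an illustration rather than a proposal:

/*
 * Illustrative sketch: probe at sub-page granularity, fault in at most a
 * couple of pages, but never less than min_size (and never beyond the
 * caller's buffer).
 */
#include <linux/mm.h>
#include <linux/uaccess.h>

#define SUBPAGE_SIZE		16UL
#define SUBPAGE_ALIGN(x)	ALIGN((x), SUBPAGE_SIZE)

size_t fault_in_writeable_min(char __user *uaddr, size_t size, size_t min_size)
{
	char __user *start = uaddr, *end;
	size_t limit = max_t(size_t, min_size, 2 * PAGE_SIZE);

	size = min_t(size_t, size, limit);
	if (unlikely(size == 0))
		return 0;
	if (!access_ok(uaddr, size))
		return size;

	end = (char __user *)SUBPAGE_ALIGN((unsigned long)start + size);
	while (uaddr < end) {
		if (unlikely(__put_user(0, uaddr) != 0))
			break;
		uaddr = (char __user *)SUBPAGE_ALIGN((unsigned long)uaddr + 1);
	}

	if (size > (size_t)(uaddr - start))
		return size - (uaddr - start);
	return 0;
}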
On Mon, Nov 29, 2021 at 01:53:01PM -0800, Linus Torvalds wrote: > On Mon, Nov 29, 2021 at 12:56 PM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > I think that would be useful, though it doesn't solve the potential > > livelock with sub-page faults. > > I was assuming we'd just do the sub-page faults. > > In fact, I was assuming we'd basically just replace all the PAGE_ALIGN > and PAGE_SIZE with SUBPAGE_{ALIGN,SIZE}, together with something like > > if (size > PAGE_SIZE) > size = PAGE_SIZE; > > to limit that size thing (or possibly make that "min size" be a > parameter, so that people who have things like that "I need at least > this initial structure to be copied" issue can document their minimum > size needs). Ah, so fault_in_writeable() would never fault in the whole range (if too large). When copy_to_user() goes beyond the faulted in range, it may fail and we go back to fault in a bit more of the range. A copy loop would be equivalent to:

	fault_addr = ubuf;
	end = ubuf + size;
	while (1) {
		if (fault_in_writeable(fault_addr, min(PAGE_SIZE, end - fault_addr)))
			break;
		left = copy_to_user(ubuf, kbuf, size);
		if (!left)
			break;
		fault_addr = end - left;
	}

That should work. I'll think about it tomorrow, getting late over here. (I may still keep the sub-page probing in the arch code, see my earlier exchanges with Andreas)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 92138ac2a4e2..23167c72fa47 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
 
 	while (1) {
 		ret = -EFAULT;
-		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
+		if (fault_in_exact_writeable(ubuf + sk_offset,
+					     *buf_size - sk_offset))
 			break;
 
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
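fault_in_exact_writeable() is the helper introduced earlier in this series. A purely generic approximation of the guarantee it is meant to provide (every sub-page granule of the buffer has been successfully written once, so a later copy_to_user() cannot keep faulting at an unprobed location) is sketched below; the real implementation hooks into arch code, so the granule size and the body here are assumptions.

#include <linux/mm.h>
#include <linux/uaccess.h>

#define SUBPAGE_FAULT_SIZE	16	/* assumed granule, e.g. one MTE tag granule */

size_t fault_in_exact_writeable(char __user *uaddr, size_t size)
{
	char __user *end = uaddr + size;

	if (unlikely(size == 0))
		return 0;
	if (!access_ok(uaddr, size))
		return size;

	/* touch one byte in every granule that overlaps the buffer */
	while (uaddr < end) {
		if (unlikely(__put_user(0, uaddr) != 0))
			return end - uaddr;
		uaddr = PTR_ALIGN(uaddr + 1, SUBPAGE_FAULT_SIZE);
	}
	return 0;
}

This is also why the rest of the thread worries about cost: for a large buffer it touches every granule, which the alternative approaches discussed above try to avoid.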
Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl") addressed a lockdep warning by pre-faulting the user pages and attempting the copy_to_user_nofault() in an infinite loop. On architectures like arm64 with MTE, an access may fault within a page at a location different from what fault_in_writeable() probed. Since the sk_offset is rewound to the previous struct btrfs_ioctl_search_header boundary, there is no guaranteed forward progress and search_ioctl() may live-lock.

Use fault_in_exact_writeable() instead which probes the entire user buffer for faults at sub-page granularity.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
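To make the live-lock mechanism in the changelog concrete, here is an annotated walkthrough of the pre-series loop under MTE; the offsets are invented for illustration:

/*
 * 1. fault_in_writeable(ubuf + sk_offset, ...) writes one byte at
 *    ubuf + sk_offset; its tag happens to match, so the probe succeeds.
 * 2. copy_to_user_nofault() starts copying the next search header from
 *    ubuf + sk_offset, hits a granule with a mismatched tag a few bytes
 *    further in, and fails part-way through.
 * 3. copy_to_sk() leaves sk_offset at the header boundary, so the next
 *    iteration probes and copies exactly the same addresses.
 * 4. Back to step 1: no forward progress, search_ioctl() spins forever.
 *
 * Probing at sub-page granularity in step 1, as fault_in_exact_writeable()
 * does, turns step 1 into a failure and lets the ioctl return -EFAULT
 * instead of spinning.
 */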