Message ID | fd36cd900023955c763bd424c0895ae5828f68a0.1731979403.git.asml.silence@gmail.com (mailing list archive)
---|---
State | New
Series | [1/1] io_uring: prevent reg-wait speculations
On 11/18/24 6:29 PM, Pavel Begunkov wrote:
> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments
> for the waiting loop the user can specify an offset into a pre-mapped
> region of memory, in which case the
> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the
> argument.
>
> As we address a kernel array using a user given index, it'd be a subject
> to speculation type of exploits.
>
> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments")
> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions")
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  io_uring/io_uring.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index da8fd460977b..3a3e4fca1545 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
>  		     end > ctx->cq_wait_size))
>  		return ERR_PTR(-EFAULT);
>
> +	barrier_nospec();
>  	return ctx->cq_wait_arg + offset;

We need something better than that, barrier_nospec() is a big slow
hammer...
On 11/19/24 01:29, Pavel Begunkov wrote:
> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments
> for the waiting loop the user can specify an offset into a pre-mapped
> region of memory, in which case the
> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the
> argument.

Jann, do you mind taking a look? I hope there is some clever trick with
masks we can use instead of the barrier, it seems expensive. The byte
offset the user passes is 0-based and we add it to the base kernel
address:

	if (unlikely(check_add_overflow(offset, sizeof(struct ...), &end) ||
		     end > ctx->cq_wait_size))
		return ERR_PTR(-EFAULT);

	barrier_nospec();
	return ctx->cq_wait_arg + offset;

Here in particular we know the structure size, but I also wonder how to
do it right if the size is variable.

> As we address a kernel array using a user given index, it'd be a subject
> to speculation type of exploits.
>
> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments")
> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions")
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  io_uring/io_uring.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index da8fd460977b..3a3e4fca1545 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
>  		     end > ctx->cq_wait_size))
>  		return ERR_PTR(-EFAULT);
>
> +	barrier_nospec();
>  	return ctx->cq_wait_arg + offset;
> }
>
On 11/19/24 01:29, Jens Axboe wrote: > On 11/18/24 6:29 PM, Pavel Begunkov wrote: >> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments >> for the waiting loop the user can specify an offset into a pre-mapped >> region of memory, in which case the >> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the >> argument. >> >> As we address a kernel array using a user given index, it'd be a subject >> to speculation type of exploits. >> >> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") >> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >> --- >> io_uring/io_uring.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index da8fd460977b..3a3e4fca1545 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx, >> end > ctx->cq_wait_size)) >> return ERR_PTR(-EFAULT); >> >> + barrier_nospec(); >> return ctx->cq_wait_arg + offset; > > We need something better than that, barrier_nospec() is a big slow > hammer... Right, more of a discussion opener. I wonder if Jann can help here (see the other reply). I don't like back and forth like that, but if nothing works there is an option of returning back to reg-wait array indexes. Trivial to change, but then we're committing to not expanding the structure or complicating things if we do.
On 11/18/24 6:43 PM, Pavel Begunkov wrote: > On 11/19/24 01:29, Jens Axboe wrote: >> On 11/18/24 6:29 PM, Pavel Begunkov wrote: >>> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments >>> for the waiting loop the user can specify an offset into a pre-mapped >>> region of memory, in which case the >>> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the >>> argument. >>> >>> As we address a kernel array using a user given index, it'd be a subject >>> to speculation type of exploits. >>> >>> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") >>> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") >>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>> --- >>> io_uring/io_uring.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >>> index da8fd460977b..3a3e4fca1545 100644 >>> --- a/io_uring/io_uring.c >>> +++ b/io_uring/io_uring.c >>> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx, >>> end > ctx->cq_wait_size)) >>> return ERR_PTR(-EFAULT); >>> + barrier_nospec(); >>> return ctx->cq_wait_arg + offset; >> >> We need something better than that, barrier_nospec() is a big slow >> hammer... > > Right, more of a discussion opener. I wonder if Jann can help here > (see the other reply). I don't like back and forth like that, but if > nothing works there is an option of returning back to reg-wait array > indexes. Trivial to change, but then we're committing to not expanding > the structure or complicating things if we do. Then I think it should've been marked as a discussion point, because we definitely can't do this. Soliciting input is perfectly fine. And yeah, was thinking the same thing, if this is an issue then we just go back to indexing again. At least both the problem and solution is well known there. The original aa00f67adc2c0 just needed an array_index_nospec() and it would've been fine. Not a huge deal in terms of timing, either way. I suspect we can do something similar here, with just clamping the indexing offset. But let's hear what Jann thinks.
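For context, the index-based scheme Jens refers to would look roughly like the sketch below. This is only an illustration of clamping an array index with array_index_nospec(), not the code from aa00f67adc2c0; the cq_wait_nr/cq_wait_entries field names are assumed for the example.

```c
#include <linux/nospec.h>

/*
 * Illustrative sketch only: the user passes an array index rather than a
 * byte offset, and array_index_nospec() clamps it before it is used to
 * address the kernel array.  cq_wait_nr/cq_wait_entries are assumed names
 * for this example, not the real io_uring fields.
 */
static struct io_uring_reg_wait *io_get_reg_wait_idx(struct io_ring_ctx *ctx,
						     unsigned int idx)
{
	if (idx >= ctx->cq_wait_nr)
		return ERR_PTR(-EFAULT);

	/* clamps idx to [0, cq_wait_nr) even if the branch is mispredicted */
	idx = array_index_nospec(idx, ctx->cq_wait_nr);
	return &ctx->cq_wait_entries[idx];
}
```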
On 11/19/24 01:59, Jens Axboe wrote: > On 11/18/24 6:43 PM, Pavel Begunkov wrote: >> On 11/19/24 01:29, Jens Axboe wrote: >>> On 11/18/24 6:29 PM, Pavel Begunkov wrote: >>>> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments >>>> for the waiting loop the user can specify an offset into a pre-mapped >>>> region of memory, in which case the >>>> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the >>>> argument. >>>> >>>> As we address a kernel array using a user given index, it'd be a subject >>>> to speculation type of exploits. >>>> >>>> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") >>>> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") >>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>> --- >>>> io_uring/io_uring.c | 1 + >>>> 1 file changed, 1 insertion(+) >>>> >>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >>>> index da8fd460977b..3a3e4fca1545 100644 >>>> --- a/io_uring/io_uring.c >>>> +++ b/io_uring/io_uring.c >>>> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx, >>>> end > ctx->cq_wait_size)) >>>> return ERR_PTR(-EFAULT); >>>> + barrier_nospec(); >>>> return ctx->cq_wait_arg + offset; >>> >>> We need something better than that, barrier_nospec() is a big slow >>> hammer... >> >> Right, more of a discussion opener. I wonder if Jann can help here >> (see the other reply). I don't like back and forth like that, but if >> nothing works there is an option of returning back to reg-wait array >> indexes. Trivial to change, but then we're committing to not expanding >> the structure or complicating things if we do. > > Then I think it should've been marked as a discussion point, because we > definitely can't do this. Soliciting input is perfectly fine. And yeah, > was thinking the same thing, if this is an issue then we just go back to > indexing again. At least both the problem and solution is well known > there. The original aa00f67adc2c0 just needed an array_index_nospec() > and it would've been fine. > > Not a huge deal in terms of timing, either way. > > I suspect we can do something similar here, with just clamping the > indexing offset. But let's hear what Jann thinks. That what I hope for, but I can't say I entirely understand it. E.g. why can_do_masked_user_access() exists and guards mask_user_address(). IIRC, with invalid argument the mask turns the index into 0. A complete speculation from my side of how it works is that you then able to "inspect" or what's the right word the value of array[0] but not a address of memory of choice. Then in our case, considering that mappings are page sized, array_index_nospec() would clamp it to either first 32 bytes of the first page or to absolute addresses [0, 32) in case size==0 and the mapping is NULL. But that could be just my fantasy.
On Thu, Nov 21, 2024 at 1:12 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 11/19/24 01:59, Jens Axboe wrote: > > On 11/18/24 6:43 PM, Pavel Begunkov wrote: > >> On 11/19/24 01:29, Jens Axboe wrote: > >>> On 11/18/24 6:29 PM, Pavel Begunkov wrote: > >>>> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments > >>>> for the waiting loop the user can specify an offset into a pre-mapped > >>>> region of memory, in which case the > >>>> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the > >>>> argument. > >>>> > >>>> As we address a kernel array using a user given index, it'd be a subject > >>>> to speculation type of exploits. > >>>> > >>>> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") > >>>> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") > >>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > >>>> --- > >>>> io_uring/io_uring.c | 1 + > >>>> 1 file changed, 1 insertion(+) > >>>> > >>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > >>>> index da8fd460977b..3a3e4fca1545 100644 > >>>> --- a/io_uring/io_uring.c > >>>> +++ b/io_uring/io_uring.c > >>>> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx, > >>>> end > ctx->cq_wait_size)) > >>>> return ERR_PTR(-EFAULT); > >>>> + barrier_nospec(); > >>>> return ctx->cq_wait_arg + offset; > >>> > >>> We need something better than that, barrier_nospec() is a big slow > >>> hammer... > >> > >> Right, more of a discussion opener. I wonder if Jann can help here > >> (see the other reply). I don't like back and forth like that, but if > >> nothing works there is an option of returning back to reg-wait array > >> indexes. Trivial to change, but then we're committing to not expanding > >> the structure or complicating things if we do. > > > > Then I think it should've been marked as a discussion point, because we > > definitely can't do this. Soliciting input is perfectly fine. And yeah, > > was thinking the same thing, if this is an issue then we just go back to > > indexing again. At least both the problem and solution is well known > > there. The original aa00f67adc2c0 just needed an array_index_nospec() > > and it would've been fine. > > > > Not a huge deal in terms of timing, either way. > > > > I suspect we can do something similar here, with just clamping the > > indexing offset. But let's hear what Jann thinks. > > That what I hope for, but I can't say I entirely understand it. E.g. > why can_do_masked_user_access() exists and guards mask_user_address(). FWIW, my understanding is that x86-64 can do this because there is a really big hole in the virtual address space between userspace addresses and kernel addresses (over 2^63 bytes big); so if you check that the address at which you start accessing memory is in userspace, and then you access memory more or less linearly forward from there, you'll never reach kernelspace addresses. > IIRC, with invalid argument the mask turns the index into 0. A complete > speculation from my side of how it works is that you then able to > "inspect" or what's the right word the value of array[0] but not a > address of memory of choice. Yeah, exactly, that's the idea of array_index_nospec(). As the comment above the generic version of array_index_mask_nospec() describes, the mask used for the bitwise AND in array_index_nospec() is generated as follows: "When @index is out of bounds (@index >= @size), the sign bit will be set. 
Extend the sign bit to all bits and invert, giving a result of
zero for an out of bounds index, or ~0 if within bounds [0, @size)."

The X86 version of array_index_mask_nospec() just does the same with
optimized assembly code. But there are architectures, like arm64, that
actually do more than just this - arm64's array_index_mask_nospec()
additionally includes a csdb(), which is some arm64 barrier for some
kinds of speculation. (So, for example, open-coding a bitwise AND may
not be enough on all architectures.)

> Then in our case, considering that
> mappings are page sized, array_index_nospec() would clamp it to either
> first 32 bytes of the first page or to absolute addresses [0, 32)
> in case size==0 and the mapping is NULL. But that could be just my
> fantasy.

Without having looked at this uring code in detail: What you're saying
sounds reasonable to me. Though one thing to be careful with is that
if the value you're masking is a byte offset, you should ideally make
sure to give array_index_mask_nospec() a limit that is something like
the size of the region minus the size of an element - you should
probably assume that the preceding "if (unlikely(offset %
sizeof(long)))" could be bypassed in speculation, though in practice
that's probably unlikely since it'd compile into something
straightforward like a bitmask-and-test with immediate operands.
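To make the mask generation concrete, here is a small stand-alone user-space illustration of the behaviour Jann describes. It mirrors the logic of the generic helper; real kernel code should use array_index_nospec() from include/linux/nospec.h, which also covers arch-specific details such as arm64's csdb().

```c
#include <stdio.h>

/*
 * User-space illustration of the generic mask generation: the result is
 * ~0UL when index < size and 0 when index >= size, computed without a
 * conditional branch the CPU could mispredict.  (Relies on arithmetic
 * right shift of a negative value, as gcc/clang provide.)
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (sizeof(long) * 8 - 1);
}

int main(void)
{
	unsigned long size = 32;	/* e.g. one fixed-size argument slot */
	unsigned long offsets[] = { 0, 31, 32, 1000 };

	for (unsigned int i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		unsigned long off = offsets[i];
		unsigned long clamped = off & index_mask(off, size);

		/* out-of-bounds offsets collapse to 0 instead of leaking */
		printf("offset %4lu -> clamped %lu\n", off, clamped);
	}
	return 0;
}
```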
On 11/27/24 22:53, Jann Horn wrote: > On Thu, Nov 21, 2024 at 1:12 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 11/19/24 01:59, Jens Axboe wrote: >>> On 11/18/24 6:43 PM, Pavel Begunkov wrote: >>>> On 11/19/24 01:29, Jens Axboe wrote: >>>>> On 11/18/24 6:29 PM, Pavel Begunkov wrote: >>>>>> With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments >>>>>> for the waiting loop the user can specify an offset into a pre-mapped >>>>>> region of memory, in which case the >>>>>> [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the >>>>>> argument. >>>>>> >>>>>> As we address a kernel array using a user given index, it'd be a subject >>>>>> to speculation type of exploits. >>>>>> >>>>>> Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") >>>>>> Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") >>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>> --- >>>>>> io_uring/io_uring.c | 1 + >>>>>> 1 file changed, 1 insertion(+) >>>>>> >>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >>>>>> index da8fd460977b..3a3e4fca1545 100644 >>>>>> --- a/io_uring/io_uring.c >>>>>> +++ b/io_uring/io_uring.c >>>>>> @@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx, >>>>>> end > ctx->cq_wait_size)) >>>>>> return ERR_PTR(-EFAULT); >>>>>> + barrier_nospec(); >>>>>> return ctx->cq_wait_arg + offset; >>>>> >>>>> We need something better than that, barrier_nospec() is a big slow >>>>> hammer... >>>> >>>> Right, more of a discussion opener. I wonder if Jann can help here >>>> (see the other reply). I don't like back and forth like that, but if >>>> nothing works there is an option of returning back to reg-wait array >>>> indexes. Trivial to change, but then we're committing to not expanding >>>> the structure or complicating things if we do. >>> >>> Then I think it should've been marked as a discussion point, because we >>> definitely can't do this. Soliciting input is perfectly fine. And yeah, >>> was thinking the same thing, if this is an issue then we just go back to >>> indexing again. At least both the problem and solution is well known >>> there. The original aa00f67adc2c0 just needed an array_index_nospec() >>> and it would've been fine. >>> >>> Not a huge deal in terms of timing, either way. >>> >>> I suspect we can do something similar here, with just clamping the >>> indexing offset. But let's hear what Jann thinks. >> >> That what I hope for, but I can't say I entirely understand it. E.g. >> why can_do_masked_user_access() exists and guards mask_user_address(). > > FWIW, my understanding is that x86-64 can do this because there is a > really big hole in the virtual address space between userspace > addresses and kernel addresses (over 2^63 bytes big); so if you check > that the address at which you start accessing memory is in userspace, > and then you access memory more or less linearly forward from there, > you'll never reach kernelspace addresses. > >> IIRC, with invalid argument the mask turns the index into 0. A complete >> speculation from my side of how it works is that you then able to >> "inspect" or what's the right word the value of array[0] but not a >> address of memory of choice. > > Yeah, exactly, that's the idea of array_index_nospec(). 
> As the comment above the generic version of array_index_mask_nospec()
> describes, the mask used for the bitwise AND in array_index_nospec()
> is generated as follows:
> "When @index is out of bounds (@index >= @size), the sign bit will be
> set. Extend the sign bit to all bits and invert, giving a result of
> zero for an out of bounds index, or ~0 if within bounds [0, @size)."
>
> The X86 version of array_index_mask_nospec() just does the same with
> optimized assembly code. But there are architectures, like arm64, that
> actually do more than just this - arm64's array_index_mask_nospec()
> additionally includes a csdb(), which is some arm64 barrier for some
> kinds of speculation. (So, for example, open-coding a bitwise AND may
> not be enough on all architectures.)

Good to know, thanks

>> Then in our case, considering that
>> mappings are page sized, array_index_nospec() would clamp it to either
>> first 32 bytes of the first page or to absolute addresses [0, 32)
>> in case size==0 and the mapping is NULL. But that could be just my
>> fantasy.
>
> Without having looked at this uring code in detail: What you're saying
> sounds reasonable to me. Though one thing to be careful with is that
> if the value you're masking is a byte offset, you should ideally make
> sure to give array_index_mask_nospec() a limit that is something like
> the size of the region minus the size of an element - you should

Good point, thanks. I'll prepare something based on it.

> probably assume that the preceding "if (unlikely(offset %
> sizeof(long)))" could be bypassed in speculation, though in practice

That is there because of arches that don't like unaligned accesses, and
the size checks don't rely on it. It doesn't seem we should worry about
it.

> that's probably unlikely since it'd compile into something
> straightforward like a bitmask-and-test with immediate operands.
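Putting Jann's suggestion together with the quoted hunk, the follow-up Pavel mentions might look something like the sketch below. This is not the patch that was actually sent; it assumes the io_get_ext_arg_reg() shape shown in the diff (the second parameter's type is an assumption) and simply replaces barrier_nospec() with an offset clamp.

```c
#include <linux/nospec.h>
#include <linux/overflow.h>

/* Sketch only, not the actual follow-up patch. */
static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
						    const void __user *uarg)
{
	unsigned long size = sizeof(struct io_uring_reg_wait);
	unsigned long offset = (unsigned long)uarg;
	unsigned long end;

	if (unlikely(offset % sizeof(long)))
		return ERR_PTR(-EFAULT);

	if (unlikely(check_add_overflow(offset, size, &end) ||
		     end > ctx->cq_wait_size))
		return ERR_PTR(-EFAULT);

	/*
	 * Clamp the user-supplied byte offset so that, even if the checks
	 * above are bypassed speculatively, the access cannot run past the
	 * registered region.  Per Jann's note, the limit is the region size
	 * minus the element size (+1 so the largest valid offset still
	 * passes).  A NULL / zero-sized region would need separate care,
	 * as discussed earlier in the thread.
	 */
	offset = array_index_nospec(offset, ctx->cq_wait_size - size + 1);
	return ctx->cq_wait_arg + offset;
}
```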
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index da8fd460977b..3a3e4fca1545 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3207,6 +3207,7 @@ static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
 		     end > ctx->cq_wait_size))
 		return ERR_PTR(-EFAULT);
 
+	barrier_nospec();
 	return ctx->cq_wait_arg + offset;
 }
With *ENTER_EXT_ARG_REG, instead of passing a user pointer with arguments
for the waiting loop, the user can specify an offset into a pre-mapped
region of memory, in which case [offset, offset + sizeof(io_uring_reg_wait))
will be interpreted as the argument.

As we address a kernel array using a user-given index, it'd be subject to
speculation-type exploits.

Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments")
Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 1 +
 1 file changed, 1 insertion(+)