[v2] hw: misc: edu: fix 2 off-by-one errors

Message ID	20221015211025.16781-1-chrisfriedt@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Chris Friedt <chrisfriedt@gmail.com> To: qemu-devel@nongnu.org Cc: cfriedt@meta.com, jslaby@suse.cz Subject: [v2] hw: misc: edu: fix 2 off-by-one errors Date: Sat, 15 Oct 2022 17:10:25 -0400 Message-Id: <20221015211025.16781-1-chrisfriedt@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::f2c; envelope-from=chrisfriedt@gmail.com; helo=mail-qv1-xf2c.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	[v2] hw: misc: edu: fix 2 off-by-one errors \| expand [v2] hw: misc: edu: fix 2 off-by-one errors

Christopher Friedt Oct. 15, 2022, 9:10 p.m. UTC

From: Christopher Friedt <cfriedt@meta.com>

In the case that size1 was zero, because of the explicit
'end1 > addr' check, the range check would fail and the error
message would read as shown below. The correct comparison
is 'end1 >= addr' (or 'addr <= end1').

EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!

At the opposite end, in the case that size1 was 4096, within()
would fail because of the non-inclusive check 'end1 < end2',
which should have been 'end1 <= end2'. The error message would
previously say

EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!

This change
1. renames local variables to be more less ambiguous
2. fixes the two off-by-one errors described above.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254

Signed-off-by: Christopher Friedt <cfriedt@meta.com>
---
 hw/misc/edu.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

Jiri Slaby (SUSE) Oct. 17, 2022, 6:22 a.m. UTC | #1

On 15. 10. 22, 23:10, Chris Friedt wrote:
> From: Christopher Friedt <cfriedt@meta.com>
> 
> In the case that size1 was zero, because of the explicit
> 'end1 > addr' check, the range check would fail and the error
> message would read as shown below. The correct comparison
> is 'end1 >= addr' (or 'addr <= end1').
> 
> EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!
> 
> At the opposite end, in the case that size1 was 4096, within()
> would fail because of the non-inclusive check 'end1 < end2',
> which should have been 'end1 <= end2'. The error message would
> previously say
> 
> EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!
> 
> This change
> 1. renames local variables to be more less ambiguous
> 2. fixes the two off-by-one errors described above.

This should be split into two patches. This way, it's hard to review.

> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254


thanks,

Alexander Bulekov Oct. 17, 2022, 1:44 p.m. UTC | #2

On 221015 1710, Chris Friedt wrote:
> From: Christopher Friedt <cfriedt@meta.com>
> 
> In the case that size1 was zero, because of the explicit
> 'end1 > addr' check, the range check would fail and the error
> message would read as shown below. The correct comparison
> is 'end1 >= addr' (or 'addr <= end1').
> 
> EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!
> 
> At the opposite end, in the case that size1 was 4096, within()
> would fail because of the non-inclusive check 'end1 < end2',
> which should have been 'end1 <= end2'. The error message would
> previously say
> 
> EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!
> 
> This change
> 1. renames local variables to be more less ambiguous
> 2. fixes the two off-by-one errors described above.
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254
> 
> Signed-off-by: Christopher Friedt <cfriedt@meta.com>

Reviewed-by: Alexander Bulekov <alxndr@bu.edu>

As a side-note, seems strange that edu_check_range will abort the entire
VM if the check fails, rather than handling the error more elegantly.
Maybe that's useful for students developing kernel drivers against the
device.

> ---
>  hw/misc/edu.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/misc/edu.c b/hw/misc/edu.c
> index e935c418d4..52afbd792a 100644
> --- a/hw/misc/edu.c
> +++ b/hw/misc/edu.c
> @@ -103,25 +103,24 @@ static void edu_lower_irq(EduState *edu, uint32_t val)
>      }
>  }
>  
> -static bool within(uint64_t addr, uint64_t start, uint64_t end)
> +static void edu_check_range(uint64_t xfer_start, uint64_t xfer_size,
> +                uint64_t dma_start, uint64_t dma_size)
>  {
> -    return start <= addr && addr < end;
> -}
> -
> -static void edu_check_range(uint64_t addr, uint64_t size1, uint64_t start,
> -                uint64_t size2)
> -{
> -    uint64_t end1 = addr + size1;
> -    uint64_t end2 = start + size2;
> -
> -    if (within(addr, start, end2) &&
> -            end1 > addr && within(end1, start, end2)) {
> +    uint64_t xfer_end = xfer_start + xfer_size;
> +    uint64_t dma_end = dma_start + dma_size;
> +
> +    /*
> +     * 1. ensure we aren't overflowing
> +     * 2. ensure that xfer is within dma address range
> +     */
> +    if (dma_end >= dma_start && xfer_end >= xfer_start &&
> +        xfer_start >= dma_start && xfer_end <= dma_end) {
>          return;
>      }
>  
>      hw_error("EDU: DMA range 0x%016"PRIx64"-0x%016"PRIx64
>               " out of bounds (0x%016"PRIx64"-0x%016"PRIx64")!",
> -            addr, end1 - 1, start, end2 - 1);
> +            xfer_start, xfer_end - 1, dma_start, dma_end - 1);
>  }
>  
>  static dma_addr_t edu_clamp_addr(const EduState *edu, dma_addr_t addr)
> -- 
> 2.36.1
> 
>

Peter Maydell Oct. 17, 2022, 2:13 p.m. UTC | #3

On Mon, 17 Oct 2022 at 14:50, Alexander Bulekov <alxndr@bu.edu> wrote:
>
> On 221015 1710, Chris Friedt wrote:
> > From: Christopher Friedt <cfriedt@meta.com>
> >
> > In the case that size1 was zero, because of the explicit
> > 'end1 > addr' check, the range check would fail and the error
> > message would read as shown below. The correct comparison
> > is 'end1 >= addr' (or 'addr <= end1').
> >
> > EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!
> >
> > At the opposite end, in the case that size1 was 4096, within()
> > would fail because of the non-inclusive check 'end1 < end2',
> > which should have been 'end1 <= end2'. The error message would
> > previously say
> >
> > EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!
> >
> > This change
> > 1. renames local variables to be more less ambiguous
> > 2. fixes the two off-by-one errors described above.
> >
> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254
> >
> > Signed-off-by: Christopher Friedt <cfriedt@meta.com>
>
> Reviewed-by: Alexander Bulekov <alxndr@bu.edu>
>
> As a side-note, seems strange that edu_check_range will abort the entire
> VM if the check fails, rather than handling the error more elegantly.

Yes, this is bad for a device model, though we have a lot of
older device models that still do it. The preferred pattern is:
 * for situations which are "if this happens there is a
   bug in QEMU itself", use assert. hw_error() is a kind of
   assert that prints a bunch of guest register state: sometimes
   you want that, but more often you just want normal assert()
 * for situations where the guest has misprogrammed the device,
   log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
   and continue with whatever the real hardware would do, or
   some reasonable choice if the h/w spec is vague
 * for situations where the guest is correct but is trying to
   get the device to do something our model doesn't implement
   yet, same as above but with LOG_UNIMP.

Probably most hw_error() uses in the codebase should be
replaced with one of the above options.

thanks
-- PMM

Christopher Friedt Oct. 17, 2022, 4:36 p.m. UTC | #4

On Mon, Oct 17, 2022 at 2:23 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> On 15. 10. 22, 23:10, Chris Friedt wrote:
> > From: Christopher Friedt <cfriedt@meta.com>
> This should be split into two patches. This way, it's hard to review.

I can do that :-)

Thanks for the review

Alex Bennée Oct. 17, 2022, 5:21 p.m. UTC | #5

Peter Maydell <peter.maydell@linaro.org> writes:

> On Mon, 17 Oct 2022 at 14:50, Alexander Bulekov <alxndr@bu.edu> wrote:
>>
>> On 221015 1710, Chris Friedt wrote:
>> > From: Christopher Friedt <cfriedt@meta.com>
>> >
>> > In the case that size1 was zero, because of the explicit
>> > 'end1 > addr' check, the range check would fail and the error
>> > message would read as shown below. The correct comparison
>> > is 'end1 >= addr' (or 'addr <= end1').
>> >
>> > EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!
>> >
>> > At the opposite end, in the case that size1 was 4096, within()
>> > would fail because of the non-inclusive check 'end1 < end2',
>> > which should have been 'end1 <= end2'. The error message would
>> > previously say
>> >
>> > EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!
>> >
>> > This change
>> > 1. renames local variables to be more less ambiguous
>> > 2. fixes the two off-by-one errors described above.
>> >
>> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254
>> >
>> > Signed-off-by: Christopher Friedt <cfriedt@meta.com>
>>
>> Reviewed-by: Alexander Bulekov <alxndr@bu.edu>
>>
>> As a side-note, seems strange that edu_check_range will abort the entire
>> VM if the check fails, rather than handling the error more elegantly.
>
> Yes, this is bad for a device model, though we have a lot of
> older device models that still do it. The preferred pattern is:
>  * for situations which are "if this happens there is a
>    bug in QEMU itself", use assert. hw_error() is a kind of
>    assert that prints a bunch of guest register state: sometimes
>    you want that, but more often you just want normal assert()
>  * for situations where the guest has misprogrammed the device,
>    log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>    and continue with whatever the real hardware would do, or
>    some reasonable choice if the h/w spec is vague
>  * for situations where the guest is correct but is trying to
>    get the device to do something our model doesn't implement
>    yet, same as above but with LOG_UNIMP.
>
> Probably most hw_error() uses in the codebase should be
> replaced with one of the above options.

We should probably document this best practice somewhere in docs/devel
but I guess we really need a "guide to writing device emulation"
section.

>
> thanks
> -- PMM

Chris Friedt Oct. 17, 2022, 10:05 p.m. UTC | #6

> On Oct 17, 2022, at 1:22 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> > 
> 
> Peter Maydell <peter.maydell@linaro.org> writes:
> 
>>> On Mon, 17 Oct 2022 at 14:50, Alexander Bulekov <alxndr@bu.edu> wrote:
>>> 
>>> On 221015 1710, Chris Friedt wrote:
>>>> From: Christopher Friedt <cfriedt@meta.com>
>>>> 
>>> Reviewed-by: Alexander Bulekov <alxndr@bu.edu>
>>> 
>>> As a side-note, seems strange that edu_check_range will abort the entire
>>> VM if the check fails, rather than handling the error more elegantly.
>> 
>> Yes, this is bad for a device model, though we have a lot of
>> older device models that still do it. The preferred pattern is:
>> * for situations which are "if this happens there is a
>>   bug in QEMU itself", use assert. hw_error() is a kind of
>>   assert that prints a bunch of guest register state: sometimes
>>   you want that, but more often you just want normal assert()
>> * for situations where the guest has misprogrammed the device,
>>   log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>>   and continue with whatever the real hardware would do, or
>>   some reasonable choice if the h/w spec is vague
>> * for situations where the guest is correct but is trying to
>>   get the device to do something our model doesn't implement
>>   yet, same as above but with LOG_UNIMP.
>> 
>> Probably most hw_error() uses in the codebase should be
>> replaced with one of the above options.
> 
> We should probably document this best practice somewhere in docs/devel
> but I guess we really need a "guide to writing device emulation"
> section.

Should I make a separate PR for that or attach it to the existing series (at v3 now)?

Thanks,

C

Alex Bennée Oct. 18, 2022, 6:11 a.m. UTC | #7

Chris Friedt <cfriedt@meta.com> writes:

>> On Oct 17, 2022, at 1:22 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>> 
>> > 
>> 
>> Peter Maydell <peter.maydell@linaro.org> writes:
>> 
>>>> On Mon, 17 Oct 2022 at 14:50, Alexander Bulekov <alxndr@bu.edu> wrote:
>>>> 
>>>> On 221015 1710, Chris Friedt wrote:
>>>>> From: Christopher Friedt <cfriedt@meta.com>
>>>>> 
>>>> Reviewed-by: Alexander Bulekov <alxndr@bu.edu>
>>>> 
>>>> As a side-note, seems strange that edu_check_range will abort the entire
>>>> VM if the check fails, rather than handling the error more elegantly.
>>> 
>>> Yes, this is bad for a device model, though we have a lot of
>>> older device models that still do it. The preferred pattern is:
>>> * for situations which are "if this happens there is a
>>>   bug in QEMU itself", use assert. hw_error() is a kind of
>>>   assert that prints a bunch of guest register state: sometimes
>>>   you want that, but more often you just want normal assert()
>>> * for situations where the guest has misprogrammed the device,
>>>   log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>>>   and continue with whatever the real hardware would do, or
>>>   some reasonable choice if the h/w spec is vague
>>> * for situations where the guest is correct but is trying to
>>>   get the device to do something our model doesn't implement
>>>   yet, same as above but with LOG_UNIMP.
>>> 
>>> Probably most hw_error() uses in the codebase should be
>>> replaced with one of the above options.
>> 
>> We should probably document this best practice somewhere in docs/devel
>> but I guess we really need a "guide to writing device emulation"
>> section.
>
> Should I make a separate PR for that or attach it to the existing
> series (at v3 now)?

I don't think improving our developer documentation needs to be part of
this fix. However if you want to take a run at improving it in a new
series be my guest ;-)

>
> Thanks,
>
> C

Jiri Slaby (SUSE) Oct. 18, 2022, 6:27 a.m. UTC | #8

On 17. 10. 22, 15:44, Alexander Bulekov wrote:
> On 221015 1710, Chris Friedt wrote:
>> From: Christopher Friedt <cfriedt@meta.com>
>>
>> In the case that size1 was zero, because of the explicit
>> 'end1 > addr' check, the range check would fail and the error
>> message would read as shown below. The correct comparison
>> is 'end1 >= addr' (or 'addr <= end1').
>>
>> EDU: DMA range 0x40000-0x3ffff out of bounds (0x40000-0x40fff)!
>>
>> At the opposite end, in the case that size1 was 4096, within()
>> would fail because of the non-inclusive check 'end1 < end2',
>> which should have been 'end1 <= end2'. The error message would
>> previously say
>>
>> EDU: DMA range 0x40000-0x40fff out of bounds (0x40000-0x40fff)!
>>
>> This change
>> 1. renames local variables to be more less ambiguous
>> 2. fixes the two off-by-one errors described above.
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1254
>>
>> Signed-off-by: Christopher Friedt <cfriedt@meta.com>
> 
> Reviewed-by: Alexander Bulekov <alxndr@bu.edu>
> 
> As a side-note, seems strange that edu_check_range will abort the entire
> VM if the check fails, rather than handling the error more elegantly.
> Maybe that's useful for students developing kernel drivers against the
> device.

Yes, that was exactly the intention. First, as a punishment as they do 
something really wrong. Second, so they notice -- writing something 
wrong to a register of a real HW often freezes a machine too. Especially 
when misprogramming a DMA controller.

OTOH, this sucks too. Ext4 (and other FS too) is fine, they don't lose 
data. However they need to freshly boot, repair FS and investigate/think 
a lot. This trial and run (and crash) takes several loops for some.

So I am for softening it a bit. But they still should be noticed in some 
obvious way.

thanks,

Jiri Slaby (SUSE) Oct. 18, 2022, 6:30 a.m. UTC | #9

On 17. 10. 22, 16:13, Peter Maydell wrote:
>   * for situations where the guest has misprogrammed the device,
>     log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>     and continue with whatever the real hardware would do, or
>     some reasonable choice if the h/w spec is vague

As I wrote in the previous mail, can we stop the machine after the print 
somehow, for example? So that the students have to "cont" in the qemu 
console as an acknowledgment when this happens.

thanks,

Alex Bennée Oct. 18, 2022, 9:18 a.m. UTC | #10

Jiri Slaby <jirislaby@kernel.org> writes:

> On 17. 10. 22, 16:13, Peter Maydell wrote:
>>   * for situations where the guest has misprogrammed the device,
>>     log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>>     and continue with whatever the real hardware would do, or
>>     some reasonable choice if the h/w spec is vague
>
> As I wrote in the previous mail, can we stop the machine after the
> print somehow, for example? So that the students have to "cont" in the
> qemu console as an acknowledgment when this happens.

You can bring the system to a halt with vm_stop(RUN_STATE_PAUSED) or
possible RUN_STATE_DEBUG?

I don't know how obnoxious the message should be at this point (i.e.
should it be masked by LOG_GUEST_ERROR) because if the system halts the
user might not notice. However I guess with this device they would be
expecting to check this sort of thing.

>
> thanks,

Peter Maydell Oct. 18, 2022, 9:30 a.m. UTC | #11

On Tue, 18 Oct 2022 at 10:21, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Jiri Slaby <jirislaby@kernel.org> writes:
>
> > On 17. 10. 22, 16:13, Peter Maydell wrote:
> >>   * for situations where the guest has misprogrammed the device,
> >>     log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
> >>     and continue with whatever the real hardware would do, or
> >>     some reasonable choice if the h/w spec is vague
> >
> > As I wrote in the previous mail, can we stop the machine after the
> > print somehow, for example? So that the students have to "cont" in the
> > qemu console as an acknowledgment when this happens.
>
> You can bring the system to a halt with vm_stop(RUN_STATE_PAUSED) or
> possible RUN_STATE_DEBUG?

No, please don't do anything like that. This should not be special.
Just log a message if the guest does something bad. There are
an absolute ton of things that the guest can do wrong, and
in general QEMU does not attempt to be an "identify all the
ways the guest does something wrong in a friendly way" system.

thanks
-- PMM

Alex Bennée Oct. 18, 2022, 12:11 p.m. UTC | #12

Peter Maydell <peter.maydell@linaro.org> writes:

> On Tue, 18 Oct 2022 at 10:21, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Jiri Slaby <jirislaby@kernel.org> writes:
>>
>> > On 17. 10. 22, 16:13, Peter Maydell wrote:
>> >>   * for situations where the guest has misprogrammed the device,
>> >>     log that with qemu_log_mask(LOG_GUEST_ERROR, ...)
>> >>     and continue with whatever the real hardware would do, or
>> >>     some reasonable choice if the h/w spec is vague
>> >
>> > As I wrote in the previous mail, can we stop the machine after the
>> > print somehow, for example? So that the students have to "cont" in the
>> > qemu console as an acknowledgment when this happens.
>>
>> You can bring the system to a halt with vm_stop(RUN_STATE_PAUSED) or
>> possible RUN_STATE_DEBUG?
>
> No, please don't do anything like that. This should not be special.
> Just log a message if the guest does something bad. There are
> an absolute ton of things that the guest can do wrong, and
> in general QEMU does not attempt to be an "identify all the
> ways the guest does something wrong in a friendly way" system.

We should clean-up the other uses of vm_stop in hw/ then:

  ./hw/ppc/prep_systemio.c:78:        vm_stop(RUN_STATE_PAUSED);
  ./hw/ppc/vof.c:921:        vm_stop(RUN_STATE_PAUSED);
  ./hw/vfio/pci.c:2725:    vm_stop(RUN_STATE_INTERNAL_ERROR);


>
> thanks
> -- PMM

Christopher Friedt Nov. 1, 2022, 6:59 p.m. UTC | #13

Hi Jiri, Peter,

Are you able to review the more recent version of this change?

Look for the subject "[PATCH v4 x/3] hw: misc: edu: ..."

I believe I addressed all concerns.

Cheers,

C


On Mon, Oct 17, 2022 at 12:36 PM Christopher Friedt
<chrisfriedt@gmail.com> wrote:
>
> On Mon, Oct 17, 2022 at 2:23 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> > On 15. 10. 22, 23:10, Chris Friedt wrote:
> > > From: Christopher Friedt <cfriedt@meta.com>
> > This should be split into two patches. This way, it's hard to review.
>
> I can do that :-)
>
> Thanks for the review

[v2] hw: misc: edu: fix 2 off-by-one errors

Commit Message

Comments

Patch