Message ID | d7bb312e-2428-45f6-b9b3-59ba544e8b94@kili.mountain (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: phy: fix a signedness bug in genphy_loopback() | expand |
On Fri, May 26, 2023 at 02:45:54PM +0300, Dan Carpenter wrote: > The "val" variable is used to store error codes from phy_read() so > it needs to be signed for the error handling to work as expected. > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") LGTM. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Thanks!
On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > The "val" variable is used to store error codes from phy_read() so > it needs to be signed for the error handling to work as expected. > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Is it going to be obvious to PHY-savvy folks that the val passed to phy_read_poll_timeout() must be an int? Is it a very common pattern? My outsider intuition is that since regs are 16b, u16 is reasonable, and more people may make the same mistake. Therefore we should try to fix phy_read_poll_timeout() instead to use a local variable like it does for __ret. Weaker version would be to add a compile time check to ensure val is signed (assert(typeof(val)~0ULL < 0) or such?). Opinions?
On Mon, May 29, 2023 at 09:58:02PM -0700, Jakub Kicinski wrote: > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > The "val" variable is used to store error codes from phy_read() so > > it needs to be signed for the error handling to work as expected. > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > Is it going to be obvious to PHY-savvy folks that the val passed to > phy_read_poll_timeout() must be an int? Is it a very common pattern? > My outsider intuition is that since regs are 16b, u16 is reasonable, > and more people may make the same mistake. Therefore we should try to > fix phy_read_poll_timeout() instead to use a local variable like it > does for __ret. > > Weaker version would be to add a compile time check to ensure val > is signed (assert(typeof(val)~0ULL < 0) or such?). > > Opinions? Yes, I think that would be a saner approach, since phy_read_poll_timeout() returns the error value, rather than using the variable passed in. Andrew?
On Mon, 2023-05-29 at 21:58 -0700, Jakub Kicinski wrote: > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > The "val" variable is used to store error codes from phy_read() so > > it needs to be signed for the error handling to work as expected. > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > Is it going to be obvious to PHY-savvy folks that the val passed to > phy_read_poll_timeout() must be an int? Is it a very common pattern? > My outsider intuition is that since regs are 16b, u16 is reasonable, > and more people may make the same mistake. Therefore we should try to > fix phy_read_poll_timeout() instead to use a local variable like it > does for __ret. > > Weaker version would be to add a compile time check to ensure val > is signed (assert(typeof(val)~0ULL < 0) or such?). FTR, a BUILD_BUG_ON() the above check spots issues in several places (e.g. r8169_main.c, drivers/net/phy/phy_device.c, ...) I think it should be better resort to a signed local variable in phy_read_poll_timeout() Thanks, Paolo
On Tue, May 30, 2023 at 11:06:55AM +0200, Paolo Abeni wrote: > On Mon, 2023-05-29 at 21:58 -0700, Jakub Kicinski wrote: > > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > > The "val" variable is used to store error codes from phy_read() so > > > it needs to be signed for the error handling to work as expected. > > > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > > > Is it going to be obvious to PHY-savvy folks that the val passed to > > phy_read_poll_timeout() must be an int? Is it a very common pattern? > > My outsider intuition is that since regs are 16b, u16 is reasonable, > > and more people may make the same mistake. Therefore we should try to > > fix phy_read_poll_timeout() instead to use a local variable like it > > does for __ret. > > > > Weaker version would be to add a compile time check to ensure val > > is signed (assert(typeof(val)~0ULL < 0) or such?). > > FTR, a BUILD_BUG_ON() the above check spots issues in several places > (e.g. r8169_main.c, drivers/net/phy/phy_device.c, ...) > I don't see an issue in r8169_main.c and in drivers/net/phy/phy_device.c then I only find the bug from this patch. regards, dan carpenter diff --git a/include/linux/phy.h b/include/linux/phy.h index 6478838405a08..f05fc25b77583 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -1173,6 +1173,7 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) ({ \ int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ + BUILD_BUG_ON((typeof(val))~0ULL > 0); \ if (val < 0) \ __ret = val; \ if (__ret) \
On Tue, 2023-05-30 at 12:23 +0300, Dan Carpenter wrote: > On Tue, May 30, 2023 at 11:06:55AM +0200, Paolo Abeni wrote: > > On Mon, 2023-05-29 at 21:58 -0700, Jakub Kicinski wrote: > > > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > > > The "val" variable is used to store error codes from phy_read() so > > > > it needs to be signed for the error handling to work as expected. > > > > > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > > > > > Is it going to be obvious to PHY-savvy folks that the val passed to > > > phy_read_poll_timeout() must be an int? Is it a very common pattern? > > > My outsider intuition is that since regs are 16b, u16 is reasonable, > > > and more people may make the same mistake. Therefore we should try to > > > fix phy_read_poll_timeout() instead to use a local variable like it > > > does for __ret. > > > > > > Weaker version would be to add a compile time check to ensure val > > > is signed (assert(typeof(val)~0ULL < 0) or such?). > > > > FTR, a BUILD_BUG_ON() the above check spots issues in several places > > (e.g. r8169_main.c, drivers/net/phy/phy_device.c, ...) > > > > I don't see an issue in r8169_main.c and in drivers/net/phy/phy_device.c > then I only find the bug from this patch. > > regards, > dan carpenter > > diff --git a/include/linux/phy.h b/include/linux/phy.h > index 6478838405a08..f05fc25b77583 100644 > --- a/include/linux/phy.h > +++ b/include/linux/phy.h > @@ -1173,6 +1173,7 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) > ({ \ > int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ > sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ > + BUILD_BUG_ON((typeof(val))~0ULL > 0); \ > if (val < 0) \ > __ret = val; \ > if (__ret) \ > Uhm... I have no idea what happened to my build host. I did see more build errors in previous attempt, but now I only observe the one you address with this patch. I guess some PEBKAC hit me here. /P
On Tue, May 30, 2023 at 12:23:32PM +0300, Dan Carpenter wrote: > I don't see an issue in r8169_main.c and in drivers/net/phy/phy_device.c > then I only find the bug from this patch. I agree - inspecting the code reveals that "val" would be of type "int". > + BUILD_BUG_ON((typeof(val))~0ULL > 0); \ I've just thrown this in to my builds, and building for arm64 using debian stable's gcc, I don't see any errors with genphy_loopback() suitably hacked, even with r8169 included in the build.
On Tue, May 30, 2023 at 10:49:22AM +0100, Russell King (Oracle) wrote: > On Tue, May 30, 2023 at 12:23:32PM +0300, Dan Carpenter wrote: > > I don't see an issue in r8169_main.c and in drivers/net/phy/phy_device.c > > then I only find the bug from this patch. > > I agree - inspecting the code reveals that "val" would be of type "int". > > > + BUILD_BUG_ON((typeof(val))~0ULL > 0); \ > > I've just thrown this in to my builds, and building for arm64 using > debian stable's gcc, I don't see any errors with genphy_loopback() > suitably hacked, even with r8169 included in the build. Also successfully built with 32-bit ARM gcc.
On Mon, May 29, 2023 at 09:58:02PM -0700, Jakub Kicinski wrote: > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > The "val" variable is used to store error codes from phy_read() so > > it needs to be signed for the error handling to work as expected. > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > Is it going to be obvious to PHY-savvy folks that the val passed to > phy_read_poll_timeout() must be an int? Is it a very common pattern? > My outsider intuition is that since regs are 16b, u16 is reasonable, > and more people may make the same mistake. It is common to get this wrong in general with PHY drivers. Dan regularly posts fixes like this soon after a PHY driver patch it merged. I really wish we could somehow get the compiler to warn when the result from phy_read() is stored into a unsigned type. It would save Dan a lot of work. > Therefore we should try to fix phy_read_poll_timeout() instead to > use a local variable like it does for __ret. The problem with that is val is supposed to be available to the caller. I don't know if it is every actually used, but if it is, using an internal signed variable and then throwing away the sign bit on return is going to result in similar bugs. > Weaker version would be to add a compile time check to ensure val > is signed (assert(typeof(val)~0ULL < 0) or such?). I think the BUILD BUG is a better solution, since it catches all problems. And at developer compile time, rather than at Dan runs his static checker time. Andrew
On Tue, May 30, 2023 at 02:39:53PM +0200, Andrew Lunn wrote: > On Mon, May 29, 2023 at 09:58:02PM -0700, Jakub Kicinski wrote: > > On Fri, 26 May 2023 14:45:54 +0300 Dan Carpenter wrote: > > > The "val" variable is used to store error codes from phy_read() so > > > it needs to be signed for the error handling to work as expected. > > > > > > Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") > > > Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> > > > > Is it going to be obvious to PHY-savvy folks that the val passed to > > phy_read_poll_timeout() must be an int? Is it a very common pattern? > > My outsider intuition is that since regs are 16b, u16 is reasonable, > > and more people may make the same mistake. > > It is common to get this wrong in general with PHY drivers. Dan > regularly posts fixes like this soon after a PHY driver patch it > merged. I really wish we could somehow get the compiler to warn when > the result from phy_read() is stored into a unsigned type. It would > save Dan a lot of work. I don't see these as much as I used. It's maybe once per month. I'm not sure why, maybe kbuild emails everyone before I see it? GCC will warn about this with -Wtype-limits. Clang will also trigger a warning. The Smatch check for this had a bug where it only warned about if (x < 0) { if x was u32 or larger. I fixed that bug which is why I was looking at this code. I will push the fix for that in a couple days. regards, dan carpenter
> I don't see these as much as I used. It's maybe once per month. I'm > not sure why, maybe kbuild emails everyone before I see it? GCC will > warn about this with -Wtype-limits. Clang will also trigger a warning. That is interesting. Maybe we should enable that in driver/net/net and driver/net/pcs. Andrew
On Tue, 30 May 2023 14:39:53 +0200 Andrew Lunn wrote: > > Therefore we should try to fix phy_read_poll_timeout() instead to > > use a local variable like it does for __ret. > > The problem with that is val is supposed to be available to the > caller. I don't know if it is every actually used, but if it is, using > an internal signed variable and then throwing away the sign bit on > return is going to result in similar bugs. This is what I meant FWIW: diff --git a/include/linux/phy.h b/include/linux/phy.h index 7addde5d14c0..829bd57b8794 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -1206,10 +1206,13 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ timeout_us, sleep_before_read) \ ({ \ - int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ + int __ret, __val; \ + \ + __ret = read_poll_timeout(phy_read, __val, __val < 0 || (cond), \ sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ - if (val < 0) \ - __ret = val; \ + val = __val; + if (__val < 0) \ + __ret = __val; \ if (__ret) \ phydev_err(phydev, "%s failed: %d\n", __func__, __ret); \ __ret; \ I tried enabling -Wtype-limits but it's _very_ noisy :(
On Tue, May 30, 2023 at 12:19:10PM -0700, Jakub Kicinski wrote: > On Tue, 30 May 2023 14:39:53 +0200 Andrew Lunn wrote: > > > Therefore we should try to fix phy_read_poll_timeout() instead to > > > use a local variable like it does for __ret. > > > > The problem with that is val is supposed to be available to the > > caller. I don't know if it is every actually used, but if it is, using > > an internal signed variable and then throwing away the sign bit on > > return is going to result in similar bugs. > > This is what I meant FWIW: > > diff --git a/include/linux/phy.h b/include/linux/phy.h > index 7addde5d14c0..829bd57b8794 100644 > --- a/include/linux/phy.h > +++ b/include/linux/phy.h > @@ -1206,10 +1206,13 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) > #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ > timeout_us, sleep_before_read) \ > ({ \ > - int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ > + int __ret, __val; \ > + \ > + __ret = read_poll_timeout(phy_read, __val, __val < 0 || (cond), \ > sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ > - if (val < 0) \ > - __ret = val; \ > + val = __val; > + if (__val < 0) \ > + __ret = __val; \ > if (__ret) \ > phydev_err(phydev, "%s failed: %d\n", __func__, __ret); \ > __ret; \ > > > I tried enabling -Wtype-limits but it's _very_ noisy :( Yes, looks good, that's what I thought you were meaning, and I totally agree with it. Thanks! Whatever we decide for this will also need to be applied to phy_read_mmd_poll_timeout() as well.
> > This is what I meant FWIW: > > > > diff --git a/include/linux/phy.h b/include/linux/phy.h > > index 7addde5d14c0..829bd57b8794 100644 > > --- a/include/linux/phy.h > > +++ b/include/linux/phy.h > > @@ -1206,10 +1206,13 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) > > #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ > > timeout_us, sleep_before_read) \ > > ({ \ > > - int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ > > + int __ret, __val; \ > > + \ > > + __ret = read_poll_timeout(phy_read, __val, __val < 0 || (cond), \ > > sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ > > - if (val < 0) \ > > - __ret = val; \ > > + val = __val; This results in the sign being discarded if val is unsigned. Yes, the test is remove, which i assume will stop Smatch complaining, but it is still broken. > > + if (__val < 0) \ > > + __ret = __val; \ > > if (__ret) \ > > phydev_err(phydev, "%s failed: %d\n", __func__, __ret); \ > > __ret; \ > > I tried enabling -Wtype-limits but it's _very_ noisy :( This is a no go until GENMASK gets fixed :-( However, if that is fixed, we might be able to turn it on. But it will then trigger with this fix. So i still think a BUILD_BUG_ON is a better fix. Help developers get the code correct, rather than work around them getting it wrong. I also wonder if this needs to go down a level. Dan, how often do you see similar problems with the lower level read_poll_timeout() and friends? Andrew
On Tue, May 30, 2023 at 10:04:52PM +0200, Andrew Lunn wrote: > > > This is what I meant FWIW: > > > > > > diff --git a/include/linux/phy.h b/include/linux/phy.h > > > index 7addde5d14c0..829bd57b8794 100644 > > > --- a/include/linux/phy.h > > > +++ b/include/linux/phy.h > > > @@ -1206,10 +1206,13 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum) > > > #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ > > > timeout_us, sleep_before_read) \ > > > ({ \ > > > - int __ret = read_poll_timeout(phy_read, val, val < 0 || (cond), \ > > > + int __ret, __val; \ > > > + \ > > > + __ret = read_poll_timeout(phy_read, __val, __val < 0 || (cond), \ > > > sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ > > > - if (val < 0) \ > > > - __ret = val; \ > > > + val = __val; > > This results in the sign being discarded if val is unsigned. Yes, the > test is remove, which i assume will stop Smatch complaining, but it is > still broken. I was going to ask you to explain that, but having thought about this more, there's much bigger problems with the proposal. First, if I'm understanding you correctly, your point doesn't seem relevant, because if val is unsigned, we have an implicit cast from a signed int to an unsigned int _at_ _some_ _point_. With the existing code, that implicit cast is buried inside read_poll_timeout(), here to be exact: (val) = op(args); because "op" will be one of the phy_read*() functions that returns an "int", but "val" is unsigned - which means there's an implicit cast here. Jakub's patch moves that cast after read_poll_timeout(). The elephant in the room has nothing to do with this, but everything to do with "cond". "cond" is an expression to be evaluated inside the loop, which must have access to the value read from the phy_read*() function, and that value is referenced via whatever variable was provided via "val". So changing "val" immediately breaks "cond". Having thought about this, the best I can come up with is this, which I think gives us everything we want without needing BUILD_BUG_ONs: #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ timeout_us, sleep_before_read) \ ({ \ int __ret, __val; __ret = read_poll_timeout(__val = phy_read, val, __val < 0 || (cond), \ sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ if (__val < 0) \ __ret = __val; \ if (__ret) \ phydev_err(phydev, "%s failed: %d\n", __func__, __ret); \ __ret; \ }) This looks rather horrid, but what it essentially does is: (val) = op(args); \ if (cond) \ break; \ expands to: (val) = __val = phy_read(args); if (__val < 0 || (cond)) break; As phy_read() returns an int, there is no cast or loss assigning it to __val, since that is also an int. The conversion from int to something else happens at the same point it always has. Hmm?
On Tue, May 30, 2023 at 10:09:24PM +0100, Russell King (Oracle) wrote: > Having thought about this, the best I can come up with is this, which > I think gives us everything we want without needing BUILD_BUG_ONs: > > #define phy_read_poll_timeout(phydev, regnum, val, cond, sleep_us, \ > timeout_us, sleep_before_read) \ > ({ \ > int __ret, __val; > __ret = read_poll_timeout(__val = phy_read, val, __val < 0 || (cond), \ > sleep_us, timeout_us, sleep_before_read, phydev, regnum); \ > if (__val < 0) \ > __ret = __val; \ > if (__ret) \ > phydev_err(phydev, "%s failed: %d\n", __func__, __ret); \ > __ret; \ > }) > > This looks rather horrid, but what it essentially does is: > > (val) = op(args); \ > if (cond) \ > break; \ > > expands to: > > (val) = __val = phy_read(args); > if (__val < 0 || (cond)) > break; > > As phy_read() returns an int, there is no cast or loss assigning it > to __val, since that is also an int. The conversion from int to > something else happens at the same point it always has. ... and actually produces nicer code on 32-bit ARM: Old (with the u16 val changed to an int val): 2f8: ebfffffe bl 0 <mdiobus_read> 2fc: e7e03150 ubfx r3, r0, #2, #1 extract bit 2 into r3 300: e1a04000 mov r4, r0 save return value 304: e2002004 and r2, r0, #4 extract bit 2 again 308: e1933fa0 orrs r3, r3, r0, lsr #31 grab sign bit 30c: 1a00000d bne 348 <genphy_loopback+0xd8> breaks out of loop if r3 is nonzero ... rest of loop ... ... 348: e3520000 cmp r2, #0 34c: 0a00000b beq 380 <genphy_loopback+0x110> basically tests whether bit 2 was zero, and jumps if it was. Basically (cond) is false. 350: e3540000 cmp r4, #0 354: a3a04000 movge r4, #0 358: ba00000a blt 388 <genphy_loopback+0x118> tests whether a phy_read returned an error and jumps if it did. r4 is basically __ret. ... 380: e3540000 cmp r4, #0 384: a3e0406d mvnge r4, #109 ; 0x6d if r4 (__ret) was >= 0, sets an error code (-ETIMEDOUT). 388: e1a03004 mov r3, r4 ... dev_err() bit. The new generated code is: 2f8: ebfffffe bl 0 <mdiobus_read> 2f8: R_ARM_CALL mdiobus_read 2fc: e2504000 subs r4, r0, #0 __val assignment 300: ba000014 blt 358 <genphy_loopback+0xe8> if <0, go direct to dev_err code 304: e3140004 tst r4, #4 cond test within loop 308: 1a00000d bne 344 <genphy_loopback+0xd4> ... rest of loop ... 344: e6ff4074 uxth r4, r4 cast to 16-bit uint 348: e3140004 tst r4, #4 test 34c: 13a04000 movne r4, #0 __ret is zero if bit set 350: 1a000007 bne 374 <genphy_loopback+0x104> basically returns 354: e3e0406d mvn r4, #109 ; 0x6d ... otherwise sets __ret to -ETIMEDOUT ... dev_err() code Is there a reason why it was written (cond) || val < 0 rather than val < 0 || (cond) ? Note that the order of these tests makes no difference in this situation, but I'm wondering whether it was intentional?
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 2cad9cc3f6b8..d52dd699ae0b 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -2700,8 +2700,8 @@ EXPORT_SYMBOL(genphy_resume); int genphy_loopback(struct phy_device *phydev, bool enable) { if (enable) { - u16 val, ctl = BMCR_LOOPBACK; - int ret; + u16 ctl = BMCR_LOOPBACK; + int val, ret; ctl |= mii_bmcr_encode_fixed(phydev->speed, phydev->duplex);
The "val" variable is used to store error codes from phy_read() so it needs to be signed for the error handling to work as expected. Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> --- drivers/net/phy/phy_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)