Message ID | 20220824163100.224449-2-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | coding-style.rst: document BUG() and WARN() rules | expand |
On 8/24/22 09:30, David Hildenbrand wrote: > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst > index 03eb53fd029a..a6d81ff578fe 100644 > --- a/Documentation/process/coding-style.rst > +++ b/Documentation/process/coding-style.rst > @@ -1186,6 +1186,33 @@ expression used. For instance: > #endif /* CONFIG_SOMETHING */ > I like the idea of adding this documentation, and this is the right place. Naturally, if one likes something, one must immediately change it. :) Therefore, here is an alternative writeup that I think captures what you and the email threads were saying. How's this sound? diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index 03eb53fd029a..32df0d503388 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -1185,6 +1185,53 @@ expression used. For instance: ... #endif /* CONFIG_SOMETHING */ +22) Do not crash the kernel +--------------------------- + +Use WARN() rather than BUG() +**************************** + +Do not add new code that uses any of the BUG() variants, such as BUG(), +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required +if there is no reasonable way to at least partially recover. + +Use WARN_ON_ONCE() rather than WARN() or WARN_ON() +************************************************** + +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is +common for a given warning condition, if it occurs at all, to occur multiple +times. (For example, once per file, or once per struct page.) This can fill up +and wrap the kernel log, and can even slow the system enough that the excessive +logging turns into its own, additional problem. + +Do not WARN lightly +******************* + +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*() +macros are not to be used for anything that is expected to happen during normal +operation. These are not pre- or post-condition asserts, for example. Again: +WARN*() must not be used for a condition that is expected to trigger easily, for +example, by user space actions. pr_warn_once() is a possible alternative, if you +need to notify the user of a problem. + +Do not worry about panic_on_warn users +************************************** + +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an +available kernel option, and that many users set this option. This is why there +is a "Do not WARN lightly" writeup, above. However, the existence of +panic_on_warn users is not a valid reason to avoid the judicious use WARN*(). +That is because, whoever enables panic_on_warn has explicitly asked the kernel +to crash if a WARN*() fires, and such users must be prepared to deal with the +consequences of a system that is somewhat more likely to crash. + +Use BUILD_BUG_ON() for compile-time assertions +********************************************** + +The use of BUILD_BUG_ON() is acceptable and encouraged, because it is a +compile-time assertion that has no effect at runtime. thanks,
On 24.08.22 23:59, John Hubbard wrote: > On 8/24/22 09:30, David Hildenbrand wrote: >> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst >> index 03eb53fd029a..a6d81ff578fe 100644 >> --- a/Documentation/process/coding-style.rst >> +++ b/Documentation/process/coding-style.rst >> @@ -1186,6 +1186,33 @@ expression used. For instance: >> #endif /* CONFIG_SOMETHING */ >> > > I like the idea of adding this documentation, and this is the right > place. Naturally, if one likes something, one must immediately change > it. :) Therefore, here is an alternative writeup that I think captures > what you and the email threads were saying. > > How's this sound? Much better, thanks! :) > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst > index 03eb53fd029a..32df0d503388 100644 > --- a/Documentation/process/coding-style.rst > +++ b/Documentation/process/coding-style.rst > @@ -1185,6 +1185,53 @@ expression used. For instance: > ... > #endif /* CONFIG_SOMETHING */ > > +22) Do not crash the kernel > +--------------------------- > + > +Use WARN() rather than BUG() > +**************************** > + > +Do not add new code that uses any of the BUG() variants, such as BUG(), > +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably > +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required > +if there is no reasonable way to at least partially recover. I'll tend to keep in this section: "Unavoidable data corruption / security issues might be a very rare exception to this rule and need good justification." Because there are rare exceptions, and I'd much rather document the clear exception to this rule. > + > +Use WARN_ON_ONCE() rather than WARN() or WARN_ON() > +************************************************** > + > +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is > +common for a given warning condition, if it occurs at all, to occur multiple > +times. (For example, once per file, or once per struct page.) This can fill up I'll drop the "For example" part. I feel like this doesn't really need an example -- most probably we've all been there already when the kernel log was flooded :) > +and wrap the kernel log, and can even slow the system enough that the excessive > +logging turns into its own, additional problem. > + > +Do not WARN lightly > +******************* > + > +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*() > +macros are not to be used for anything that is expected to happen during normal > +operation. These are not pre- or post-condition asserts, for example. Again: > +WARN*() must not be used for a condition that is expected to trigger easily, for > +example, by user space actions. pr_warn_once() is a possible alternative, if you > +need to notify the user of a problem. > + > +Do not worry about panic_on_warn users > +************************************** > + > +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an > +available kernel option, and that many users set this option. This is why there > +is a "Do not WARN lightly" writeup, above. However, the existence of > +panic_on_warn users is not a valid reason to avoid the judicious use WARN*(). > +That is because, whoever enables panic_on_warn has explicitly asked the kernel > +to crash if a WARN*() fires, and such users must be prepared to deal with the > +consequences of a system that is somewhat more likely to crash. Side note: especially with kdump() I feel like we might see much more widespread use of panic_on_warn to be able to actually extract debug information in a controlled manner -- for example on enterprise distros. ... which would then make these systems more likely to crash, because there is no way to distinguish a rather harmless warning from a severe warning :/ . But let's see if some kdump() folks will share their opinion as reply to the cover letter.
Hi David, [Added more people in cc] On Thu, 25 Aug 2022 at 20:13, David Hildenbrand <david@redhat.com> wrote: > > On 24.08.22 23:59, John Hubbard wrote: > > On 8/24/22 09:30, David Hildenbrand wrote: > >> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst > >> index 03eb53fd029a..a6d81ff578fe 100644 > >> --- a/Documentation/process/coding-style.rst > >> +++ b/Documentation/process/coding-style.rst > >> @@ -1186,6 +1186,33 @@ expression used. For instance: > >> #endif /* CONFIG_SOMETHING */ > >> > > > > I like the idea of adding this documentation, and this is the right > > place. Naturally, if one likes something, one must immediately change > > it. :) Therefore, here is an alternative writeup that I think captures > > what you and the email threads were saying. > > > > How's this sound? > > Much better, thanks! :) > > > > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst > > index 03eb53fd029a..32df0d503388 100644 > > --- a/Documentation/process/coding-style.rst > > +++ b/Documentation/process/coding-style.rst > > @@ -1185,6 +1185,53 @@ expression used. For instance: > > ... > > #endif /* CONFIG_SOMETHING */ > > > > +22) Do not crash the kernel > > +--------------------------- > > + > > +Use WARN() rather than BUG() > > +**************************** > > + > > +Do not add new code that uses any of the BUG() variants, such as BUG(), > > +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably > > +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required > > +if there is no reasonable way to at least partially recover. > > I'll tend to keep in this section: > > "Unavoidable data corruption / security issues might be a very rare > exception to this rule and need good justification." > > Because there are rare exceptions, and I'd much rather document the > clear exception to this rule. > > > + > > +Use WARN_ON_ONCE() rather than WARN() or WARN_ON() > > +************************************************** > > + > > +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is > > +common for a given warning condition, if it occurs at all, to occur multiple > > +times. (For example, once per file, or once per struct page.) This can fill up > > I'll drop the "For example" part. I feel like this doesn't really need > an example -- most probably we've all been there already when the kernel > log was flooded :) > > > +and wrap the kernel log, and can even slow the system enough that the excessive > > +logging turns into its own, additional problem. > > + > > +Do not WARN lightly > > +******************* > > + > > +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*() > > +macros are not to be used for anything that is expected to happen during normal > > +operation. These are not pre- or post-condition asserts, for example. Again: > > +WARN*() must not be used for a condition that is expected to trigger easily, for > > +example, by user space actions. pr_warn_once() is a possible alternative, if you > > +need to notify the user of a problem. > > + > > +Do not worry about panic_on_warn users > > +************************************** > > + > > +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an > > +available kernel option, and that many users set this option. This is why there > > +is a "Do not WARN lightly" writeup, above. However, the existence of > > +panic_on_warn users is not a valid reason to avoid the judicious use WARN*(). > > +That is because, whoever enables panic_on_warn has explicitly asked the kernel > > +to crash if a WARN*() fires, and such users must be prepared to deal with the > > +consequences of a system that is somewhat more likely to crash. > > Side note: especially with kdump() I feel like we might see much more > widespread use of panic_on_warn to be able to actually extract debug > information in a controlled manner -- for example on enterprise distros. > ... which would then make these systems more likely to crash, because > there is no way to distinguish a rather harmless warning from a severe > warning :/ . But let's see if some kdump() folks will share their > opinion as reply to the cover letter. I can understand the intention of this patch, and I totally agree that BUG() should be used carefully, this is a good proposal if we can clearly define the standard about when to use BUG(). But I do have some worries, I think this standard is different for different sub components, it is not clear to me at least, so this may introduce an unstable running kernel and cause troubles (eg. data corruption) with a WARN instead of a BUG. Probably it would be better to say "Do not WARN lightly, and do not hesitate to use BUG if it is really needed"? About "patch_on_warn", it will depend on the admin/end user to set it, it is not a good idea for distribution to set it. It seems we are leaving it to end users to take the risk of a kernel panic even with all kernel WARN even if it is sometimes not necessary. > > -- > Thanks, > > David / dhildenb > Thanks Dave
On 26.08.22 03:43, Dave Young wrote: > Hi David, > > [Added more people in cc] > Hi Dave, thanks for your input! [...] >> Side note: especially with kdump() I feel like we might see much more >> widespread use of panic_on_warn to be able to actually extract debug >> information in a controlled manner -- for example on enterprise distros. >> ... which would then make these systems more likely to crash, because >> there is no way to distinguish a rather harmless warning from a severe >> warning :/ . But let's see if some kdump() folks will share their >> opinion as reply to the cover letter. > > I can understand the intention of this patch, and I totally agree that > BUG() should be used carefully, this is a good proposal if we can > clearly define the standard about when to use BUG(). But I do have Essentially, the general rule from Linus is "absolutely no new BUG_ON() calls ever" -- but I think the consensus in that thread was that there are corner cases when it comes to unavoidable data corruption/security issues. And these are rare cases, not the usual case where we'd have used BUG_ON()/VM_BUG_ON(). > some worries, I think this standard is different for different sub > components, it is not clear to me at least, so this may introduce an > unstable running kernel and cause troubles (eg. data corruption) with > a WARN instead of a BUG. Probably it would be better to say "Do not > WARN lightly, and do not hesitate to use BUG if it is really needed"? Well, I don't make the rules, I document them and share them for general awareness/comments :) Documenting this is valuable, because there seem to be quite some different opinions floating around in the community -- and I've been learning different rules from different people over the years. > > About "patch_on_warn", it will depend on the admin/end user to set it, > it is not a good idea for distribution to set it. It seems we are > leaving it to end users to take the risk of a kernel panic even with > all kernel WARN even if it is sometimes not necessary. My question would be what we could add/improve to keep systems with kdump armed running as expected for end users, that is most probably: 1) don't crash on harmless WARN() that can just be reported and the machine will continue running mostly fine without real issues. 2) crash on severe issues (previously BUG) such that we can properly capture a system dump via kdump. The restart the machine. Of course, once one would run into 2), one could try reproducing with "panic_on_warn" to get a reasonable system dump. But I guess that's not what enterprise customers expect. One wild idea (in the cover letter) was to add something new that can be configured by user space and that expresses that something is more severe than just some warning that can be recovered easily. But it can eventually be recovered to keep the system running to some degree. But still, it's configurable if we want to trigger a panic or let the system run. John mentioned PANIC_ON(). What would be your expectation for kdump users under which conditions we want to trigger kdump and when not? Regarding panic_on_warn, how often do e.g., RHEL users observe warnings that we're not able to catch during testing, such that "panic_on_warn" would be a real no-go?
Hi David, On Sat, 27 Aug 2022 at 01:02, David Hildenbrand <david@redhat.com> wrote: > > On 26.08.22 03:43, Dave Young wrote: > > Hi David, > > > > [Added more people in cc] > > > > Hi Dave, > > thanks for your input! You are welcome :) > > [...] > > >> Side note: especially with kdump() I feel like we might see much more > >> widespread use of panic_on_warn to be able to actually extract debug > >> information in a controlled manner -- for example on enterprise distros. > >> ... which would then make these systems more likely to crash, because > >> there is no way to distinguish a rather harmless warning from a severe > >> warning :/ . But let's see if some kdump() folks will share their > >> opinion as reply to the cover letter. > > > > I can understand the intention of this patch, and I totally agree that > > BUG() should be used carefully, this is a good proposal if we can > > clearly define the standard about when to use BUG(). But I do have > > Essentially, the general rule from Linus is "absolutely no new BUG_ON() > calls ever" -- but I think the consensus in that thread was that there > are corner cases when it comes to unavoidable data corruption/security > issues. And these are rare cases, not the usual case where we'd have > used BUG_ON()/VM_BUG_ON(). Yes, probably.. (say probably because those cases are hidden and not clear sometimes) > > > some worries, I think this standard is different for different sub > > components, it is not clear to me at least, so this may introduce an > > unstable running kernel and cause troubles (eg. data corruption) with > > a WARN instead of a BUG. Probably it would be better to say "Do not > > WARN lightly, and do not hesitate to use BUG if it is really needed"? > > > Well, I don't make the rules, I document them and share them for general > awareness/comments :) Documenting this is valuable, because there seem > to be quite some different opinions floating around in the community -- > and I've been learning different rules from different people over the years. Understand. > > > > > About "patch_on_warn", it will depend on the admin/end user to set it, > > it is not a good idea for distribution to set it. It seems we are > > leaving it to end users to take the risk of a kernel panic even with > > all kernel WARN even if it is sometimes not necessary. > > My question would be what we could add/improve to keep systems with > kdump armed running as expected for end users, that is most probably: > > 1) don't crash on harmless WARN() that can just be reported and the > machine will continue running mostly fine without real issues. > 2) crash on severe issues (previously BUG) such that we can properly > capture a system dump via kdump. The restart the machine. > > Of course, once one would run into 2), one could try reproducing with > "panic_on_warn" to get a reasonable system dump. But I guess that's not > what enterprise customers expect. > Sometimes the bug can not be easily reproduced again. So there seems no easy and good way to use.. > > One wild idea (in the cover letter) was to add something new that can be > configured by user space and that expresses that something is more > severe than just some warning that can be recovered easily. But it can > eventually be recovered to keep the system running to some degree. But > still, it's configurable if we want to trigger a panic or let the system > run. > > John mentioned PANIC_ON(). > I would vote for PANIC_ON(), it sounds like a good idea, because BUG_ON() is not obvious and, PANIC_ON() can alert the code author that this will cause a kernel panic and one will be more careful before using it. > > What would be your expectation for kdump users under which conditions we > want to trigger kdump and when not? > > Regarding panic_on_warn, how often do e.g., RHEL users observe warnings > that we're not able to catch during testing, such that "panic_on_warn" > would be a real no-go? Well, I'm not sure how to answer the questions, when to panic should be decided by kernel developers instead of kdump users, but I think the panic behaviour does impact the supporting team. I added Stephen who is from the RH supporting team, maybe he can have some inputs. BTW, I vaguely remember Prarit introduced the panic_on_warn, see if he has any comments here. Thanks Dave > > -- > Thanks, > > David / dhildenb >
On Sun, Aug 28, 2022 at 6:56 PM Dave Young <dyoung@redhat.com> wrote: > > > John mentioned PANIC_ON(). > > I would vote for PANIC_ON(), it sounds like a good idea, because > BUG_ON() is not obvious and, PANIC_ON() can alert the code author that > this will cause a kernel panic and one will be more careful before > using it. People, NO. We're trying to get rid of BUG_ON() because it kills the machine. Not replace it with another bogus thing that kills a machine. So no PANIC_ON(). We used to have "panic()" many many years ago, we got rid of it. We're not re-introducing it. People who want to panic on warnings can do so. WARN_ON() _becomes_ PANIC for those people. But those people are the "we have a million machines, we want to just fail things on any sign of trouble, and we have MIS people who can look at the logs". And it's not like we need to get rid of _all_ BUG_ON() cases. If you have a "this is major internal corruption, there's no way we can continue", then BUG_ON() is appropriate. It will try to kill that process and try to keep the machine running, and again, the kind of people who don't care about one machine (because - again - they have millions of them) can just turn that into a panic-and-reboot situation. But the kind of people for whom the machine they are on IS THEIR ONLY MACHINE - whether it be a workstation, a laptop, or a cellphone - there is absolutely zero situation where "let's just kill the machine" is *EVER* approproate. Even a BUG_ON() will try to continue as well as it can after killing the current thread, but it's going to be iffy, because locking etc. So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for "oops, I really don't know what to do, and I physically *cannot* continue" (and that is *not* "I'm too lazy to do error handling"). There is no room for PANIC. None. Ever. The only thing there is are "I don't care about this machine because I've got 999,999 other machines, so I'd rather take one machine offline for analysis". Understand? The "should I panic and reboot" is fundamentally not about the code, and it's not a choice that the kernel code gets to make. It's purely about the choice of the person maintaining the machine. As a kernel developer, you do not EVER get to say "panic" or "kill the machine". End of story. Linus
On 8/28/22 20:07, Linus Torvalds wrote: > On Sun, Aug 28, 2022 at 6:56 PM Dave Young <dyoung@redhat.com> wrote: >> >>> John mentioned PANIC_ON(). >> >> I would vote for PANIC_ON(), it sounds like a good idea, because >> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that >> this will cause a kernel panic and one will be more careful before >> using it. > > People, NO. > > We're trying to get rid of BUG_ON() because it kills the machine. > > Not replace it with another bogus thing that kills a machine. OK, this guidance, and the write-up that follows it, is all very clear, except for one point... > > So no PANIC_ON(). We used to have "panic()" many many years ago, we > got rid of it. We're not re-introducing it. ...here. I count ~1000 calls to panic() in today's kernel, to a function in kernel/panic.c that shows no hint of being removed, nor even deprecated. $ git grep -nw panic\( | wc -l 1321 Could you please clarify that point? (I'm not trying to debate policy, but rather, to figure out what we should write in the documentation of the policy.) thanks,
On 29.08.22 05:07, Linus Torvalds wrote: > On Sun, Aug 28, 2022 at 6:56 PM Dave Young <dyoung@redhat.com> wrote: >> >>> John mentioned PANIC_ON(). >> >> I would vote for PANIC_ON(), it sounds like a good idea, because >> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that >> this will cause a kernel panic and one will be more careful before >> using it. > > People, NO. > > We're trying to get rid of BUG_ON() because it kills the machine. > > Not replace it with another bogus thing that kills a machine. > > So no PANIC_ON(). We used to have "panic()" many many years ago, we > got rid of it. We're not re-introducing it. > > People who want to panic on warnings can do so. WARN_ON() _becomes_ > PANIC for those people. But those people are the "we have a million > machines, we want to just fail things on any sign of trouble, and we > have MIS people who can look at the logs". > > And it's not like we need to get rid of _all_ BUG_ON() cases. If you > have a "this is major internal corruption, there's no way we can > continue", then BUG_ON() is appropriate. It will try to kill that > process and try to keep the machine running, and again, the kind of > people who don't care about one machine (because - again - they have > millions of them) can just turn that into a panic-and-reboot > situation. > > But the kind of people for whom the machine they are on IS THEIR ONLY > MACHINE - whether it be a workstation, a laptop, or a cellphone - > there is absolutely zero situation where "let's just kill the machine" > is *EVER* approproate. Even a BUG_ON() will try to continue as well as > it can after killing the current thread, but it's going to be iffy, > because locking etc. > > So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for > "oops, I really don't know what to do, and I physically *cannot* > continue" (and that is *not* "I'm too lazy to do error handling"). > > There is no room for PANIC. None. Ever. Let me clearer what I had in mind, avoiding the PANIC_ON terminology John raised. I was wondering if it would make sense to 1) Be able to specify a severity for WARN (developer decision) 2) Be able to specify a severity for panic_on_warn (admin decision) Distributions with kdump could keep a mode whereby severe warnings (e.g., former BUG_ON) would properly kdump+reboot and harmless warnings (e.g., clean recovery path) would WARN once + continue. I agree that whether to panic should in most cases be a decision of the admin, not the developer. Now, that's a different discussion then the documentation update at hand, and I primary wanted to raise awareness for the kdump people, and ask them if a stronger move towards WARN_ON_ONCE will affect them/customer expectations. I'll work with John to document the current rules to reflect everything you said here.
On Sun, 28 Aug 2022, Linus Torvalds <torvalds@linux-foundation.org> wrote: > So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for > "oops, I really don't know what to do, and I physically *cannot* > continue" (and that is *not* "I'm too lazy to do error handling"). Any insight for the tradeoff between WARN_ON_ONCE() and WARN_ON(), i.e. wasting the static once variable per use site vs. littering the dmesg on every hit? I see there have been some improvements with the __WARN_FLAGS() stuff, but is the data use really neglible? BR, Jani.
On Sun, Aug 28, 2022 at 9:49 PM John Hubbard <jhubbard@nvidia.com> wrote: > > ...here. I count ~1000 calls to panic() in today's kernel, to a > function in kernel/panic.c that shows no hint of being removed, nor > even deprecated. Heh. I guess we never finished the panic() removal. It's been decades, I suspect we ended up deciding that the bootup failures might as well continue to panic. Anyway, please don't use it. It's one of those things that should never ever trigger, and mainly for something like "oops, I ran out of memory during boot" etc. Oh, I'm sure it's crept into other places too, but that doesn't make it ok. Linus
diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index 03eb53fd029a..a6d81ff578fe 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -1186,6 +1186,33 @@ expression used. For instance: #endif /* CONFIG_SOMETHING */ +22) Do not crash the kernel +--------------------------- + +Do not add new code that uses BUG(), BUG_ON(), VM_BUG_ON(), ... to crash +the kernel if certain conditions are not met. Instead, use WARN_ON_ONCE() +with recovery code (if reasonable) instead. Unavoidable data corruption / +security issues might be a very rare exception to this rule and need good +justification. + +There is no need for WARN_ON_ONCE() recovery code if we never expect it to +possibly trigger unless something goes seriously wrong or some other code +is changed to break invariants. Note that VM_WARN_ON_ONCE() cannot be used +in conditionals. + +Usage of WARN() and friends is fine for something that is not expected to +be triggered easily. ``panic_on_warn`` users are not an excuse to not use +WARN(): whoever enables ``panic_on_warn`` explicitly asked the kernel to +crash in this case. + +However, WARN() and friends should not be used for something that is +expected to trigger easily, for example, by user space. pr_warn_once() +might be a reasonable replacement to notify the user. + +Note that BUILD_BUG_ON() is perfectly fine because it will make compilation +fail instead of crashing the kernel. + + Appendix I) References ----------------------
Linus notes [1] that the introduction of new code that uses VM_BUG_ON() is just as bad as BUG_ON(), because it will crash the kernel on distributions that enable CONFIG_DEBUG_VM (like Fedora): VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important". [2] This resulted in a more generic discussion about usage of BUG() and friends. While there might be corner cases that still deserve a BUG_ON(), most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a recovery path if reasonable: The only possible case where BUG_ON can validly be used is "I have some fundamental data corruption and cannot possibly return an error". [2] As a very good approximation is the general rule: "absolutely no new BUG_ON() calls _ever_" [2] ... not even if something really shouldn't ever happen and is merely for documenting that an invariant always has to hold. There is only one good BUG_ON(): Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine. [2] While WARN will also crash the machine with panic_on_warn set, that's exactly to be expected: So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users. [3] The basic idea is that warnings will similarly get reported by users and be found during testing. However, in contrast to a BUG(), there is a way to actually influence the expected behavior (e.g., panic_on_warn) and to eventually keep the machine alive to extract some debug info. Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever expect this code to trigger in any case, recovery code is not really helpful. I'd prefer to keep all these warnings 'simple' - i.e. no attempted recovery & control flow, unless we ever expect these to trigger. [4] There have been different rules floating around that were never properly documented. Let's try to clarify. [1] https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=NKPWHU8AdO3V56BXcCsU97oYJ1EA@mail.gmail.com [2] https://lore.kernel.org/r/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@mail.gmail.com [3] https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@mail.gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> --- Documentation/process/coding-style.rst | 27 ++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)