Message ID | 8f3e7a57b40973e62c0d758922971566ca96fb2e.1710906371.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: scrub: refine the error messages output frequency | expand |
On Wed, Mar 20, 2024 at 02:24:52PM +1030, Qu Wenruo wrote: > Currently when we increase the device statistics, it would always lead > to an error message in the kernel log. > > However this output is mostly duplicated with the existing ones: > > - For scrub operations > We always have the following messages: > * "fixed up error at logical %llu" > * "unable to fixup (regular) error at logical %llu" > > So no matter if the corruption is repaired or not, it scrub would > output an error message to indicate the problem. > > - For non-scrub read operations > We also have the following messages: > * "csum failed root %lld inode %llu off %llu" for data csum mismatch > * "bad (tree block start|fsid|tree block level)" for metadata > * "read error corrected: ino %llu off %llu" for repaired data/metadata > > So the error message from btrfs_dev_stat_inc_and_print() is duplicated. > > The real usage for the btrfs device statistics is for some user space > daemon to check if there is any new errors, acting like some checks on > SMART, thus we don't really need/want those messages in dmesg. > > This patch would reduce the log level to debug (disabled by default) for > btrfs_dev_stat_inc_and_print(). > For users really want to utilize btrfs devices statistics, they should > go check "btrfs device stats" periodically, and we should focus the > kernel error messages to more important things. I kind if disagree with each point. The message is meant to be logged as it will happen in production and outside of development, so the debug level does not make sense. The stats message is not duplicated for the individual causes, it additionally tracks the whole state. Logging important messages to system log is a common thing and we do that a lot, this makes debugging and anlyzing things easier. We can't expect that there would always be a daemon collecting the stats, there's not standardized or recommended tool for that. A quick look to dmesg can show that something is wrong. What we can do: reduce the number messages so the whole stats are printed once per transaction if there is a change. We can also tune which events also print the stats, for example flush errors are more interesting than read/write, comparing the number of events that can happen in a batch.
在 2024/4/5 06:56, David Sterba 写道: > On Wed, Mar 20, 2024 at 02:24:52PM +1030, Qu Wenruo wrote: >> Currently when we increase the device statistics, it would always lead >> to an error message in the kernel log. >> >> However this output is mostly duplicated with the existing ones: >> >> - For scrub operations >> We always have the following messages: >> * "fixed up error at logical %llu" >> * "unable to fixup (regular) error at logical %llu" >> >> So no matter if the corruption is repaired or not, it scrub would >> output an error message to indicate the problem. >> >> - For non-scrub read operations >> We also have the following messages: >> * "csum failed root %lld inode %llu off %llu" for data csum mismatch >> * "bad (tree block start|fsid|tree block level)" for metadata >> * "read error corrected: ino %llu off %llu" for repaired data/metadata >> >> So the error message from btrfs_dev_stat_inc_and_print() is duplicated. >> >> The real usage for the btrfs device statistics is for some user space >> daemon to check if there is any new errors, acting like some checks on >> SMART, thus we don't really need/want those messages in dmesg. >> >> This patch would reduce the log level to debug (disabled by default) for >> btrfs_dev_stat_inc_and_print(). >> For users really want to utilize btrfs devices statistics, they should >> go check "btrfs device stats" periodically, and we should focus the >> kernel error messages to more important things. > > I kind if disagree with each point. > > The message is meant to be logged as it will happen in production and > outside of development, so the debug level does not make sense. > > The stats message is not duplicated for the individual causes, it > additionally tracks the whole state. I'd disagree with this. We already have mount time output, and detailed causes are way more useful and just a duplicated message repeating itself. > > Logging important messages to system log is a common thing and we do that > a lot, this makes debugging and anlyzing things easier. We can't > expect that there would always be a daemon collecting the stats, there's > not standardized or recommended tool for that. A quick look to dmesg can > show that something is wrong. Then try supporting cases with all these duplicated and useless messages, you'll hardly agree that they provide any usefulness. > > What we can do: reduce the number messages so the whole stats are > printed once per transaction if there is a change. > > We can also tune which events also print the stats, for example flush > errors are more interesting than read/write, comparing the number of > events that can happen in a batch. My another point is, if it's an important error, we should output it with detailed reason/cause/extra info immediately. And that's already the case for regular/scrub read errors. For critical operations like flush, we should output extra error messages, other than relying on that generic and vague error. Thanks, Qu
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e49935a54da0..126145950ed3 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7828,7 +7828,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index) if (!dev->dev_stats_valid) return; - btrfs_err_rl_in_rcu(dev->fs_info, + btrfs_debug_rl_in_rcu(dev->fs_info, "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u", btrfs_dev_name(dev), btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),