Message ID | 01020179f656e348-b07c8001-7b74-432a-b215-ccc7b17fbfdf-000000@eu-west-1.amazonses.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Combining nodatasum + compression | expand |
On 2021/6/10 下午10:32, Martin Raiber wrote: > Hi, > > when btrfs is running on a block device that improves integrity (e.g. > Ceph), it's usefull to run it with nodatasum to reduce the amount of > metadata and random IO. > > In that case it would also be useful to be able to run it with > compression combined with nodatasum as well. > > I actually found that if nodatasum is specified after compress-force, > that it doesn't remove reset the compress/nodatasum bit, while the other > way around it doesn't work. > > That combined with > > --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 +0200 > +++ linux-5.10.39/fs/btrfs/inode.c 2021-05-31 16:11:19.461591523 +0200 > @@ -408,8 +408,7 @@ > */ > static inline bool inode_can_compress(struct btrfs_inode *inode) > { > - if (inode->flags & BTRFS_INODE_NODATACOW || > - inode->flags & BTRFS_INODE_NODATASUM) > + if (inode->flags & BTRFS_INODE_NODATACOW) > return false; > return true; > } > > should do the trick. > > The above code was added with the argument that having no checksums with > compression would damage too much data in case of corruption ( > https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/ > ). It doesn't make a difference whether it's a single device fs or not. If we don't have csum, the corruption is not only affecting the sector where the corruption is, but the full compressed extent. Furthermore, it's not that simple. Current code we always expect compressed read to find some csum. Just check btrfs_submit_compressed_read(), it will call btrfs_lookup_bio_sums(). Which will fill the csum array with 0 if it can not find any csum. Then at endio callbacks, we verify the csum against the data we read, if it's all zero, the csum will definitely mismatch and discard the data no matter if it's correct or not. The same thing applies to btrfs_submit_compressed_write(), it will always generate csum. The diff will just give you a false sense of compression without csum. It will still generate csum for write and relies on csum check at read time. Thanks, Qu > This argument doesn't apply much to single device file systems and > even less to file systems on Ceph like volumes. >
On 11.06.2021 02:18 Qu Wenruo wrote: > On 2021/6/10 下午10:32, Martin Raiber wrote: >> when btrfs is running on a block device that improves integrity (e.g. >> Ceph), it's usefull to run it with nodatasum to reduce the amount of >> metadata and random IO. >> >> In that case it would also be useful to be able to run it with >> compression combined with nodatasum as well. >> >> I actually found that if nodatasum is specified after compress-force, >> that it doesn't remove reset the compress/nodatasum bit, while the other >> way around it doesn't work. >> >> That combined with >> >> --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 >> +0200 >> +++ linux-5.10.39/fs/btrfs/inode.c 2021-05-31 16:11:19.461591523 >> +0200 >> @@ -408,8 +408,7 @@ >> */ >> static inline bool inode_can_compress(struct btrfs_inode *inode) >> { >> - if (inode->flags & BTRFS_INODE_NODATACOW || >> - inode->flags & BTRFS_INODE_NODATASUM) >> + if (inode->flags & BTRFS_INODE_NODATACOW) >> return false; >> return true; >> } >> >> should do the trick. > >> The above code was added with the argument that having no checksums with >> compression would damage too much data in case of corruption ( >> https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/ >> ). > > It doesn't make a difference whether it's a single device fs or not. > If we don't have csum, the corruption is not only affecting the sector > where the corruption is, but the full compressed extent. Doesn't really matter if it doesn't happen if btrfs is on a e.g. ceph volume. Furthermore it depends on the use-case if corruption affecting one page, or a whole compressed block is something one can live with. > > Furthermore, it's not that simple. > > Current code we always expect compressed read to find some csum. > Just check btrfs_submit_compressed_read(), it will call > btrfs_lookup_bio_sums(). > Which will fill the csum array with 0 if it can not find any csum. It seems to make that conditional w.r.t. BTRFS_INODE_NODATASUM if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) { ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums); BUG_ON(ret); /* -ENOMEM */ }and in check_compressed_csum: if (inode->flags & BTRFS_INODE_NODATASUM) return 0; This seems to have been moved around recently ( https://github.com/torvalds/linux/commit/334c16d82cfe180f7b262a6f8ae2d9379f032b18 ). > > Then at endio callbacks, we verify the csum against the data we read, if > it's all zero, the csum will definitely mismatch and discard the data no > matter if it's correct or not. > > The same thing applies to btrfs_submit_compressed_write(), it will > always generate csum. Same here: int skip_sum = inode->flags & BTRFS_INODE_NODATASUM; ... if (!skip_sum) { ret = btrfs_csum_one_bio(inode, bio, start, 1); BUG_ON(ret); /* -ENOMEM */ } > > The diff will just give you a false sense of compression without csum. > It will still generate csum for write and relies on csum check at read > time. >
On 2021/6/11 下午6:55, Martin Raiber wrote: > On 11.06.2021 02:18 Qu Wenruo wrote: >> On 2021/6/10 下午10:32, Martin Raiber wrote: >>> when btrfs is running on a block device that improves integrity (e.g. >>> Ceph), it's usefull to run it with nodatasum to reduce the amount of >>> metadata and random IO. >>> >>> In that case it would also be useful to be able to run it with >>> compression combined with nodatasum as well. >>> >>> I actually found that if nodatasum is specified after compress-force, >>> that it doesn't remove reset the compress/nodatasum bit, while the other >>> way around it doesn't work. >>> >>> That combined with >>> >>> --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 >>> +0200 >>> +++ linux-5.10.39/fs/btrfs/inode.c 2021-05-31 16:11:19.461591523 >>> +0200 >>> @@ -408,8 +408,7 @@ >>> */ >>> static inline bool inode_can_compress(struct btrfs_inode *inode) >>> { >>> - if (inode->flags & BTRFS_INODE_NODATACOW || >>> - inode->flags & BTRFS_INODE_NODATASUM) >>> + if (inode->flags & BTRFS_INODE_NODATACOW) >>> return false; >>> return true; >>> } >>> >>> should do the trick. > >>> The above code was added with the argument that having no checksums with >>> compression would damage too much data in case of corruption ( >>> https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/ >>> ). >> >> It doesn't make a difference whether it's a single device fs or not. >> If we don't have csum, the corruption is not only affecting the sector >> where the corruption is, but the full compressed extent. > Doesn't really matter if it doesn't happen if btrfs is on a e.g. ceph > volume. But this still means, all other regular end user can hit a case where they set nodatasum for a directory and later corrupt their data. This will increase the corruption loss, if user just migrate from older kernel to newer one. Yeah, you can blame the users for doing that, but I'm still not that sure if such behavior change is fine. Especially when this increase the chance of data loss. It may be fine for ceph, but btrfs still have a wider user base to consider. > Furthermore it depends on the use-case if corruption affecting > one page, or a whole compressed block is something one can live with. That's true, but my point is, at least we shouldn't increase the data loss possibility for the end users. If in the past, we allowed NODATASUM + compression, then converting to current situation, I'm more or less fine, since it reduce the possibility of data loss. But not in the other direction. >> >> Furthermore, it's not that simple. >> >> Current code we always expect compressed read to find some csum. >> Just check btrfs_submit_compressed_read(), it will call >> btrfs_lookup_bio_sums(). >> Which will fill the csum array with 0 if it can not find any csum. > > It seems to make that conditional w.r.t. BTRFS_INODE_NODATASUM > > if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) { > ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums); > BUG_ON(ret); /* -ENOMEM */ > }and in check_compressed_csum: if (inode->flags & BTRFS_INODE_NODATASUM) > return 0; This seems to have been moved around recently ( > https://github.com/torvalds/linux/commit/334c16d82cfe180f7b262a6f8ae2d9379f032b18 > ). Forgot that patch, I should look one function deeper to confirm the behavior. That indeed reduces my concerns from the code point of view. Thanks, Qu > >> >> Then at endio callbacks, we verify the csum against the data we read, if >> it's all zero, the csum will definitely mismatch and discard the data no >> matter if it's correct or not. >> >> The same thing applies to btrfs_submit_compressed_write(), it will >> always generate csum. > > Same here: > > int skip_sum = inode->flags & BTRFS_INODE_NODATASUM; > > ... > > if (!skip_sum) { > ret = btrfs_csum_one_bio(inode, bio, start, 1); > BUG_ON(ret); /* -ENOMEM */ > } > >> >> The diff will just give you a false sense of compression without csum. >> It will still generate csum for write and relies on csum check at read >> time. >> > >
On Fri, Jun 11, 2021 at 07:15:29PM +0800, Qu Wenruo wrote: > On 2021/6/11 下午6:55, Martin Raiber wrote: > > On 11.06.2021 02:18 Qu Wenruo wrote: > >> On 2021/6/10 下午10:32, Martin Raiber wrote: > > But this still means, all other regular end user can hit a case where > they set nodatasum for a directory and later corrupt their data. So this where I'd say give users what they ask for, like the usecase where the data correctness is done on another layer or solved by completely different approach like reprovisioning in case of errors. I would certainly not recommend to turning off nodatasum in general but if thre's a usecase with known pros and cons then I'm for allowing it (if feasible on technical grounds). An example from past is (not)allowing DUP on a single device for data. The counterargument for that was something like "but it won't help on SD cards anyway", but we're not choosing hw or usecase for users. I had a prototype to do "nometasum", it's obviously simple but with lack of usecase I did not push it for a merge. If there's an interest for a checksum-free usecase we can do that. > This will increase the corruption loss, if user just migrate from older > kernel to newer one. > > Yeah, you can blame the users for doing that, but I'm still not that > sure if such behavior change is fine. > > Especially when this increase the chance of data loss. > It may be fine for ceph, but btrfs still have a wider user base to consider. So the silent change would make a difference, but it's still after a decision to do nodatasum, that gets on a directory and inherited. That it actually happens is not what user asked for, so from that point I think it's the other way around. > > Furthermore it depends on the use-case if corruption affecting > > one page, or a whole compressed block is something one can live with. > > That's true, but my point is, at least we shouldn't increase the data > loss possibility for the end users. > > If in the past, we allowed NODATASUM + compression, then converting to > current situation, I'm more or less fine, since it reduce the > possibility of data loss. I'm going to read the reasoning again.
On 2021/6/17 下午8:15, David Sterba wrote: > On Fri, Jun 11, 2021 at 07:15:29PM +0800, Qu Wenruo wrote: >> On 2021/6/11 下午6:55, Martin Raiber wrote: >>> On 11.06.2021 02:18 Qu Wenruo wrote: >>>> On 2021/6/10 下午10:32, Martin Raiber wrote: >> >> But this still means, all other regular end user can hit a case where >> they set nodatasum for a directory and later corrupt their data. > > So this where I'd say give users what they ask for, like the usecase > where the data correctness is done on another layer or solved by > completely different approach like reprovisioning in case of errors. Fine, then I guess what we need is to talk about the technical details then. > > I would certainly not recommend to turning off nodatasum in general but > if thre's a usecase with known pros and cons then I'm for allowing it > (if feasible on technical grounds). > > An example from past is (not)allowing DUP on a single device for data. > The counterargument for that was something like "but it won't help on SD > cards anyway", but we're not choosing hw or usecase for users. > > I had a prototype to do "nometasum", it's obviously simple but with lack > of usecase I did not push it for a merge. If there's an interest for a > checksum-free usecase we can do that. Already off-topic, but I doubt if it can really bring some benefit. In the old days, maybe. Nowadays, even with checksum disabled, we still have mandatory tree-checker, which on average takes a similar amount of time of csum the tree-block... > >> This will increase the corruption loss, if user just migrate from older >> kernel to newer one. >> >> Yeah, you can blame the users for doing that, but I'm still not that >> sure if such behavior change is fine. >> >> Especially when this increase the chance of data loss. >> It may be fine for ceph, but btrfs still have a wider user base to consider. > > So the silent change would make a difference, but it's still after a > decision to do nodatasum, that gets on a directory and inherited. > That it actually happens is not what user asked for, so from that point > I think it's the other way around. > >>> Furthermore it depends on the use-case if corruption affecting >>> one page, or a whole compressed block is something one can live with. >> >> That's true, but my point is, at least we shouldn't increase the data >> loss possibility for the end users. >> >> If in the past, we allowed NODATASUM + compression, then converting to >> current situation, I'm more or less fine, since it reduce the >> possibility of data loss. > > I'm going to read the reasoning again. > That's because we used to force csum check for compressed data, but it's no longer the case anymore. As Martin already stated, we skip the compressed csum for read, as long as the inode has NODATASUM flag. Thanks, Qu
--- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 +0200 +++ linux-5.10.39/fs/btrfs/inode.c 2021-05-31 16:11:19.461591523 +0200 @@ -408,8 +408,7 @@ */ static inline bool inode_can_compress(struct btrfs_inode *inode) { - if (inode->flags & BTRFS_INODE_NODATACOW || - inode->flags & BTRFS_INODE_NODATASUM) + if (inode->flags & BTRFS_INODE_NODATACOW) return false; return true; }