Message ID: 1460711302-2478-1-git-send-email-quwenruo@cn.fujitsu.com (mailing list archive)
State: Superseded
Hi,

I have a couple of queries related to btrfs-image, btrfs send, and the combination of the two.

1) If a btrfs source file system is spread across more than one disk, does btrfs-image require the same number of disks to recreate an empty file system without file content?

2) Could btrfs-image be modified to keep only a given subvolume's footprint and related metadata, so that the file system can be brought back live on a destination disk?

To elaborate: say I have 5 subvolumes on the source btrfs and I run btrfs-image, writing to a destination disk, say /dev/sdd. Could btrfs-image be modified to keep only 1 subvolume and skip the other 4 when writing to /dev/sdd, so that when I mount /dev/sdd I get a btrfs with only that one subvolume and no data?

3) If (2) is possible, could btrfs-image be further changed to include the data of the selected subvolume, so that file data is also written to /dev/sdd? That would amount to a backup of a single subvolume taken out of a btrfs file system that has more than one subvolume.

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 15, 2016 at 12:41:36PM +0000, sri wrote:
> I have couple of queries related to btrfs-image, btrfs send and with
> combination of two.
> 1)
> I would like to know if a btrfs source file system is spread across more
> than 1 disks, does btrfs-image require same number of disks to create
> empty file system without files content??

I don't _think_ you need as many devices as there were originally.

> 2) would btrfs-image can be modified to keep only given subvolume foot
> print and related meta data to bring back file system live on destination
> disk?
>
> To elaborate more on this, Lets say I have 5 subvolumes on source btrfs
> and i run btrfs-image written to destination disk say /dev/sdd. In this
> process, can btrfs-image modified to just have only 1 subvolume and skipp
> other 4 subvolumes and write to destination i.. /dev/sdd so that when I
> mount /dev/sdd , I will have btrfs with only 1 subvolume with no data.

For a first approximation, you could just drop any FS tree from the image which wasn't the target one. After that, it turns into a complicated accounting exercise to drop all of the back-refs to the missing FS trees, and to drop all the extent records for the non-shared data and the metadata for the missing FS trees.

It's probably going to be complicated, and will basically involve rewriting most of the image to avoid the metadata you didn't want.

> 3) If 3 can successful, can btrfs-image further changed to include data of
> selected subvolume which gives files data also written to /dev/sdd which
> would be kind of a backup of a subvolume taken out of a btrfs file system
> which is having more than 1 subvolumes.

If you're going to do all the hard work of (2), then (3) is a reasonable logical(?) extension.

On the other hand, what's wrong with simply using send/receive? It gives you a data structure (a FAR-format send stream) which contains everything you need to reconstruct a subvolume on a btrfs different to the original.

Hugo.
On Fri, Apr 15, 2016 at 2:49 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Apr 15, 2016 at 12:41:36PM +0000, sri wrote:
>> I would like to know if a btrfs source file system is spread across more
>> than 1 disks, does btrfs-image require same number of disks to create
>> empty file system without files content??
>
> I don't _think_ you need as many devices as there were originally.

Indeed, if you run btrfs-image -r on a dump from a multi-device fs, you get 1 big fs image. I once did that for a 4x4TB RAID10 fs (tools v4.3.x at that time, I believe), resulting in a 17TB (sparse) file.

I was expecting that option -m would create multiple files; however, scanning the source code revealed that there are things not implemented (it resulted in a 34T file that was simply not a valid fs). Either I did something wrong or there is a bug. For just my case, it was much quicker to patch the kernel so that it worked with the 17T file. There are then still issues w.r.t. devices, but data is missing, so anyhow only a limited set of tool actions or issues can be researched with such a generated image. But for a multi-TB fs, the metadata volume is acceptable (roughly in the 1G or 10G order).

I think it would make sense for the btrfs-image restore output to be splittable into multiple files, so that the multi-device aspects are better represented (or modelled).

> [...]
I tried btrfs-image, created an image file, and then ran btrfs-image -r to a different disk. Once recovered and mounted, I can see that the data is not zeroed out as mentioned in the btrfs-image man page. I tried this on the same machine.
(your email keeps ending up in gmail spam folder)

On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_sridhar@yahoo.co.in> wrote:
> I tried btrfs-image and created image file and ran btrfs-image -r to a
> different disk. Once recovered and mounted, I can able to see data is
> not zeroed out as mentioned in btrfs-image man page.

The "different disk" you mention is important info. If you do the restore to an image file, that image file is sparse and all data blocks are read as zeros.

However, if you restore to a block device, then you can assume it just writes the device blocks for metadata and leaves the rest untouched. So trim the whole device first, or brute-force overwrite it completely with zeros.

So maybe the man page needs some correction / extra notes.

> I tried on same machine.
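The sparse-file behaviour Henk describes can be seen with a small, btrfs-independent experiment (plain Python on any file system): a file extended with truncate() occupies no data blocks, and the holes read back as zeros, which is exactly why a restored image file shows zeroed file content.

```python
import os
import tempfile

# Create a 1 MiB sparse file: truncate() extends the file without
# writing any data blocks, so the whole range is a "hole".
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "r+b") as f:
    f.truncate(1024 * 1024)

st = os.stat(path)
print("apparent size:", st.st_size)            # 1048576
print("allocated bytes:", st.st_blocks * 512)  # typically 0 on a sparse-aware fs

# Holes read back as zeros -- just like the data blocks that
# btrfs-image never wrote into a restored image file.
with open(path, "rb") as f:
    data = f.read()
print(data == b"\x00" * (1024 * 1024))         # True

os.remove(path)
```

Restoring to a raw block device has no such hole mechanism, which is why the old device contents can show through there.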
On Mon, 18 Apr 2016 16:13:28 +0200 Henk Slager <eye1tm@gmail.com> wrote:
> "different disk" you mention, that is important info. If you do the
> restore to a image file, that image file is sparse and all data blocks
> are read as zeros.
>
> However, if you restore to a block device, then you can assume it just
> writes the device blocks for metadata and leaves the rest untouched.
> So trim whole device first or brute-force overwrite completely with
> zeros.
>
> So maybe the man pages needs some correction / extra notes.

Does btrfs-image store/restore the FS UUID? If it does, then potentially both the source FS and the restored one were visible at the same time to the kernel with identical UUIDs, and maybe it was actually accessing/mounting the source one.
On Mon, Apr 18, 2016 at 4:26 PM, Roman Mamedov <rm@romanrm.net> wrote:
> Does btrfs-image store/restore the FS UUID? If it does, then potentially both
> the source FS and the restored one were visible at the same time to the kernel
> with identical UUIDs, and maybe it was actually accessing/mounting the source
> one.

Very good point! The UUIDs are the same. I remember I used a VM to separate the source FS from the restored FS.

Also, the assumption I made about restoring to a block device is not correct: if you restore to a loopdev that has a file with all non-zeros as backing store, the files in the mounted restored FS are still read as zeros.
Henk Slager <eye1tm@gmail.com> writes:
> Very good point! The UUID's are the same. I remember I used a VM to
> separate the source FS from the restored FS
>
> Also, the assumption I made about restoring to a block device is not
> correct: If you restore to a loopdev that has a file with all
> non-zeros as backing-store, the files in the mounted restored FS are
> read as zeros.

Thank you for the inputs. Actually, I have tried it in the following way:

I have /dev/sdb and /dev/sdc. Using wipefs -fa I cleared both devices and created btrfs on /dev/sdb. I mounted it, wrote some files, and unmounted it.

Then I ran btrfs-image /dev/sdc /img1.img and got the dump.

Once the image was created, I again ran wipefs -fa /dev/sdb.

Then I ran btrfs-image -r /img1.img /dev/sdc and mounted /dev/sdc.

ls on the dumped file system shows the original file sizes and no file content. I guess btrfs-image doesn't modify the files' stat data, so it shows the sizes as original.

However, running cat on some of the files gives an I/O error:

qd67:/btrfs1 # cat shadow.h
cat: shadow.h: Input/output error

These errors are not on all files; on other files, since the dump doesn't contain any data, cat just returns nothing:

qd67:/btrfs1 # cat stab.h
qd67:/btrfs1 # cat stdio_ext.h

Not sure why the I/O errors are coming.
> I have /dev/sdb , /dev/sdc. Using wipefs -fa I cleared both devices and
> created btrfs on /dev/sdb. Mounted and written some files and unmounted it.
>
> Then I ran btrfs-image /dev/sdc /img1.img and got the dump.

It looks like you imaged the wrong device; that might explain the I/O errors later on.

> [...]
On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
>
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just before commit_cowonly_roots().
>
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
>
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
>
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
>
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
>
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
>
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Ok, this version seems to be giving me the right numbers. I'll send a test for it shortly. I'd still like to know if this patch introduces any unintended side effects, but otherwise, thanks Qu.
--Mark
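The failure mode Qu's commit message describes can be illustrated with a toy model (illustrative only, not kernel code): accounting compares the number of roots referencing an extent as seen in the commit root ("old") versus the current root ("new"), and skips the extent when the two are equal. Without a commit-root switch between two accounting passes, the second pass sees nr_old_roots == nr_new_roots == 0 and the earlier increment is never balanced out.

```python
# Toy model of btrfs qgroup extent accounting (illustrative only).
# "old roots" come from the commit root; "new roots" from the current
# root. An extent is only (re)accounted when the two counts differ.

class Qgroup:
    def __init__(self):
        self.rfer = 0
        self.excl = 0

def account_extent(qg, num_bytes, nr_old_roots, nr_new_roots):
    # Simplified skip condition, matching the trace in the commit
    # message: unchanged root count -> extent ignored.
    if nr_old_roots == nr_new_roots:
        return
    if nr_new_roots > nr_old_roots:   # reference added
        qg.rfer += num_bytes
        qg.excl += num_bytes
    else:                             # reference dropped
        qg.rfer -= num_bytes
        qg.excl -= num_bytes

# BUG path: pass 1 sees the extent gain a root (old=0, new=1) and adds
# 16384. No commit-root switch happens, so when the reference is
# dropped again the commit root still shows 0 old roots: old == new
# == 0 and the decrement is skipped, leaving the counters inflated.
qg = Qgroup()
account_extent(qg, 16384, 0, 1)
account_extent(qg, 16384, 0, 0)
print(qg.rfer, qg.excl)    # 16384 16384, but should be 0 0

# FIXED path: switching commit roots between the passes makes the drop
# visible as old=1 -> new=0, so it is accounted and the numbers return
# to zero.
qg2 = Qgroup()
account_extent(qg2, 16384, 0, 1)
account_extent(qg2, 16384, 1, 0)
print(qg2.rfer, qg2.excl)  # 0 0
```

This is only a sketch of the accounting invariant the patch restores; the real code paths (btrfs_qgroup_account_extent(), qgroup_update_counters()) track per-qgroup old/new counts rather than a single number.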
Sorry, that was a typo. I used the original disk /dev/sdb, where the file system was created, and I am still seeing these errors.
On Tue, 19 Apr 2016 15:19:46 -0700, Mark Fasheh wrote:
> On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
> [...]
>
> Ok, this version seems to be giving me the right numbers. I'll send a test
> for it shortly. I'd still like to know if this patch introduces an
> unintended side effects but otherwise, thanks Qu.
>
> --Mark

Hi Mark,

Can't speak about other side effects since I have not observed any so far, but I can confirm that the previously failing case of deleting a renamed snapshot [1] now works properly with v4, without getting the commit roots in a twist.

So: Tested-by: holger.hoffstaette@googlemail.com

cheers
Holger

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/55052
On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
> [...]
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> v2:
> Fix a soft lockup caused by missing switch_commit_root() call.
> Fix a warning caused by dirty-but-not-committed root.
> v3:
> Fix a bug which will cause qgroup accounting for dropping snapshot
> wrong
> v4:
> Fix a bug caused by non-cowed btree modification.
>
> To Filipe:
> I'm sorry I didn't wait for your reply on the dropped roots.
> I reverted back the version where we deleted dropped roots in
> switch_commit_roots().
>
> As I think as long as we called btrfs_qgroup_prepare_account_extents()
> and btrfs_qgroup_account_extents(), it should have already accounted
> extents for dropped roots, and then we are OK to drop them.
>
> It would be very nice if you could point out what I missed.
> Thanks
> Qu
> ---
>  fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 93 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 43885e5..92f8193 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -311,10 +311,11 @@ loop:
>   * when the transaction commits
>   */
>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
> -				struct btrfs_root *root)
> +				struct btrfs_root *root,
> +				int force)
>  {
> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> -	    root->last_trans < trans->transid) {
> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> +	    root->last_trans < trans->transid) || force) {
>  		WARN_ON(root == root->fs_info->extent_root);
>  		WARN_ON(root->commit_root != root->node);
>
> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>  	smp_wmb();
>
>  	spin_lock(&root->fs_info->fs_roots_radix_lock);
> -	if (root->last_trans == trans->transid) {
> +	if (root->last_trans == trans->transid && !force) {
>  		spin_unlock(&root->fs_info->fs_roots_radix_lock);
>  		return 0;
>  	}
> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  		return 0;
>
>  	mutex_lock(&root->fs_info->reloc_mutex);
> -	record_root_in_trans(trans, root);
> +	record_root_in_trans(trans, root, 0);
>  	mutex_unlock(&root->fs_info->reloc_mutex);
>
>  	return 0;
> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>  }
>
>  /*
> + * Do all special snapshot related qgroup dirty hack.
> + *
> + * Will do all needed qgroup inherit and dirty hack like switch commit
> + * roots inside one transaction and write all btree into disk, to make
> + * qgroup works.
> + */
> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> +				   struct btrfs_root *src,
> +				   struct btrfs_root *parent,
> +				   struct btrfs_qgroup_inherit *inherit,
> +				   u64 dst_objectid)
> +{
> +	struct btrfs_fs_info *fs_info = src->fs_info;
> +	int ret;
> +
> +	/*
> +	 * We are going to commit transaction, see btrfs_commit_transaction()
> +	 * comment for reason locking tree_log_mutex
> +	 */
> +	mutex_lock(&fs_info->tree_log_mutex);
> +
> +	ret = commit_fs_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
> +	ret = btrfs_qgroup_inherit(trans, fs_info,
> +				   src->root_key.objectid, dst_objectid,
> +				   inherit);
> +	if (ret < 0)
> +		goto out;
> +
> +	/*
> +	 * Now we do a simplified commit transaction, which will:
> +	 * 1) commit all subvolume and extent tree
> +	 *    To ensure all subvolume and extent tree have a valid
> +	 *    commit_root to accounting later insert_dir_item()
> +	 * 2) write all btree blocks onto disk
> +	 *    This is to make sure later btree modification will be cowed
> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
> +	 *    In this simplified commit, we don't really care about other trees
> +	 *    like chunk and root tree, as they won't affect qgroup.
> +	 *    And we don't write super to avoid half committed status.
> +	 */
> +	ret = commit_cowonly_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	switch_commit_roots(trans->transaction, fs_info);
> +	ret = btrfs_write_and_wait_transaction(trans, src);
> +	if (ret)
> +		btrfs_std_error(fs_info, ret,
> +			"Error while writing out transaction for qgroup");
> +
> +out:
> +	mutex_unlock(&fs_info->tree_log_mutex);
> +
> +	/*
> +	 * Force parent root to be updated, as we recorded it before so its
> +	 * last_trans == cur_transid.
> +	 * Or it won't be committed again onto disk after later
> +	 * insert_dir_item()
> +	 */
> +	if (!ret)
> +		record_root_in_trans(trans, parent, 1);
> +	return ret;
> +}

NACK, holy shit we aren't adding a special transaction commit only for qgroup snapshots. Figure out a different way. Thanks,

Josef
On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
>
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots. Figure out a different way. Thanks,

Yeah I saw that. To be fair, we run a whole lot of the transaction stuff multiple times (at least from my reading), so I'm really unclear on what the performance impact is.

Do you have any suggestion though? We've been banging our heads against this for a while now, and as slow as this patch might be, it actually works where nothing else has so far.

--Mark
On 04/22/2016 02:21 PM, Mark Fasheh wrote:
> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
> multiple times (at least from my reading) so I'm really unclear on what the
> performance impact is.
>
> Do you have any suggestion though? We've been banging our heads against this
> for a while now and as slow as this patch might be, it actually works where
> nothing else has so far.

I'm less concerned about committing another transaction and more concerned about the fact that it is a special variant of the transaction commit. If this goes wrong, or at some point in the future we fail to update it along with btrfs_transaction_commit, we suddenly are corrupting metadata. If we have to commit a transaction then call btrfs_commit_transaction(); don't open code a stripped down version, here be dragons. Thanks,

Josef
On Fri, Apr 22, 2016 at 02:23:59PM -0400, Josef Bacik wrote:
> I'm less concerned about committing another transaction and more
> concerned about the fact that it is a special variant of the
> transaction commit. If this goes wrong, or at some point in the
> future we fail to update it along with btrfs_transaction_commit we
> suddenly are corrupting metadata. If we have to commit a
> transaction then call btrfs_commit_transaction(), don't open code a
> stripped down version, here be dragons. Thanks,

Ok yeah that makes perfect sense - I thought you were telling me that this would be a big performance problem.

--Mark
Josef Bacik wrote on 2016/04/22 14:23 -0400: > On 04/22/2016 02:21 PM, Mark Fasheh wrote: >> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: >>> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>>> + /* >>>> + * Force parent root to be updated, as we recorded it before so >>>> its >>>> + * last_trans == cur_transid. >>>> + * Or it won't be committed again onto disk after later >>>> + * insert_dir_item() >>>> + */ >>>> + if (!ret) >>>> + record_root_in_trans(trans, parent, 1); >>>> + return ret; >>>> +} >>> >>> NACK, holy shit we aren't adding a special transaction commit only >>> for qgroup snapshots. Figure out a different way. Thanks, >> >> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff >> multiple times (at least from my reading) so I'm really unclear on >> what the >> performance impact is. >> >> Do you have any suggestion though? We've been banging our heads >> against this >> for a while now and as slow as this patch might be, it actually works >> where >> nothing else has so far. > > I'm less concerned about committing another transaction and more > concerned about the fact that it is an special variant of the > transaction commit. If this goes wrong, or at some point in the future > we fail to update it along with btrfs_transaction_commit we suddenly are > corrupting metadata. If we have to commit a transaction then call > btrfs_commit_transaction(), don't open code a stripped down version, > here be dragons. Thanks, > > Josef > > > Yes, I also don't like the dirty hack. Although the problem is, we have no other good choice. If we can call commit_transaction() that's the best case, but the problem is, in create_pending_snapshots(), we are already inside commit_transaction(). Or commit_transaction() can be called inside commit_transaction()? 
Thanks,
Qu
On 04/24/2016 08:56 PM, Qu Wenruo wrote: > > > Josef Bacik wrote on 2016/04/22 14:23 -0400: >> On 04/22/2016 02:21 PM, Mark Fasheh wrote: >>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: >>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>>>> + /* >>>>> + * Force parent root to be updated, as we recorded it before so >>>>> its >>>>> + * last_trans == cur_transid. >>>>> + * Or it won't be committed again onto disk after later >>>>> + * insert_dir_item() >>>>> + */ >>>>> + if (!ret) >>>>> + record_root_in_trans(trans, parent, 1); >>>>> + return ret; >>>>> +} >>>> >>>> NACK, holy shit we aren't adding a special transaction commit only >>>> for qgroup snapshots. Figure out a different way. Thanks, >>> >>> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff >>> multiple times (at least from my reading) so I'm really unclear on >>> what the >>> performance impact is. >>> >>> Do you have any suggestion though? We've been banging our heads >>> against this >>> for a while now and as slow as this patch might be, it actually works >>> where >>> nothing else has so far. >> >> I'm less concerned about committing another transaction and more >> concerned about the fact that it is an special variant of the >> transaction commit. If this goes wrong, or at some point in the future >> we fail to update it along with btrfs_transaction_commit we suddenly are >> corrupting metadata. If we have to commit a transaction then call >> btrfs_commit_transaction(), don't open code a stripped down version, >> here be dragons. Thanks, >> >> Josef >> >> >> > Yes, I also don't like the dirty hack. > > Although the problem is, we have no other good choice. > > If we can call commit_transaction() that's the best case, but the > problem is, in create_pending_snapshots(), we are already inside > commit_transaction(). > > Or commit_transaction() can be called inside commit_transaction()? > No, figure out a different way. 
IIRC I dealt with this with the no_quota flag for inc_ref/dec_ref since the copy root stuff does strange things with the reference counts, but all this code is gone now. I looked around to see if I could figure out how the refs are ending up this way but it doesn't make sense to me and there isn't enough information in your changelog for me to be able to figure it out. You've created this mess, clean it up without making it messier. Thanks,

Josef
Josef Bacik wrote on 2016/04/25 10:24 -0400: > On 04/24/2016 08:56 PM, Qu Wenruo wrote: >> >> >> Josef Bacik wrote on 2016/04/22 14:23 -0400: >>> On 04/22/2016 02:21 PM, Mark Fasheh wrote: >>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: >>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>>>>> + /* >>>>>> + * Force parent root to be updated, as we recorded it before so >>>>>> its >>>>>> + * last_trans == cur_transid. >>>>>> + * Or it won't be committed again onto disk after later >>>>>> + * insert_dir_item() >>>>>> + */ >>>>>> + if (!ret) >>>>>> + record_root_in_trans(trans, parent, 1); >>>>>> + return ret; >>>>>> +} >>>>> >>>>> NACK, holy shit we aren't adding a special transaction commit only >>>>> for qgroup snapshots. Figure out a different way. Thanks, >>>> >>>> Yeah I saw that. To be fair, we run a whole lot of the transaction >>>> stuff >>>> multiple times (at least from my reading) so I'm really unclear on >>>> what the >>>> performance impact is. >>>> >>>> Do you have any suggestion though? We've been banging our heads >>>> against this >>>> for a while now and as slow as this patch might be, it actually works >>>> where >>>> nothing else has so far. >>> >>> I'm less concerned about committing another transaction and more >>> concerned about the fact that it is an special variant of the >>> transaction commit. If this goes wrong, or at some point in the future >>> we fail to update it along with btrfs_transaction_commit we suddenly are >>> corrupting metadata. If we have to commit a transaction then call >>> btrfs_commit_transaction(), don't open code a stripped down version, >>> here be dragons. Thanks, >>> >>> Josef >>> >>> >>> >> Yes, I also don't like the dirty hack. >> >> Although the problem is, we have no other good choice. >> >> If we can call commit_transaction() that's the best case, but the >> problem is, in create_pending_snapshots(), we are already inside >> commit_transaction(). 
>>
>> Or commit_transaction() can be called inside commit_transaction()?
>>
>
> No, figure out a different way. IIRC I dealt with this with the
> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
> things with the reference counts, but all this code is gone now. I
> looked around to see if I could figure out how the refs are ending up
> this way but it doesn't make sense to me and there isn't enough
> information in your changelog for me to be able to figure it out. You've
> created this mess, clean it up without making it messier. Thanks,
>
> Josef
>

Unfortunately, your original no_quota flag just hid the bug, and hid it in a bad way.

Originally, the no_quota flag was used for cases like this: skip quota at snapshot creation, and use qgroup_inherit() to hack the quota accounting. It seems to work, but in fact, if the DIR_ITEM insert needs to create a new cousin leaf, then quota is messed up.

Your quota rework doesn't really help either, as it doesn't even account things correctly; just check fstests btrfs/091 on a 4.1 kernel.

The only complete fix for this already nasty subvolume creation path is a full subtree rescan. Otherwise no one knows when higher-level qgroups will be broken.

If you think splitting commit_transaction() into two variants can cause problems, I can merge the two variants into one, as in btrfs_commit_transaction() the commit process is much the same as the one used in create_pending_snapshot(). If there is only one __commit_roots() doing such a commit, then there is nothing special just for quota.

Thanks,
Qu
On 04/25/2016 08:35 PM, Qu Wenruo wrote: > > > Josef Bacik wrote on 2016/04/25 10:24 -0400: >> On 04/24/2016 08:56 PM, Qu Wenruo wrote: >>> >>> >>> Josef Bacik wrote on 2016/04/22 14:23 -0400: >>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote: >>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: >>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>>>>>> + /* >>>>>>> + * Force parent root to be updated, as we recorded it before so >>>>>>> its >>>>>>> + * last_trans == cur_transid. >>>>>>> + * Or it won't be committed again onto disk after later >>>>>>> + * insert_dir_item() >>>>>>> + */ >>>>>>> + if (!ret) >>>>>>> + record_root_in_trans(trans, parent, 1); >>>>>>> + return ret; >>>>>>> +} >>>>>> >>>>>> NACK, holy shit we aren't adding a special transaction commit only >>>>>> for qgroup snapshots. Figure out a different way. Thanks, >>>>> >>>>> Yeah I saw that. To be fair, we run a whole lot of the transaction >>>>> stuff >>>>> multiple times (at least from my reading) so I'm really unclear on >>>>> what the >>>>> performance impact is. >>>>> >>>>> Do you have any suggestion though? We've been banging our heads >>>>> against this >>>>> for a while now and as slow as this patch might be, it actually works >>>>> where >>>>> nothing else has so far. >>>> >>>> I'm less concerned about committing another transaction and more >>>> concerned about the fact that it is an special variant of the >>>> transaction commit. If this goes wrong, or at some point in the future >>>> we fail to update it along with btrfs_transaction_commit we suddenly >>>> are >>>> corrupting metadata. If we have to commit a transaction then call >>>> btrfs_commit_transaction(), don't open code a stripped down version, >>>> here be dragons. Thanks, >>>> >>>> Josef >>>> >>>> >>>> >>> Yes, I also don't like the dirty hack. >>> >>> Although the problem is, we have no other good choice. 
>>>
>>> If we can call commit_transaction() that's the best case, but the
>>> problem is, in create_pending_snapshots(), we are already inside
>>> commit_transaction().
>>>
>>> Or commit_transaction() can be called inside commit_transaction()?
>>>
>>
>> No, figure out a different way. IIRC I dealt with this with the
>> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
>> things with the reference counts, but all this code is gone now. I
>> looked around to see if I could figure out how the refs are ending up
>> this way but it doesn't make sense to me and there isn't enough
>> information in your changelog for me to be able to figure it out. You've
>> created this mess, clean it up without making it messier. Thanks,
>>
>> Josef
>>
>>
> Unfortunately, your original no_quota flag just hide the bug, and hide
> it in a bad method.
>
> Originally, no_quota flag is used for case like this, to skip quota at
> snapshot creation, and use quota_inherit() to hack the quota accounting.
> It seems work, but in fact, if the DIR_ITEM insert need to create a new
> cousin leaf, then quota is messed up.
>

No, and this is the problem: you fundamentally didn't understand what I wrote, and instead of trying to understand it and fix the bug you just threw it all away. The no_quota stuff was not a hack; it was put in place to deal with refs we already knew were accounted for, such as when we converted to mixed refs or other such operations.

There were bugs in my rework, but now the situation is untenable. What we now have is something that holds delayed refs in memory for the entire transaction, which is a non-starter for anybody who wants to use this in production. On our gluster machines we get millions of delayed refs per transaction, and then there are multiple file systems. Then you have to post-process all of that during the critical section of the commit?
So now suddenly I'm walking millions of delayed refs doing accounting all at once; that is going to cause commit latencies in the seconds, which is completely unacceptable.

Anyway, I spent some time this morning reading through the new stuff to figure out how it works now, and I've got a patch to fix this problem that doesn't involve screwing with the transaction commit stuff at all. I sent it along separately:

  Btrfs: fix qgroup accounting when snapshotting

This fixes the basic case that was described originally. I haven't tested it other than that, but I'm pretty sure it is correct. Thanks,

Josef
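[Editorial sketch: the performance dispute above turns on how many items must be post-processed at commit time — individual delayed ref updates, or the distinct extents they touch. A minimal model (not btrfs code; all names and numbers are illustrative) of how a large ref-update stream collapses into a much smaller set of dirty extents:]

```python
# Hedged model: many delayed ref *updates* may touch a far smaller set
# of distinct extents; per-extent accounting post-processes the set,
# not the update stream. Names and numbers are illustrative only.
import random

random.seed(42)
NODESIZE = 16384
N_DISTINCT_EXTENTS = 10_000        # extents actually touched this transaction
N_DELAYED_REF_UPDATES = 1_000_000  # individual ref add/drop operations

dirty_extents = set()
for _ in range(N_DELAYED_REF_UPDATES):
    bytenr = random.randrange(N_DISTINCT_EXTENTS) * NODESIZE
    dirty_extents.add(bytenr)      # each touched extent is recorded once

# Commit-time accounting work is bounded by distinct extents,
# not by the number of ref updates.
print(len(dirty_extents) <= N_DISTINCT_EXTENTS)  # True
```

Whether that per-extent set is still too large for the commit critical section is exactly the point argued in the following replies.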
Josef Bacik wrote on 2016/04/26 10:26 -0400: > On 04/25/2016 08:35 PM, Qu Wenruo wrote: >> >> >> Josef Bacik wrote on 2016/04/25 10:24 -0400: >>> On 04/24/2016 08:56 PM, Qu Wenruo wrote: >>>> >>>> >>>> Josef Bacik wrote on 2016/04/22 14:23 -0400: >>>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote: >>>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik >>>>>> wrote: >>>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>>>>>>> + /* + * Force parent root to be updated, as we >>>>>>>> recorded it before so its + * last_trans == >>>>>>>> cur_transid. + * Or it won't be committed again >>>>>>>> onto disk after later + * insert_dir_item() + >>>>>>>> */ + if (!ret) + record_root_in_trans(trans, >>>>>>>> parent, 1); + return ret; +} >>>>>>> >>>>>>> NACK, holy shit we aren't adding a special transaction >>>>>>> commit only for qgroup snapshots. Figure out a different >>>>>>> way. Thanks, >>>>>> >>>>>> Yeah I saw that. To be fair, we run a whole lot of the >>>>>> transaction stuff multiple times (at least from my reading) >>>>>> so I'm really unclear on what the performance impact is. >>>>>> >>>>>> Do you have any suggestion though? We've been banging our >>>>>> heads against this for a while now and as slow as this >>>>>> patch might be, it actually works where nothing else has so >>>>>> far. >>>>> >>>>> I'm less concerned about committing another transaction and >>>>> more concerned about the fact that it is an special variant >>>>> of the transaction commit. If this goes wrong, or at some >>>>> point in the future we fail to update it along with >>>>> btrfs_transaction_commit we suddenly are corrupting metadata. >>>>> If we have to commit a transaction then call >>>>> btrfs_commit_transaction(), don't open code a stripped down >>>>> version, here be dragons. Thanks, >>>>> >>>>> Josef >>>>> >>>>> >>>>> >>>> Yes, I also don't like the dirty hack. >>>> >>>> Although the problem is, we have no other good choice. 
>>>>
>>>> If we can call commit_transaction() that's the best case, but
>>>> the problem is, in create_pending_snapshots(), we are already
>>>> inside commit_transaction().
>>>>
>>>> Or commit_transaction() can be called inside
>>>> commit_transaction()?
>>>>
>>>
>>> No, figure out a different way. IIRC I dealt with this with the
>>> no_quota flag for inc_ref/dec_ref since the copy root stuff does
>>> strange things with the reference counts, but all this code is
>>> gone now. I looked around to see if I could figure out how the
>>> refs are ending up this way but it doesn't make sense to me and
>>> there isn't enough information in your changelog for me to be
>>> able to figure it out. You've created this mess, clean it up
>>> without making it messier. Thanks,
>>>
>>> Josef
>>>
>>>
>> Unfortunately, your original no_quota flag just hide the bug, and
>> hide it in a bad method.
>>
>> Originally, no_quota flag is used for case like this, to skip quota
>> at snapshot creation, and use quota_inherit() to hack the quota
>> accounting. It seems work, but in fact, if the DIR_ITEM insert need
>> to create a new cousin leaf, then quota is messed up.
>>
>
> No, and this is the problem, you fundamentally didn't understand what
> I wrote, and instead of trying to understand it and fix the bug you
> just threw it all away. The no_quota stuff was not a hack, it was
> put in place to deal with refs we already knew where accounted for,
> such as when we converted to mixed refs or other such operations.

Even if we know it has been accounted, it's still a hack to jump out of the normal accounting routine. Just like what we do in qgroup_inherit(): we assume snapshot creation will always set the exclusive count to nodesize. But in fact, we didn't consider the higher-level qgroups, and qgroup_inherit() just messes up that case.

And yes, I DID throw the old qgroup code away, as there were too many existing and potential bugs, not only in qgroup but also in backref walking. Backref walking can't handle time_seq really well; that's one of the reasons for btrfs/091. And any newcomer must handle the extra no_quota flag, which makes it easier to introduce new regressions.

>
> There were bugs in my rework, but now the situation is untenable.
> What we now have is something that holds delayed refs in memory for
> the entire transaction, which is a non-starter for anybody who wants
> to use this in production.

Nope, we don't hold delayed refs in the current qgroup code. Delayed refs can be flushed to disk at any time, just as before. We only record which bytenrs will go through the qgroup routine; it has nothing to do with delayed refs. Yes, you can have millions of delayed refs, but there won't be millions of different extents, or sync_fs() would already have been triggered.

> On our gluster machines we get millions of delayed refs per
> transaction, and then there are multiple file systems. Then you have
> to post-process all of that during the critical section of the
> commit? So now suddenly I'm walking millions of delayed refs doing
> accounting all at once, that is going to cause commit latencies in
> the seconds which is completely unacceptable.

In fact the qgroup code never accounts or walks through delayed refs. Qgroup code only walks through backrefs in the extent tree (the commit tree, and the one just before commit). So "millions" is already reduced to a much smaller amount.

Also, in the old qgroup code, each inc/dec_extent_ref() walked through all backrefs. So in special cases like in-band dedupe, thousands of inc_extent_ref() calls caused thousands of backref walks, and no matter when you do the walk, it hugely delays the transaction. In the new qgroup code, it is at most twice for any dirty extent in one transaction. Even in the best case it is the same as the old code, and much, much faster in the worst case.

Furthermore, since the qgroup code doesn't bother with the already b**lsh*t delayed refs, it only needs to care about the extent tree. This also avoids the backref bugs in walking delayed refs.

The root problem, though, is backref walking itself. It's too time consuming, and we need some cache for it, or dedupe (in-band, out-of-band, or reflink) can easily screw it up through fiemap. Just create a 1G file with all its extents pointing to a single 128K one; then fiemap will just hang your system.

>
> Anyway I spent some time this morning reading through the new stuff
> to figure out how it works now and I've got a patch to fix this
> problem that doesn't involve screwing with the transaction commit
> stuff at all. I sent it along separately
>
> Btrfs: fix qgroup accounting when snapshotting
>
> This fixes the basic case that was described originally. I haven't
> tested it other than that but I'm pretty sure it is correct.
> Thanks,
>
> Josef
>

Unfortunately, it isn't correct and doesn't even fix the bug. I'll comment in that thread.

Thanks,
Qu
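[Editorial sketch of the fiemap blow-up Qu describes above: if every file extent of a 1 GiB file points at the same shared 128 KiB extent, a naive per-extent "is this shared?" check walks all backrefs of that extent each time. A back-of-the-envelope cost model, not btrfs code:]

```python
# Hedged model: backref visits during fiemap for a file built entirely
# from references to one shared physical extent.
FILE_SIZE = 1 << 30      # 1 GiB file
EXTENT_SIZE = 128 << 10  # every file extent maps to the same 128 KiB extent

def naive_backref_visits(file_size, extent_size):
    """Each of the N file extents triggers a walk over all N backrefs
    of the single shared physical extent: O(N^2) total visits."""
    n = file_size // extent_size
    return n * n

def cached_backref_visits(file_size, extent_size):
    """With a shared-status cache (as Qu suggests), the full walk
    happens once and later extents hit the cache: O(N)."""
    return file_size // extent_size

print(naive_backref_visits(FILE_SIZE, EXTENT_SIZE))   # 67108864
print(cached_backref_visits(FILE_SIZE, EXTENT_SIZE))  # 8192
```

8192 file extents times 8192 backrefs each is ~67 million visits for one fiemap call, which is why heavily deduped files can appear to hang the system.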
Hi Josef, On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: > On 04/15/2016 05:08 AM, Qu Wenruo wrote: > >Current btrfs qgroup design implies a requirement that after calling > >btrfs_qgroup_account_extents() there must be a commit root switch. > > > >Normally this is OK, as btrfs_qgroup_accounting_extents() is only called > >inside btrfs_commit_transaction() just be commit_cowonly_roots(). > > > >However there is a exception at create_pending_snapshot(), which will > >call btrfs_qgroup_account_extents() but no any commit root switch. > > > >In case of creating a snapshot whose parent root is itself (create a > >snapshot of fs tree), it will corrupt qgroup by the following trace: > >(skipped unrelated data) > >====== > >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1 > >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0 > >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384 > >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0 > >====== > > > >The problem here is in first qgroup_account_extent(), the > >nr_new_roots of the extent is 1, which means its reference got > >increased, and qgroup increased its rfer and excl. > > > >But at second qgroup_account_extent(), its reference got decreased, but > >between these two qgroup_account_extent(), there is no switch roots. > >This leads to the same nr_old_roots, and this extent just got ignored by > >qgroup, which means this extent is wrongly accounted. > > > >Fix it by call commit_cowonly_roots() after qgroup_account_extent() in > >create_pending_snapshot(), with needed preparation. > > > >Reported-by: Mark Fasheh <mfasheh@suse.de> > >Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > >--- > >v2: > > Fix a soft lockup caused by missing switch_commit_root() call. > > Fix a warning caused by dirty-but-not-committed root. 
> >v3: > > Fix a bug which will cause qgroup accounting for dropping snapshot > > wrong > >v4: > > Fix a bug caused by non-cowed btree modification. > > > >To Filipe: > > I'm sorry I didn't wait for your reply on the dropped roots. > > I reverted back the version where we deleted dropped roots in > > switch_commit_roots(). > > > > As I think as long as we called btrfs_qgroup_prepare_account_extents() > > and btrfs_qgroup_account_extents(), it should have already accounted > > extents for dropped roots, and then we are OK to drop them. > > > > It would be very nice if you could point out what I missed. > > Thanks > > Qu > >--- > > fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++---------- > > 1 file changed, 93 insertions(+), 24 deletions(-) > > > >diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > >index 43885e5..92f8193 100644 > >--- a/fs/btrfs/transaction.c > >+++ b/fs/btrfs/transaction.c > >@@ -311,10 +311,11 @@ loop: > > * when the transaction commits > > */ > > static int record_root_in_trans(struct btrfs_trans_handle *trans, > >- struct btrfs_root *root) > >+ struct btrfs_root *root, > >+ int force) > > { > >- if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) && > >- root->last_trans < trans->transid) { > >+ if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) && > >+ root->last_trans < trans->transid) || force) { > > WARN_ON(root == root->fs_info->extent_root); > > WARN_ON(root->commit_root != root->node); > > > >@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans, > > smp_wmb(); > > > > spin_lock(&root->fs_info->fs_roots_radix_lock); > >- if (root->last_trans == trans->transid) { > >+ if (root->last_trans == trans->transid && !force) { > > spin_unlock(&root->fs_info->fs_roots_radix_lock); > > return 0; > > } > >@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans, > > return 0; > > > > mutex_lock(&root->fs_info->reloc_mutex); > >- record_root_in_trans(trans, root); > 
>+ record_root_in_trans(trans, root, 0); > > mutex_unlock(&root->fs_info->reloc_mutex); > > > > return 0; > >@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root) > > } > > > > /* > >+ * Do all special snapshot related qgroup dirty hack. > >+ * > >+ * Will do all needed qgroup inherit and dirty hack like switch commit > >+ * roots inside one transaction and write all btree into disk, to make > >+ * qgroup works. > >+ */ > >+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans, > >+ struct btrfs_root *src, > >+ struct btrfs_root *parent, > >+ struct btrfs_qgroup_inherit *inherit, > >+ u64 dst_objectid) > >+{ > >+ struct btrfs_fs_info *fs_info = src->fs_info; > >+ int ret; > >+ > >+ /* > >+ * We are going to commit transaction, see btrfs_commit_transaction() > >+ * comment for reason locking tree_log_mutex > >+ */ > >+ mutex_lock(&fs_info->tree_log_mutex); > >+ > >+ ret = commit_fs_roots(trans, src); > >+ if (ret) > >+ goto out; > >+ ret = btrfs_qgroup_prepare_account_extents(trans, fs_info); > >+ if (ret < 0) > >+ goto out; > >+ ret = btrfs_qgroup_account_extents(trans, fs_info); > >+ if (ret < 0) > >+ goto out; > >+ > >+ /* Now qgroup are all updated, we can inherit it to new qgroups */ > >+ ret = btrfs_qgroup_inherit(trans, fs_info, > >+ src->root_key.objectid, dst_objectid, > >+ inherit); > >+ if (ret < 0) > >+ goto out; > >+ > >+ /* > >+ * Now we do a simplified commit transaction, which will: > >+ * 1) commit all subvolume and extent tree > >+ * To ensure all subvolume and extent tree have a valid > >+ * commit_root to accounting later insert_dir_item() > >+ * 2) write all btree blocks onto disk > >+ * This is to make sure later btree modification will be cowed > >+ * Or commit_root can be populated and cause wrong qgroup numbers > >+ * In this simplified commit, we don't really care about other trees > >+ * like chunk and root tree, as they won't affect qgroup. > >+ * And we don't write super to avoid half committed status. 
> >+	 */
> >+	ret = commit_cowonly_roots(trans, src);
> >+	if (ret)
> >+		goto out;
> >+	switch_commit_roots(trans->transaction, fs_info);
> >+	ret = btrfs_write_and_wait_transaction(trans, src);
> >+	if (ret)
> >+		btrfs_std_error(fs_info, ret,
> >+			"Error while writing out transaction for qgroup");
> >+
> >+out:
> >+	mutex_unlock(&fs_info->tree_log_mutex);
> >+
> >+	/*
> >+	 * Force parent root to be updated, as we recorded it before so its
> >+	 * last_trans == cur_transid.
> >+	 * Or it won't be committed again onto disk after later
> >+	 * insert_dir_item()
> >+	 */
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
>
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots. Figure out a different way. Thanks,

Unfortunately I think we're going to have to swallow our pride on this one :(

Per our conversations on irc, and my detailed observations in this e-mail:

https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2

It seems like we have a core problem in that root counting during snapshot create is unreliable and leads to corrupted qgroups. You add into that the poor assumptions made by the rest of the code (such as qgroup_inherit() always marking dst->excl = node_size) and it looks like we won't have a proper fix until another qgroup rewrite.

In the meantime, this would make qgroup numbers correct again. If we add a single check in there so it only runs when qgroups are enabled, we can mitigate the performance impact. If I send that patch would you be ok to ACK it this time around?

Thanks,
	--Mark

--
Mark Fasheh
On 05/11/2016 09:57 AM, Mark Fasheh wrote: > Hi Josef, > > On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote: >> On 04/15/2016 05:08 AM, Qu Wenruo wrote: >>> Current btrfs qgroup design implies a requirement that after calling >>> btrfs_qgroup_account_extents() there must be a commit root switch. >>> >>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called >>> inside btrfs_commit_transaction() just be commit_cowonly_roots(). >>> >>> However there is a exception at create_pending_snapshot(), which will >>> call btrfs_qgroup_account_extents() but no any commit root switch. >>> >>> In case of creating a snapshot whose parent root is itself (create a >>> snapshot of fs tree), it will corrupt qgroup by the following trace: >>> (skipped unrelated data) >>> ====== >>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1 >>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0 >>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384 >>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0 >>> ====== >>> >>> The problem here is in first qgroup_account_extent(), the >>> nr_new_roots of the extent is 1, which means its reference got >>> increased, and qgroup increased its rfer and excl. >>> >>> But at second qgroup_account_extent(), its reference got decreased, but >>> between these two qgroup_account_extent(), there is no switch roots. >>> This leads to the same nr_old_roots, and this extent just got ignored by >>> qgroup, which means this extent is wrongly accounted. >>> >>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in >>> create_pending_snapshot(), with needed preparation. 
>>> >>> Reported-by: Mark Fasheh <mfasheh@suse.de> >>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> --- >>> v2: >>> Fix a soft lockup caused by missing switch_commit_root() call. >>> Fix a warning caused by dirty-but-not-committed root. >>> v3: >>> Fix a bug which will cause qgroup accounting for dropping snapshot >>> wrong >>> v4: >>> Fix a bug caused by non-cowed btree modification. >>> >>> To Filipe: >>> I'm sorry I didn't wait for your reply on the dropped roots. >>> I reverted back the version where we deleted dropped roots in >>> switch_commit_roots(). >>> >>> As I think as long as we called btrfs_qgroup_prepare_account_extents() >>> and btrfs_qgroup_account_extents(), it should have already accounted >>> extents for dropped roots, and then we are OK to drop them. >>> >>> It would be very nice if you could point out what I missed. >>> Thanks >>> Qu >>> --- >>> fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++---------- >>> 1 file changed, 93 insertions(+), 24 deletions(-) >>> >>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c >>> index 43885e5..92f8193 100644 >>> --- a/fs/btrfs/transaction.c >>> +++ b/fs/btrfs/transaction.c >>> @@ -311,10 +311,11 @@ loop: >>> * when the transaction commits >>> */ >>> static int record_root_in_trans(struct btrfs_trans_handle *trans, >>> - struct btrfs_root *root) >>> + struct btrfs_root *root, >>> + int force) >>> { >>> - if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) && >>> - root->last_trans < trans->transid) { >>> + if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) && >>> + root->last_trans < trans->transid) || force) { >>> WARN_ON(root == root->fs_info->extent_root); >>> WARN_ON(root->commit_root != root->node); >>> >>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans, >>> smp_wmb(); >>> >>> spin_lock(&root->fs_info->fs_roots_radix_lock); >>> - if (root->last_trans == trans->transid) { >>> + if (root->last_trans == trans->transid && !force) { 
>>>  		spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>>  		return 0;
>>>  	}
>>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>>  		return 0;
>>>  
>>>  	mutex_lock(&root->fs_info->reloc_mutex);
>>> -	record_root_in_trans(trans, root);
>>> +	record_root_in_trans(trans, root, 0);
>>>  	mutex_unlock(&root->fs_info->reloc_mutex);
>>>  
>>>  	return 0;
>>> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>>  }
>>>  
>>>  /*
>>> + * Do all special snapshot related qgroup dirty hack.
>>> + *
>>> + * Will do all needed qgroup inherit and dirty hack like switch commit
>>> + * roots inside one transaction and write all btree into disk, to make
>>> + * qgroup works.
>>> + */
>>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>> +				   struct btrfs_root *src,
>>> +				   struct btrfs_root *parent,
>>> +				   struct btrfs_qgroup_inherit *inherit,
>>> +				   u64 dst_objectid)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = src->fs_info;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * We are going to commit transaction, see btrfs_commit_transaction()
>>> +	 * comment for reason locking tree_log_mutex
>>> +	 */
>>> +	mutex_lock(&fs_info->tree_log_mutex);
>>> +
>>> +	ret = commit_fs_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
>>> +	ret = btrfs_qgroup_inherit(trans, fs_info,
>>> +				   src->root_key.objectid, dst_objectid,
>>> +				   inherit);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * Now we do a simplified commit transaction, which will:
>>> +	 * 1) commit all subvolume and extent tree
>>> +	 *    To ensure all subvolume and extent tree have a valid
>>> +	 *    commit_root to accounting later insert_dir_item()
>>> +	 * 2) write all btree blocks onto disk
>>> +	 *    This is to make sure later btree modification will be cowed
>>> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
>>> +	 * In this simplified commit, we don't really care about other trees
>>> +	 * like chunk and root tree, as they won't affect qgroup.
>>> +	 * And we don't write super to avoid half committed status.
>>> +	 */
>>> +	ret = commit_cowonly_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	switch_commit_roots(trans->transaction, fs_info);
>>> +	ret = btrfs_write_and_wait_transaction(trans, src);
>>> +	if (ret)
>>> +		btrfs_std_error(fs_info, ret,
>>> +				"Error while writing out transaction for qgroup");
>>> +
>>> +out:
>>> +	mutex_unlock(&fs_info->tree_log_mutex);
>>> +
>>> +	/*
>>> +	 * Force parent root to be updated, as we recorded it before so its
>>> +	 * last_trans == cur_transid.
>>> +	 * Or it won't be committed again onto disk after later
>>> +	 * insert_dir_item()
>>> +	 */
>>> +	if (!ret)
>>> +		record_root_in_trans(trans, parent, 1);
>>> +	return ret;
>>> +}
>>
>> NACK, holy shit we aren't adding a special transaction commit only
>> for qgroup snapshots.  Figure out a different way.  Thanks,
>
> Unfortunately I think we're going to have to swallow our pride on this
> one :(
>
> Per our conversations on irc, and my detailed observations in this e-mail:
>
> https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
>
> It seems like we have a core problem in that root counting during snapshot
> create is unreliable and leads to corrupted qgroups. You add into that the
> poor assumptions made by the rest of the code (such as qgroup_inherit()
> always marking dst->excl = node_size) and it looks like we won't have a
> proper fix until another qgroup rewrite.
>
> In the meantime, this would make qgroup numbers correct again. If we drop
> a single check in there to only run when qgroups are enabled, we can
> mitigate the performance impact. If I send that patch would you be ok to
> ACK it this time around?

Yeah, as long as it's limited to only the qgroup case then I'm fine with
it for now.  Thanks,

Josef
======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is in the first qgroup_account_extent(): the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But at the second qgroup_account_extent(), its reference got decreased,
yet between these two qgroup_account_extent() calls there was no commit
root switch. This leads to the same nr_old_roots, so the extent just
got ignored by qgroup, which means this extent is wrongly accounted.

Fix it by calling commit_cowonly_roots() after qgroup_account_extent()
in create_pending_snapshot(), with the needed preparation.

Reported-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v2:
  Fix a soft lockup caused by missing switch_commit_root() call.
  Fix a warning caused by dirty-but-not-committed root.
v3:
  Fix a bug which caused wrong qgroup accounting when dropping a
  snapshot.
v4:
  Fix a bug caused by non-cowed btree modification.

To Filipe:

I'm sorry I didn't wait for your reply on the dropped roots.

I reverted back to the version where we deleted dropped roots in
switch_commit_roots(), as I think as long as we have called
btrfs_qgroup_prepare_account_extents() and
btrfs_qgroup_account_extents(), extents for dropped roots should
already have been accounted, and then we are OK to drop them.

It would be very nice if you could point out what I missed.
Thanks,
Qu
---
 fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..92f8193 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -311,10 +311,11 @@ loop:
  * when the transaction commits
  */
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root)
+			       struct btrfs_root *root,
+			       int force)
 {
-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
-	    root->last_trans < trans->transid) {
+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
+	    root->last_trans < trans->transid) || force) {
 		WARN_ON(root == root->fs_info->extent_root);
 		WARN_ON(root->commit_root != root->node);
@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();

 		spin_lock(&root->fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid) {
+		if (root->last_trans == trans->transid && !force) {
 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 		return 0;

 	mutex_lock(&root->fs_info->reloc_mutex);
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	mutex_unlock(&root->fs_info->reloc_mutex);

 	return 0;
@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
 }

 /*
+ * Do all special snapshot related qgroup dirty hack.
+ *
+ * Will do all needed qgroup inherit and dirty hack like switch commit
+ * roots inside one transaction and write all btree into disk, to make
+ * qgroup works.
+ */
+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *src,
+				   struct btrfs_root *parent,
+				   struct btrfs_qgroup_inherit *inherit,
+				   u64 dst_objectid)
+{
+	struct btrfs_fs_info *fs_info = src->fs_info;
+	int ret;
+
+	/*
+	 * We are going to commit transaction, see btrfs_commit_transaction()
+	 * comment for reason locking tree_log_mutex
+	 */
+	mutex_lock(&fs_info->tree_log_mutex);
+
+	ret = commit_fs_roots(trans, src);
+	if (ret)
+		goto out;
+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* Now qgroup are all updated, we can inherit it to new qgroups */
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   src->root_key.objectid, dst_objectid,
+				   inherit);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Now we do a simplified commit transaction, which will:
+	 * 1) commit all subvolume and extent tree
+	 *    To ensure all subvolume and extent tree have a valid
+	 *    commit_root to accounting later insert_dir_item()
+	 * 2) write all btree blocks onto disk
+	 *    This is to make sure later btree modification will be cowed
+	 *    Or commit_root can be populated and cause wrong qgroup numbers
+	 * In this simplified commit, we don't really care about other trees
+	 * like chunk and root tree, as they won't affect qgroup.
+	 * And we don't write super to avoid half committed status.
+	 */
+	ret = commit_cowonly_roots(trans, src);
+	if (ret)
+		goto out;
+	switch_commit_roots(trans->transaction, fs_info);
+	ret = btrfs_write_and_wait_transaction(trans, src);
+	if (ret)
+		btrfs_std_error(fs_info, ret,
+				"Error while writing out transaction for qgroup");
+
+out:
+	mutex_unlock(&fs_info->tree_log_mutex);
+
+	/*
+	 * Force parent root to be updated, as we recorded it before so its
+	 * last_trans == cur_transid.
+	 * Or it won't be committed again onto disk after later
+	 * insert_dir_item()
+	 */
+	if (!ret)
+		record_root_in_trans(trans, parent, 1);
+	return ret;
+}
+
+/*
  * new snapshots need to be created at a very specific time in the
  * transaction commit.  This does the actual creation.
  *
@@ -1383,7 +1458,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	dentry = pending->dentry;
 	parent_inode = pending->dir;
 	parent_root = BTRFS_I(parent_inode)->root;
-	record_root_in_trans(trans, parent_root);
+	record_root_in_trans(trans, parent_root, 0);

 	cur_time = current_fs_time(parent_inode->i_sb);
@@ -1420,7 +1495,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}

-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
 	btrfs_check_and_init_root_item(new_root_item);
@@ -1516,6 +1591,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}

+	/*
+	 * Do special qgroup accounting for snapshot, as we do some qgroup
+	 * snapshot hack to do fast snapshot.
+	 * To co-operate with that hack, we do hack again.
+	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
+	 */
+	ret = qgroup_account_snapshot(trans, root, parent_root,
+				      pending->inherit, objectid);
+	if (ret < 0)
+		goto fail;
+
 	ret = btrfs_insert_dir_item(trans, parent_root,
 				    dentry->d_name.name, dentry->d_name.len,
 				    parent_inode, &key,
@@ -1559,23 +1645,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}

-	/*
-	 * account qgroup counters before qgroup_inherit()
-	 */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_inherit(trans, fs_info,
-				   root->root_key.objectid,
-				   objectid, pending->inherit);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed: