Message ID | 20230421083653.20380-1-heming.zhao@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ocfs2: fix missing reset j_num_trans for sync | expand |
Sorry, please pause this patch review. When I was investigating fstest generic failed case 347 361, I found the wake_up() action should move out the 'if()' area. The correct way is calling wake_up() unconditionally. Thanks, Heming On 4/21/23 4:36 PM, Heming Zhao wrote: > fstest generic case 266 272 281 trigger hanging issue when umount. > > I use 266 to describe the root cause. > > ``` > 49 _dmerror_unmount > 50 _dmerror_mount > 51 > 52 echo "Compare files" > 53 md5sum $testdir/file1 | _filter_scratch > 54 md5sum $testdir/file2 | _filter_scratch > 55 > 56 echo "CoW and unmount" > 57 sync > 58 _dmerror_load_error_table > 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ > 60 -c "fdatasync" $testdir/file2 2>&1) > 61 echo $urk >> $seqres.full > 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" > 63 > 64 echo "Clean up the mess" > 65 _dmerror_unmount > ``` > > After line 49 50 umount & mount ocfs2 dev, this case run md5sum to > verify target file. line 57 run 'sync' before line 58 changes the dm > target from dm-linear to dm-error. this case is hanging at line 65. > > In 266, md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to > do journal job. but there is only ->j_num_trans+1 in ocfs2_start_trans, > the ocfs2_commit_trans doesn't do reduction operation. > > call flow: > ``` > [md5sum] > vfs_read > ocfs2_file_read_iter > ocfs2_inode_lock_atime > ocfs2_update_inode_atime > + ocfs2_start_trans //atomic_inc j_num_trans > + ... > + ocfs2_commit_trans//no modify j_num_trans > > sync //no modify j_num_trans > > _dmerror_load_error_table //all write will return error after this line > > _dmerror_unmount //found j_num_trans is not zero, run commit thread > //but the underlying dev is dm-error, journaling IO > //failed all the time and keep going to retry. > ``` > > *** How to fix *** > > kick commit thread in sync path, which can reset j_num_trans to 0. > > Signed-off-by: Heming Zhao <heming.zhao@suse.com> > --- > fs/ocfs2/super.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c > index 0b0e6a132101..9929320bd967 100644 > --- a/fs/ocfs2/super.c > +++ b/fs/ocfs2/super.c > @@ -408,6 +408,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) > > if (jbd2_journal_start_commit(osb->journal->j_journal, > &target)) { > + /* kick commit thread to reset journal->j_num_trans */ > + wake_up(&osb->checkpoint_event); > + > if (wait) > jbd2_log_wait_commit(osb->journal->j_journal, > target); >
On 4/22/23 4:55 PM, Heming Zhao via Ocfs2-devel wrote: > Sorry, please pause this patch review. I will send a new patch to supersede this one. - Heming > When I was investigating fstest generic failed case 347 361, I found > the wake_up() action should move out the 'if()' area. The correct way > is calling wake_up() unconditionally. > > Thanks, > Heming > > On 4/21/23 4:36 PM, Heming Zhao wrote: >> fstest generic case 266 272 281 trigger hanging issue when umount. >> >> I use 266 to describe the root cause. >> >> ``` >> 49 _dmerror_unmount >> 50 _dmerror_mount >> 51 >> 52 echo "Compare files" >> 53 md5sum $testdir/file1 | _filter_scratch >> 54 md5sum $testdir/file2 | _filter_scratch >> 55 >> 56 echo "CoW and unmount" >> 57 sync >> 58 _dmerror_load_error_table >> 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ >> 60 -c "fdatasync" $testdir/file2 2>&1) >> 61 echo $urk >> $seqres.full >> 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" >> 63 >> 64 echo "Clean up the mess" >> 65 _dmerror_unmount >> ``` >> >> After line 49 50 umount & mount ocfs2 dev, this case run md5sum to >> verify target file. line 57 run 'sync' before line 58 changes the dm >> target from dm-linear to dm-error. this case is hanging at line 65. >> >> In 266, md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to >> do journal job. but there is only ->j_num_trans+1 in ocfs2_start_trans, >> the ocfs2_commit_trans doesn't do reduction operation. >> >> call flow: >> ``` >> [md5sum] >> vfs_read >> ocfs2_file_read_iter >> ocfs2_inode_lock_atime >> ocfs2_update_inode_atime >> + ocfs2_start_trans //atomic_inc j_num_trans >> + ... >> + ocfs2_commit_trans//no modify j_num_trans >> >> sync //no modify j_num_trans >> >> _dmerror_load_error_table //all write will return error after this line >> >> _dmerror_unmount //found j_num_trans is not zero, run commit thread >> //but the underlying dev is dm-error, journaling IO >> //failed all the time and keep going to retry. >> ``` >> >> *** How to fix *** >> >> kick commit thread in sync path, which can reset j_num_trans to 0. >> >> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >> --- >> fs/ocfs2/super.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c >> index 0b0e6a132101..9929320bd967 100644 >> --- a/fs/ocfs2/super.c >> +++ b/fs/ocfs2/super.c >> @@ -408,6 +408,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) >> if (jbd2_journal_start_commit(osb->journal->j_journal, >> &target)) { >> + /* kick commit thread to reset journal->j_num_trans */ >> + wake_up(&osb->checkpoint_event); >> + >> if (wait) >> jbd2_log_wait_commit(osb->journal->j_journal, >> target); >> > > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 0b0e6a132101..9929320bd967 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -408,6 +408,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + /* kick commit thread to reset journal->j_num_trans */ + wake_up(&osb->checkpoint_event); + if (wait) jbd2_log_wait_commit(osb->journal->j_journal, target);
fstest generic case 266 272 281 trigger hanging issue when umount. I use 266 to describe the root cause. ``` 49 _dmerror_unmount 50 _dmerror_mount 51 52 echo "Compare files" 53 md5sum $testdir/file1 | _filter_scratch 54 md5sum $testdir/file2 | _filter_scratch 55 56 echo "CoW and unmount" 57 sync 58 _dmerror_load_error_table 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ 60 -c "fdatasync" $testdir/file2 2>&1) 61 echo $urk >> $seqres.full 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" 63 64 echo "Clean up the mess" 65 _dmerror_unmount ``` After line 49 50 umount & mount ocfs2 dev, this case run md5sum to verify target file. line 57 run 'sync' before line 58 changes the dm target from dm-linear to dm-error. this case is hanging at line 65. In 266, md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to do journal job. but there is only ->j_num_trans+1 in ocfs2_start_trans, the ocfs2_commit_trans doesn't do reduction operation. call flow: ``` [md5sum] vfs_read ocfs2_file_read_iter ocfs2_inode_lock_atime ocfs2_update_inode_atime + ocfs2_start_trans //atomic_inc j_num_trans + ... + ocfs2_commit_trans//no modify j_num_trans sync //no modify j_num_trans _dmerror_load_error_table //all write will return error after this line _dmerror_unmount //found j_num_trans is not zero, run commit thread //but the underlying dev is dm-error, journaling IO //failed all the time and keep going to retry. ``` *** How to fix *** kick commit thread in sync path, which can reset j_num_trans to 0. Signed-off-by: Heming Zhao <heming.zhao@suse.com> --- fs/ocfs2/super.c | 3 +++ 1 file changed, 3 insertions(+)