diff mbox series

ocfs2: Fix DIO failure due to insufficient transaction credits

Message ID 20240614145243.8837-1-jack@suse.cz (mailing list archive)
State New
Headers show
Series ocfs2: Fix DIO failure due to insufficient transaction credits | expand

Commit Message

Jan Kara June 14, 2024, 2:52 p.m. UTC
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits(). This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents. Extent tree manipulations do often
extend the current transaction but not in all of the cases. For example
if we have only single block extents in the tree,
ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits
if the IO contains many single block extents. Make sure the transaction
always has enough credits for one extent insert before each call of
ocfs2_mark_extent_written().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/aops.c        |  5 +++++
 fs/ocfs2/journal.c     | 17 +++++++++++++++++
 fs/ocfs2/journal.h     |  2 ++
 fs/ocfs2/ocfs2_trace.h |  2 ++
 4 files changed, 26 insertions(+)

Comments

heming.zhao@suse.com June 15, 2024, 1:21 a.m. UTC | #1
On 6/14/24 22:52, Jan Kara wrote:
> The code in ocfs2_dio_end_io_write() estimates number of necessary
> transaction credits using ocfs2_calc_extend_credits(). This however does
> not take into account that the IO could be arbitrarily large and can
> contain arbitrary number of extents. Extent tree manipulations do often
> extend the current transaction but not in all of the cases. For example
> if we have only single block extents in the tree,
> ocfs2_mark_extent_written() will end up calling
> ocfs2_replace_extent_rec() all the time and we will never extend the
> current transaction and eventually exhaust all the transaction credits
> if the IO contains many single block extents. Make sure the transaction
> always has enough credits for one extent insert before each call of
> ocfs2_mark_extent_written().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks for Jan's patch, it was used to fix a SUSE customer issue.

Reviewed-by: Heming Zhao <heming.zhao@suse.com>

> ---
>   fs/ocfs2/aops.c        |  5 +++++
>   fs/ocfs2/journal.c     | 17 +++++++++++++++++
>   fs/ocfs2/journal.h     |  2 ++
>   fs/ocfs2/ocfs2_trace.h |  2 ++
>   4 files changed, 26 insertions(+)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index ba790219d528..f4d7c0675651 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2371,6 +2371,11 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>   	}
>   
>   	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> +		ret = ocfs2_assure_trans_credits(handle, credits);
> +		if (ret < 0) {
> +			mlog_errno(ret);
> +			break;
> +		}
>   		ret = ocfs2_mark_extent_written(inode, &et, handle,
>   						ue->ue_cpos, 1,
>   						ue->ue_phys,
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 604fea3a26ff..4866d7d3ac57 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -445,6 +445,23 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
>   	return status;
>   }
>   
> +/*
> + * Make sure handle has at least 'nblocks' credits available. If it does not
> + * have that many credits available, we will try to extend the handle to have
> + * enough credits. If that fails, we will restart transaction to have enough
> + * credits. Similar notes regarding data consistency and locking implications
> + * as for ocfs2_extend_trans() apply here.
> + */
> +int ocfs2_assure_trans_credits(handle_t *handle, int nblocks)
> +{
> +	int old_nblks = jbd2_handle_buffer_credits(handle);
> +
> +	trace_ocfs2_assure_trans_credits(old_nblks);
> +	if (old_nblks >= nblocks)
> +		return 0;
> +	return ocfs2_extend_trans(handle, nblocks - old_nblks);
> +}
> +
>   /*
>    * If we have fewer than thresh credits, extend by OCFS2_MAX_TRANS_DATA.
>    * If that fails, restart the transaction & regain write access for the
> diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
> index 41c9fe7e62f9..e3c3a35dc5e0 100644
> --- a/fs/ocfs2/journal.h
> +++ b/fs/ocfs2/journal.h
> @@ -243,6 +243,8 @@ handle_t		    *ocfs2_start_trans(struct ocfs2_super *osb,
>   int			     ocfs2_commit_trans(struct ocfs2_super *osb,
>   						handle_t *handle);
>   int			     ocfs2_extend_trans(handle_t *handle, int nblocks);
> +int			     ocfs2_assure_trans_credits(handle_t *handle,
> +						int nblocks);
>   int			     ocfs2_allocate_extend_trans(handle_t *handle,
>   						int thresh);
>   
> diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
> index ac4fd1d5b128..6c3f4d7df7d6 100644
> --- a/fs/ocfs2/ocfs2_trace.h
> +++ b/fs/ocfs2/ocfs2_trace.h
> @@ -2579,6 +2579,8 @@ DEFINE_OCFS2_ULL_UINT_EVENT(ocfs2_commit_cache_end);
>   
>   DEFINE_OCFS2_INT_INT_EVENT(ocfs2_extend_trans);
>   
> +DEFINE_OCFS2_INT_EVENT(ocfs2_assure_trans_credits);
> +
>   DEFINE_OCFS2_INT_EVENT(ocfs2_extend_trans_restart);
>   
>   DEFINE_OCFS2_INT_INT_EVENT(ocfs2_allocate_extend_trans);
Joseph Qi June 15, 2024, 1:56 a.m. UTC | #2
On 6/14/24 10:52 PM, Jan Kara wrote:
> The code in ocfs2_dio_end_io_write() estimates number of necessary
> transaction credits using ocfs2_calc_extend_credits(). This however does
> not take into account that the IO could be arbitrarily large and can
> contain arbitrary number of extents. Extent tree manipulations do often
> extend the current transaction but not in all of the cases. For example
> if we have only single block extents in the tree,
> ocfs2_mark_extent_written() will end up calling
> ocfs2_replace_extent_rec() all the time and we will never extend the
> current transaction and eventually exhaust all the transaction credits
> if the IO contains many single block extents. Make sure the transaction
> always has enough credits for one extent insert before each call of
> ocfs2_mark_extent_written().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/aops.c        |  5 +++++
>  fs/ocfs2/journal.c     | 17 +++++++++++++++++
>  fs/ocfs2/journal.h     |  2 ++
>  fs/ocfs2/ocfs2_trace.h |  2 ++
>  4 files changed, 26 insertions(+)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index ba790219d528..f4d7c0675651 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2371,6 +2371,11 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>  	}
>  
>  	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> +		ret = ocfs2_assure_trans_credits(handle, credits);
> +		if (ret < 0) {
> +			mlog_errno(ret);
> +			break;
> +		}
>  		ret = ocfs2_mark_extent_written(inode, &et, handle,
>  						ue->ue_cpos, 1,
>  						ue->ue_phys,
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 604fea3a26ff..4866d7d3ac57 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -445,6 +445,23 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
>  	return status;
>  }
>  
> +/*
> + * Make sure handle has at least 'nblocks' credits available. If it does not
> + * have that many credits available, we will try to extend the handle to have
> + * enough credits. If that fails, we will restart transaction to have enough
> + * credits. Similar notes regarding data consistency and locking implications
> + * as for ocfs2_extend_trans() apply here.
> + */
> +int ocfs2_assure_trans_credits(handle_t *handle, int nblocks)
> +{
> +	int old_nblks = jbd2_handle_buffer_credits(handle);
> +
> +	trace_ocfs2_assure_trans_credits(old_nblks);
> +	if (old_nblks >= nblocks)
> +		return 0;
> +	return ocfs2_extend_trans(handle, nblocks - old_nblks);
> +}
> +
>  /*
>   * If we have fewer than thresh credits, extend by OCFS2_MAX_TRANS_DATA.
>   * If that fails, restart the transaction & regain write access for the
> diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
> index 41c9fe7e62f9..e3c3a35dc5e0 100644
> --- a/fs/ocfs2/journal.h
> +++ b/fs/ocfs2/journal.h
> @@ -243,6 +243,8 @@ handle_t		    *ocfs2_start_trans(struct ocfs2_super *osb,
>  int			     ocfs2_commit_trans(struct ocfs2_super *osb,
>  						handle_t *handle);
>  int			     ocfs2_extend_trans(handle_t *handle, int nblocks);
> +int			     ocfs2_assure_trans_credits(handle_t *handle,
> +						int nblocks);
>  int			     ocfs2_allocate_extend_trans(handle_t *handle,
>  						int thresh);
>  
> diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
> index ac4fd1d5b128..6c3f4d7df7d6 100644
> --- a/fs/ocfs2/ocfs2_trace.h
> +++ b/fs/ocfs2/ocfs2_trace.h
> @@ -2579,6 +2579,8 @@ DEFINE_OCFS2_ULL_UINT_EVENT(ocfs2_commit_cache_end);
>  
>  DEFINE_OCFS2_INT_INT_EVENT(ocfs2_extend_trans);
>  
> +DEFINE_OCFS2_INT_EVENT(ocfs2_assure_trans_credits);
> +
>  DEFINE_OCFS2_INT_EVENT(ocfs2_extend_trans_restart);
>  
>  DEFINE_OCFS2_INT_INT_EVENT(ocfs2_allocate_extend_trans);
Andrew Morton June 15, 2024, 2:54 a.m. UTC | #3
On Sat, 15 Jun 2024 09:21:51 +0800 Heming Zhao <heming.zhao@suse.com> wrote:

> On 6/14/24 22:52, Jan Kara wrote:
> > The code in ocfs2_dio_end_io_write() estimates number of necessary
> > transaction credits using ocfs2_calc_extend_credits(). This however does
> > not take into account that the IO could be arbitrarily large and can
> > contain arbitrary number of extents. Extent tree manipulations do often
> > extend the current transaction but not in all of the cases. For example
> > if we have only single block extents in the tree,
> > ocfs2_mark_extent_written() will end up calling
> > ocfs2_replace_extent_rec() all the time and we will never extend the
> > current transaction and eventually exhaust all the transaction credits
> > if the IO contains many single block extents. Make sure the transaction
> > always has enough credits for one extent insert before each call of
> > ocfs2_mark_extent_written().
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Thanks for Jan's patch, it was used to fix a SUSE customer issue.

That's useful info.  Could we please have, for the changelog, a
description of that issue?  Amongst other reasons, so that someone else
who is dealing with a customer issue can figure out whether this patch
will address it.

I assume a cc:stable is appropriate.

Can we please identify a suitable Fixes: target, or has this been there
for ever?
heming.zhao@suse.com June 15, 2024, 3:11 a.m. UTC | #4
On 6/15/24 10:54, Andrew Morton wrote:
> On Sat, 15 Jun 2024 09:21:51 +0800 Heming Zhao <heming.zhao@suse.com> wrote:
> 
>> On 6/14/24 22:52, Jan Kara wrote:
>>> The code in ocfs2_dio_end_io_write() estimates number of necessary
>>> transaction credits using ocfs2_calc_extend_credits(). This however does
>>> not take into account that the IO could be arbitrarily large and can
>>> contain arbitrary number of extents. Extent tree manipulations do often
>>> extend the current transaction but not in all of the cases. For example
>>> if we have only single block extents in the tree,
>>> ocfs2_mark_extent_written() will end up calling
>>> ocfs2_replace_extent_rec() all the time and we will never extend the
>>> current transaction and eventually exhaust all the transaction credits
>>> if the IO contains many single block extents. Make sure the transaction
>>> always has enough credits for one extent insert before each call of
>>> ocfs2_mark_extent_written().
>>>
>>> Signed-off-by: Jan Kara <jack@suse.cz>
>>
>> Thanks for Jan's patch, it was used to fix a SUSE customer issue.
> 
> That's useful info.  Could we please have, for the changelog, a
> description of that issue?  Amongst other reasons, so that someone else
> who is dealing with a customer issue can figure out whether this patch
> will address it.
> 

See blow crash call stack, I removed the customer-related information.

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Thanks,
Jan Kara June 17, 2024, 9:44 a.m. UTC | #5
On Fri 14-06-24 19:54:27, Andrew Morton wrote:
> On Sat, 15 Jun 2024 09:21:51 +0800 Heming Zhao <heming.zhao@suse.com> wrote:
> 
> > On 6/14/24 22:52, Jan Kara wrote:
> > > The code in ocfs2_dio_end_io_write() estimates number of necessary
> > > transaction credits using ocfs2_calc_extend_credits(). This however does
> > > not take into account that the IO could be arbitrarily large and can
> > > contain arbitrary number of extents. Extent tree manipulations do often
> > > extend the current transaction but not in all of the cases. For example
> > > if we have only single block extents in the tree,
> > > ocfs2_mark_extent_written() will end up calling
> > > ocfs2_replace_extent_rec() all the time and we will never extend the
> > > current transaction and eventually exhaust all the transaction credits
> > > if the IO contains many single block extents. Make sure the transaction
> > > always has enough credits for one extent insert before each call of
> > > ocfs2_mark_extent_written().
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > 
> > Thanks for Jan's patch, it was used to fix a SUSE customer issue.
> 
> That's useful info.  Could we please have, for the changelog, a
> description of that issue?  Amongst other reasons, so that someone else
> who is dealing with a customer issue can figure out whether this patch
> will address it.

Sure, I'll update the changelog.
 
> I assume a cc:stable is appropriate.

Yes.

> Can we please identify a suitable Fixes: target, or has this been there
> for ever?

The problem seems to be there since the OCFS2 DIO rewrite some 8 years ago.
I'll add Fixes tag.

								Honza
diff mbox series

Patch

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index ba790219d528..f4d7c0675651 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2371,6 +2371,11 @@  static int ocfs2_dio_end_io_write(struct inode *inode,
 	}
 
 	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
+		ret = ocfs2_assure_trans_credits(handle, credits);
+		if (ret < 0) {
+			mlog_errno(ret);
+			break;
+		}
 		ret = ocfs2_mark_extent_written(inode, &et, handle,
 						ue->ue_cpos, 1,
 						ue->ue_phys,
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 604fea3a26ff..4866d7d3ac57 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -445,6 +445,23 @@  int ocfs2_extend_trans(handle_t *handle, int nblocks)
 	return status;
 }
 
+/*
+ * Make sure handle has at least 'nblocks' credits available. If it does not
+ * have that many credits available, we will try to extend the handle to have
+ * enough credits. If that fails, we will restart transaction to have enough
+ * credits. Similar notes regarding data consistency and locking implications
+ * as for ocfs2_extend_trans() apply here.
+ */
+int ocfs2_assure_trans_credits(handle_t *handle, int nblocks)
+{
+	int old_nblks = jbd2_handle_buffer_credits(handle);
+
+	trace_ocfs2_assure_trans_credits(old_nblks);
+	if (old_nblks >= nblocks)
+		return 0;
+	return ocfs2_extend_trans(handle, nblocks - old_nblks);
+}
+
 /*
  * If we have fewer than thresh credits, extend by OCFS2_MAX_TRANS_DATA.
  * If that fails, restart the transaction & regain write access for the
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index 41c9fe7e62f9..e3c3a35dc5e0 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -243,6 +243,8 @@  handle_t		    *ocfs2_start_trans(struct ocfs2_super *osb,
 int			     ocfs2_commit_trans(struct ocfs2_super *osb,
 						handle_t *handle);
 int			     ocfs2_extend_trans(handle_t *handle, int nblocks);
+int			     ocfs2_assure_trans_credits(handle_t *handle,
+						int nblocks);
 int			     ocfs2_allocate_extend_trans(handle_t *handle,
 						int thresh);
 
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index ac4fd1d5b128..6c3f4d7df7d6 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -2579,6 +2579,8 @@  DEFINE_OCFS2_ULL_UINT_EVENT(ocfs2_commit_cache_end);
 
 DEFINE_OCFS2_INT_INT_EVENT(ocfs2_extend_trans);
 
+DEFINE_OCFS2_INT_EVENT(ocfs2_assure_trans_credits);
+
 DEFINE_OCFS2_INT_EVENT(ocfs2_extend_trans_restart);
 
 DEFINE_OCFS2_INT_INT_EVENT(ocfs2_allocate_extend_trans);