diff mbox

[2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

Message ID 51249468.6010004@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Miao Xie Feb. 20, 2013, 9:16 a.m. UTC
Now btrfs_commit_transaction() does this

ret = btrfs_run_ordered_operations(root, 0)

which async flushes all inodes on the ordered operations list, it introduced
a deadlock that transaction-start task, transaction-commit task and the flush
workers waited for each other.
(See the following URL to get the detail
 http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

As we know, if ->in_commit is set, it means someone is committing the
current transaction, we should not try to join it if we are not JOIN
or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
the above problem. In this way, there is another benefit: there is no new
transaction handle to block the transaction which is on the way of commit,
once we set ->in_commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

Comments

Alex Lyakas Feb. 24, 2013, 7:49 p.m. UTC | #1
Hi Miao,
can you please explain your solution a bit more.

On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> Now btrfs_commit_transaction() does this
>
> ret = btrfs_run_ordered_operations(root, 0)
>
> which async flushes all inodes on the ordered operations list, it introduced
> a deadlock that transaction-start task, transaction-commit task and the flush
> workers waited for each other.
> (See the following URL to get the detail
>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>
> As we know, if ->in_commit is set, it means someone is committing the
> current transaction, we should not try to join it if we are not JOIN
> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
> the above problem. In this way, there is another benefit: there is no new
> transaction handle to block the transaction which is on the way of commit,
> once we set ->in_commit.
>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>  1 files changed, 16 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index bc2f2d1..71b7e2e 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>         root->commit_root = btrfs_root_node(root);
>  }
>
> +static inline int can_join_transaction(struct btrfs_transaction *trans,
> +                                      int type)
> +{
> +       return !(trans->in_commit &&
> +                type != TRANS_JOIN &&
> +                type != TRANS_JOIN_NOLOCK);
> +}
> +
>  /*
>   * either allocate a new transaction or hop into the existing one
>   */
> @@ -86,6 +94,10 @@ loop:
>                         spin_unlock(&fs_info->trans_lock);
>                         return cur_trans->aborted;
>                 }
> +               if (!can_join_transaction(cur_trans, type)) {
> +                       spin_unlock(&fs_info->trans_lock);
> +                       return -EBUSY;
> +               }
>                 atomic_inc(&cur_trans->use_count);
>                 atomic_inc(&cur_trans->num_writers);
>                 cur_trans->num_joined++;
> @@ -360,8 +372,11 @@ again:
>
>         do {
>                 ret = join_transaction(root, type);
> -               if (ret == -EBUSY)
> +               if (ret == -EBUSY) {
>                         wait_current_trans(root);
> +                       if (unlikely(type == TRANS_ATTACH))
> +                               ret = -ENOENT;
> +               }

So I understand that instead of incrementing num_writes and joining
the current transaction, you do not join and wait for the current
transaction to unblock.

Which task in Josef's example
http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
task 1, task 2 or task 3 is the one that will not join the
transaction, but instead wait?

Also, I think I don't fully understand Josef's example. What is
preventing from async flushing to complete?
Is task 3 waiting because trans_no_join is set?
Is task 3 the one that actually does the delalloc flush?

Thanks,
Alex.






>         } while (ret == -EBUSY);
>
>         if (ret < 0) {
> --
> 1.6.5.2
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie Feb. 25, 2013, 10:20 a.m. UTC | #2
On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
> Hi Miao,
> can you please explain your solution a bit more.
> 
> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> Now btrfs_commit_transaction() does this
>>
>> ret = btrfs_run_ordered_operations(root, 0)
>>
>> which async flushes all inodes on the ordered operations list, it introduced
>> a deadlock that transaction-start task, transaction-commit task and the flush
>> workers waited for each other.
>> (See the following URL to get the detail
>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>
>> As we know, if ->in_commit is set, it means someone is committing the
>> current transaction, we should not try to join it if we are not JOIN
>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>> the above problem. In this way, there is another benefit: there is no new
>> transaction handle to block the transaction which is on the way of commit,
>> once we set ->in_commit.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index bc2f2d1..71b7e2e 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>         root->commit_root = btrfs_root_node(root);
>>  }
>>
>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>> +                                      int type)
>> +{
>> +       return !(trans->in_commit &&
>> +                type != TRANS_JOIN &&
>> +                type != TRANS_JOIN_NOLOCK);
>> +}
>> +
>>  /*
>>   * either allocate a new transaction or hop into the existing one
>>   */
>> @@ -86,6 +94,10 @@ loop:
>>                         spin_unlock(&fs_info->trans_lock);
>>                         return cur_trans->aborted;
>>                 }
>> +               if (!can_join_transaction(cur_trans, type)) {
>> +                       spin_unlock(&fs_info->trans_lock);
>> +                       return -EBUSY;
>> +               }
>>                 atomic_inc(&cur_trans->use_count);
>>                 atomic_inc(&cur_trans->num_writers);
>>                 cur_trans->num_joined++;
>> @@ -360,8 +372,11 @@ again:
>>
>>         do {
>>                 ret = join_transaction(root, type);
>> -               if (ret == -EBUSY)
>> +               if (ret == -EBUSY) {
>>                         wait_current_trans(root);
>> +                       if (unlikely(type == TRANS_ATTACH))
>> +                               ret = -ENOENT;
>> +               }
> 
> So I understand that instead of incrementing num_writes and joining
> the current transaction, you do not join and wait for the current
> transaction to unblock.

More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
join and just wait for the current transaction to unblock if ->in_commit
is set.

> Which task in Josef's example
> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
> task 1, task 2 or task 3 is the one that will not join the
> transaction, but instead wait?

Task1 will not join the transaction, in this way, async inode flush
won't run, and then task3 won't do anything.

Before applying the patch:
Start/Attach_Trans_Task			Commit_Task			Flush_Worker
(Task1)					(Task2)				(Task3)		-- the name in Josef's example
btrfs_start_transaction()
 |->may_wait_transaction()
 |  (return 0)
 |					btrfs_commit_transaction()
 |					 |->set ->in_commit and
 |					 |  blocked to 1
 |					 |->wait writers to be 1
 |					 |  (writers is 1)
 |->join_transaction()			 |
 |  (writers is 2)			 |
 |->btrfs_commit_transaction()		 |
     |					 |->set trans_no_join to 1
     |					 |  (close join transaction)
     |->btrfs_run_ordered_operations	 |
	(Those ordered operations	 |
	 are added when releasing	 |
	 file)				 |
	 |->async inode flush()		 |
	 |->wait_flush_comlete()	 |
					 |				work_loop()
					 |				 |->run_work()
					 |				     |->btrfs_join_transaction()
					 |					 |->wait_current_trans()
					 |->wait writers to be 1
					    
This three tasks waited for each other.

After applying this patch:
Start/Attach_Trans_Task			Commit_Task			Flush_Worker
(Task1)					(Task2)				(Task3)	
btrfs_start_transaction()
 |->may_wait_transaction()
 |  (return 0)
 |					btrfs_commit_transaction()
 |					 |->set ->in_commit and
 |					 |  blocked to 1
 |					 |->wait writers to be 1
 |					 |  (writers is 1)
 |->join_transaction() fail		 |
 |  (return -EBUSY, writers is still 1)	 |
 |->wait_current_trans()		 |
					 |->set trans_no_join to 1
					 |  (close join transaction)
					 |->wait writers to be 1
					 |->continue committing
									(Task3 does nothing)
> Also, I think I don't fully understand Josef's example. What is
> preventing from async flushing to complete?
> Is task 3 waiting because trans_no_join is set?
> Is task 3 the one that actually does the delalloc flush?

See above.

Thanks
Miao

> 
> Thanks,
> Alex.
> 
> 
> 
> 
> 
> 
>>         } while (ret == -EBUSY);
>>
>>         if (ret < 0) {
>> --
>> 1.6.5.2
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas March 2, 2013, 9:15 p.m. UTC | #3
Hi Miao,
thanks for the great ASCII graphics and detailed explanation!

Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> can you please explain your solution a bit more.
>>
>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> Now btrfs_commit_transaction() does this
>>>
>>> ret = btrfs_run_ordered_operations(root, 0)
>>>
>>> which async flushes all inodes on the ordered operations list, it introduced
>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>> workers waited for each other.
>>> (See the following URL to get the detail
>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>
>>> As we know, if ->in_commit is set, it means someone is committing the
>>> current transaction, we should not try to join it if we are not JOIN
>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>> the above problem. In this way, there is another benefit: there is no new
>>> transaction handle to block the transaction which is on the way of commit,
>>> once we set ->in_commit.
>>>
>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bc2f2d1..71b7e2e 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>         root->commit_root = btrfs_root_node(root);
>>>  }
>>>
>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>> +                                      int type)
>>> +{
>>> +       return !(trans->in_commit &&
>>> +                type != TRANS_JOIN &&
>>> +                type != TRANS_JOIN_NOLOCK);
>>> +}
>>> +
>>>  /*
>>>   * either allocate a new transaction or hop into the existing one
>>>   */
>>> @@ -86,6 +94,10 @@ loop:
>>>                         spin_unlock(&fs_info->trans_lock);
>>>                         return cur_trans->aborted;
>>>                 }
>>> +               if (!can_join_transaction(cur_trans, type)) {
>>> +                       spin_unlock(&fs_info->trans_lock);
>>> +                       return -EBUSY;
>>> +               }
>>>                 atomic_inc(&cur_trans->use_count);
>>>                 atomic_inc(&cur_trans->num_writers);
>>>                 cur_trans->num_joined++;
>>> @@ -360,8 +372,11 @@ again:
>>>
>>>         do {
>>>                 ret = join_transaction(root, type);
>>> -               if (ret == -EBUSY)
>>> +               if (ret == -EBUSY) {
>>>                         wait_current_trans(root);
>>> +                       if (unlikely(type == TRANS_ATTACH))
>>> +                               ret = -ENOENT;
>>> +               }
>>
>> So I understand that instead of incrementing num_writes and joining
>> the current transaction, you do not join and wait for the current
>> transaction to unblock.
>
> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
> join and just wait for the current transaction to unblock if ->in_commit
> is set.
>
>> Which task in Josef's example
>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>> task 1, task 2 or task 3 is the one that will not join the
>> transaction, but instead wait?
>
> Task1 will not join the transaction, in this way, async inode flush
> won't run, and then task3 won't do anything.
>
> Before applying the patch:
> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction()                   |
>  |  (writers is 2)                       |
>  |->btrfs_commit_transaction()           |
>      |                                   |->set trans_no_join to 1
>      |                                   |  (close join transaction)
>      |->btrfs_run_ordered_operations     |
>         (Those ordered operations        |
>          are added when releasing        |
>          file)                           |
>          |->async inode flush()          |
>          |->wait_flush_comlete()         |
>                                          |                              work_loop()
>                                          |                               |->run_work()
>                                          |                                   |->btrfs_join_transaction()
>                                          |                                       |->wait_current_trans()
>                                          |->wait writers to be 1
>
> This three tasks waited for each other.
>
> After applying this patch:
> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
> (Task1)                                 (Task2)                         (Task3)
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction() fail              |
>  |  (return -EBUSY, writers is still 1)  |
>  |->wait_current_trans()                 |
>                                          |->set trans_no_join to 1
>                                          |  (close join transaction)
>                                          |->wait writers to be 1
>                                          |->continue committing
>                                                                         (Task3 does nothing)
>> Also, I think I don't fully understand Josef's example. What is
>> preventing from async flushing to complete?
>> Is task 3 waiting because trans_no_join is set?
>> Is task 3 the one that actually does the delalloc flush?
>
> See above.
>
> Thanks
> Miao
>
>>
>> Thanks,
>> Alex.
>>
>>
>>
>>
>>
>>
>>>         } while (ret == -EBUSY);
>>>
>>>         if (ret < 0) {
>>> --
>>> 1.6.5.2
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas March 24, 2013, 11:13 a.m. UTC | #4
Hi Miao,
I am seeing another issue. Your fix prevents from TRANS_START to get
in the way of a committing transaction. But it does not prevent from
TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
following loop:

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
		 (should_grow && cur_trans->num_joined != joined));

What I see is basically that new writers join the transaction, while
btrfs_commit_transaction() does this loop. I see
cur_trans->num_writers decreasing, but then it increases, then
decreases etc. This can go for several seconds during heavy IO load.
There is nothing to prevent new TRANS_JOIN writers coming and joining
a transaction over and over, thus delaying transaction commit. The IO
path uses TRANS_JOIN; for example run_delalloc_nocow() does that.

Do you observe such behavior? Do you believe it's problematic?

Thanks,
Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> can you please explain your solution a bit more.
>>
>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> Now btrfs_commit_transaction() does this
>>>
>>> ret = btrfs_run_ordered_operations(root, 0)
>>>
>>> which async flushes all inodes on the ordered operations list, it introduced
>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>> workers waited for each other.
>>> (See the following URL to get the detail
>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>
>>> As we know, if ->in_commit is set, it means someone is committing the
>>> current transaction, we should not try to join it if we are not JOIN
>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>> the above problem. In this way, there is another benefit: there is no new
>>> transaction handle to block the transaction which is on the way of commit,
>>> once we set ->in_commit.
>>>
>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bc2f2d1..71b7e2e 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>         root->commit_root = btrfs_root_node(root);
>>>  }
>>>
>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>> +                                      int type)
>>> +{
>>> +       return !(trans->in_commit &&
>>> +                type != TRANS_JOIN &&
>>> +                type != TRANS_JOIN_NOLOCK);
>>> +}
>>> +
>>>  /*
>>>   * either allocate a new transaction or hop into the existing one
>>>   */
>>> @@ -86,6 +94,10 @@ loop:
>>>                         spin_unlock(&fs_info->trans_lock);
>>>                         return cur_trans->aborted;
>>>                 }
>>> +               if (!can_join_transaction(cur_trans, type)) {
>>> +                       spin_unlock(&fs_info->trans_lock);
>>> +                       return -EBUSY;
>>> +               }
>>>                 atomic_inc(&cur_trans->use_count);
>>>                 atomic_inc(&cur_trans->num_writers);
>>>                 cur_trans->num_joined++;
>>> @@ -360,8 +372,11 @@ again:
>>>
>>>         do {
>>>                 ret = join_transaction(root, type);
>>> -               if (ret == -EBUSY)
>>> +               if (ret == -EBUSY) {
>>>                         wait_current_trans(root);
>>> +                       if (unlikely(type == TRANS_ATTACH))
>>> +                               ret = -ENOENT;
>>> +               }
>>
>> So I understand that instead of incrementing num_writes and joining
>> the current transaction, you do not join and wait for the current
>> transaction to unblock.
>
> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
> join and just wait for the current transaction to unblock if ->in_commit
> is set.
>
>> Which task in Josef's example
>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>> task 1, task 2 or task 3 is the one that will not join the
>> transaction, but instead wait?
>
> Task1 will not join the transaction, in this way, async inode flush
> won't run, and then task3 won't do anything.
>
> Before applying the patch:
> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction()                   |
>  |  (writers is 2)                       |
>  |->btrfs_commit_transaction()           |
>      |                                   |->set trans_no_join to 1
>      |                                   |  (close join transaction)
>      |->btrfs_run_ordered_operations     |
>         (Those ordered operations        |
>          are added when releasing        |
>          file)                           |
>          |->async inode flush()          |
>          |->wait_flush_comlete()         |
>                                          |                              work_loop()
>                                          |                               |->run_work()
>                                          |                                   |->btrfs_join_transaction()
>                                          |                                       |->wait_current_trans()
>                                          |->wait writers to be 1
>
> This three tasks waited for each other.
>
> After applying this patch:
> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
> (Task1)                                 (Task2)                         (Task3)
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction() fail              |
>  |  (return -EBUSY, writers is still 1)  |
>  |->wait_current_trans()                 |
>                                          |->set trans_no_join to 1
>                                          |  (close join transaction)
>                                          |->wait writers to be 1
>                                          |->continue committing
>                                                                         (Task3 does nothing)
>> Also, I think I don't fully understand Josef's example. What is
>> preventing from async flushing to complete?
>> Is task 3 waiting because trans_no_join is set?
>> Is task 3 the one that actually does the delalloc flush?
>
> See above.
>
> Thanks
> Miao
>
>>
>> Thanks,
>> Alex.
>>
>>
>>
>>
>>
>>
>>>         } while (ret == -EBUSY);
>>>
>>>         if (ret < 0) {
>>> --
>>> 1.6.5.2
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie March 25, 2013, 1:51 a.m. UTC | #5
On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
> Hi Miao,
> I am seeing another issue. Your fix prevents from TRANS_START to get
> in the way of a committing transaction. But it does not prevent from
> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
> following loop:
> 
> do {
>     // attempt to do some useful stuff and/or sleep
> } while (atomic_read(&cur_trans->num_writers) > 1 ||
> 		 (should_grow && cur_trans->num_joined != joined));
> 
> What I see is basically that new writers join the transaction, while
> btrfs_commit_transaction() does this loop. I see
> cur_trans->num_writers decreasing, but then it increases, then
> decreases etc. This can go for several seconds during heavy IO load.
> There is nothing to prevent new TRANS_JOIN writers coming and joining
> a transaction over and over, thus delaying transaction commit. The IO
> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
> 
> Do you observe such behavior? Do you believe it's problematic?

I know this behavior, there is no problem with it, the latter code
will prevent from TRANS_JOIN.

1672         spin_lock(&root->fs_info->trans_lock);
1673         root->fs_info->trans_no_join = 1;
1674         spin_unlock(&root->fs_info->trans_lock);
1675         wait_event(cur_trans->writer_wait,
1676                    atomic_read(&cur_trans->num_writers) == 1);

And if we block the TRANS_JOIN at the place you point out, the deadlock
will happen because we need deal with the ordered operations which will
use TRANS_JOIN here.

(I am dealing with the problem you said above by adding a new type of
TRANS_* now)

Thanks
Miao

> Thanks,
> Alex.
> 
> 
> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>> Hi Miao,
>>> can you please explain your solution a bit more.
>>>
>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>> Now btrfs_commit_transaction() does this
>>>>
>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>
>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>> workers waited for each other.
>>>> (See the following URL to get the detail
>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>
>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>> current transaction, we should not try to join it if we are not JOIN
>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>> the above problem. In this way, there is another benefit: there is no new
>>>> transaction handle to block the transaction which is on the way of commit,
>>>> once we set ->in_commit.
>>>>
>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>> ---
>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>> index bc2f2d1..71b7e2e 100644
>>>> --- a/fs/btrfs/transaction.c
>>>> +++ b/fs/btrfs/transaction.c
>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>         root->commit_root = btrfs_root_node(root);
>>>>  }
>>>>
>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>> +                                      int type)
>>>> +{
>>>> +       return !(trans->in_commit &&
>>>> +                type != TRANS_JOIN &&
>>>> +                type != TRANS_JOIN_NOLOCK);
>>>> +}
>>>> +
>>>>  /*
>>>>   * either allocate a new transaction or hop into the existing one
>>>>   */
>>>> @@ -86,6 +94,10 @@ loop:
>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>                         return cur_trans->aborted;
>>>>                 }
>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>> +                       return -EBUSY;
>>>> +               }
>>>>                 atomic_inc(&cur_trans->use_count);
>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>                 cur_trans->num_joined++;
>>>> @@ -360,8 +372,11 @@ again:
>>>>
>>>>         do {
>>>>                 ret = join_transaction(root, type);
>>>> -               if (ret == -EBUSY)
>>>> +               if (ret == -EBUSY) {
>>>>                         wait_current_trans(root);
>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>> +                               ret = -ENOENT;
>>>> +               }
>>>
>>> So I understand that instead of incrementing num_writes and joining
>>> the current transaction, you do not join and wait for the current
>>> transaction to unblock.
>>
>> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
>> join and just wait for the current transaction to unblock if ->in_commit
>> is set.
>>
>>> Which task in Josef's example
>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>> task 1, task 2 or task 3 is the one that will not join the
>>> transaction, but instead wait?
>>
>> Task1 will not join the transaction, in this way, async inode flush
>> won't run, and then task3 won't do anything.
>>
>> Before applying the patch:
>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>> btrfs_start_transaction()
>>  |->may_wait_transaction()
>>  |  (return 0)
>>  |                                      btrfs_commit_transaction()
>>  |                                       |->set ->in_commit and
>>  |                                       |  blocked to 1
>>  |                                       |->wait writers to be 1
>>  |                                       |  (writers is 1)
>>  |->join_transaction()                   |
>>  |  (writers is 2)                       |
>>  |->btrfs_commit_transaction()           |
>>      |                                   |->set trans_no_join to 1
>>      |                                   |  (close join transaction)
>>      |->btrfs_run_ordered_operations     |
>>         (Those ordered operations        |
>>          are added when releasing        |
>>          file)                           |
>>          |->async inode flush()          |
>>          |->wait_flush_comlete()         |
>>                                          |                              work_loop()
>>                                          |                               |->run_work()
>>                                          |                                   |->btrfs_join_transaction()
>>                                          |                                       |->wait_current_trans()
>>                                          |->wait writers to be 1
>>
>> This three tasks waited for each other.
>>
>> After applying this patch:
>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>> (Task1)                                 (Task2)                         (Task3)
>> btrfs_start_transaction()
>>  |->may_wait_transaction()
>>  |  (return 0)
>>  |                                      btrfs_commit_transaction()
>>  |                                       |->set ->in_commit and
>>  |                                       |  blocked to 1
>>  |                                       |->wait writers to be 1
>>  |                                       |  (writers is 1)
>>  |->join_transaction() fail              |
>>  |  (return -EBUSY, writers is still 1)  |
>>  |->wait_current_trans()                 |
>>                                          |->set trans_no_join to 1
>>                                          |  (close join transaction)
>>                                          |->wait writers to be 1
>>                                          |->continue committing
>>                                                                         (Task3 does nothing)
>>> Also, I think I don't fully understand Josef's example. What is
>>> preventing from async flushing to complete?
>>> Is task 3 waiting because trans_no_join is set?
>>> Is task 3 the one that actually does the delalloc flush?
>>
>> See above.
>>
>> Thanks
>> Miao
>>
>>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>>
>>>
>>>
>>>
>>>>         } while (ret == -EBUSY);
>>>>
>>>>         if (ret < 0) {
>>>> --
>>>> 1.6.5.2
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas March 25, 2013, 9:11 a.m. UTC | #6
Hi Miao,

On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> I am seeing another issue. Your fix prevents from TRANS_START to get
>> in the way of a committing transaction. But it does not prevent from
>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>> following loop:
>>
>> do {
>>     // attempt to do some useful stuff and/or sleep
>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>                (should_grow && cur_trans->num_joined != joined));
>>
>> What I see is basically that new writers join the transaction, while
>> btrfs_commit_transaction() does this loop. I see
>> cur_trans->num_writers decreasing, but then it increases, then
>> decreases etc. This can go for several seconds during heavy IO load.
>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>> a transaction over and over, thus delaying transaction commit. The IO
>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>
>> Do you observe such behavior? Do you believe it's problematic?
>
> I know this behavior, there is no problem with it, the latter code
> will prevent from TRANS_JOIN.
>
> 1672         spin_lock(&root->fs_info->trans_lock);
> 1673         root->fs_info->trans_no_join = 1;
> 1674         spin_unlock(&root->fs_info->trans_lock);
> 1675         wait_event(cur_trans->writer_wait,
> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>
Yes, this code prevents anybody from joining, but before
btrfs_commit_transaction() gets to this code, it may spend sometimes
10 seconds (in my tests) in the do-while loop, while new writers come
and go. Basically, it is not deterministic when the do-while loop will
exit, it depends on the IO pattern.

> And if we block the TRANS_JOIN at the place you point out, the deadlock
> will happen because we need deal with the ordered operations which will
> use TRANS_JOIN here.
>
> (I am dealing with the problem you said above by adding a new type of
> TRANS_* now)

Thanks.
Alex.


>
> Thanks
> Miao
>
>> Thanks,
>> Alex.
>>
>>
>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>> Hi Miao,
>>>> can you please explain your solution a bit more.
>>>>
>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>> Now btrfs_commit_transaction() does this
>>>>>
>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>
>>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>>> workers waited for each other.
>>>>> (See the following URL to get the detail
>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>
>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>> once we set ->in_commit.
>>>>>
>>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>>> ---
>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index bc2f2d1..71b7e2e 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>  }
>>>>>
>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>> +                                      int type)
>>>>> +{
>>>>> +       return !(trans->in_commit &&
>>>>> +                type != TRANS_JOIN &&
>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>   */
>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>                         return cur_trans->aborted;
>>>>>                 }
>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>> +                       return -EBUSY;
>>>>> +               }
>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>                 cur_trans->num_joined++;
>>>>> @@ -360,8 +372,11 @@ again:
>>>>>
>>>>>         do {
>>>>>                 ret = join_transaction(root, type);
>>>>> -               if (ret == -EBUSY)
>>>>> +               if (ret == -EBUSY) {
>>>>>                         wait_current_trans(root);
>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>> +                               ret = -ENOENT;
>>>>> +               }
>>>>
>>>> So I understand that instead of incrementing num_writes and joining
>>>> the current transaction, you do not join and wait for the current
>>>> transaction to unblock.
>>>
>>> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
>>> join and just wait for the current transaction to unblock if ->in_commit
>>> is set.
>>>
>>>> Which task in Josef's example
>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>> task 1, task 2 or task 3 is the one that will not join the
>>>> transaction, but instead wait?
>>>
>>> Task1 will not join the transaction, in this way, async inode flush
>>> won't run, and then task3 won't do anything.
>>>
>>> Before applying the patch:
>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>>> btrfs_start_transaction()
>>>  |->may_wait_transaction()
>>>  |  (return 0)
>>>  |                                      btrfs_commit_transaction()
>>>  |                                       |->set ->in_commit and
>>>  |                                       |  blocked to 1
>>>  |                                       |->wait writers to be 1
>>>  |                                       |  (writers is 1)
>>>  |->join_transaction()                   |
>>>  |  (writers is 2)                       |
>>>  |->btrfs_commit_transaction()           |
>>>      |                                   |->set trans_no_join to 1
>>>      |                                   |  (close join transaction)
>>>      |->btrfs_run_ordered_operations     |
>>>         (Those ordered operations        |
>>>          are added when releasing        |
>>>          file)                           |
>>>          |->async inode flush()          |
>>>          |->wait_flush_comlete()         |
>>>                                          |                              work_loop()
>>>                                          |                               |->run_work()
>>>                                          |                                   |->btrfs_join_transaction()
>>>                                          |                                       |->wait_current_trans()
>>>                                          |->wait writers to be 1
>>>
>>> This three tasks waited for each other.
>>>
>>> After applying this patch:
>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>> (Task1)                                 (Task2)                         (Task3)
>>> btrfs_start_transaction()
>>>  |->may_wait_transaction()
>>>  |  (return 0)
>>>  |                                      btrfs_commit_transaction()
>>>  |                                       |->set ->in_commit and
>>>  |                                       |  blocked to 1
>>>  |                                       |->wait writers to be 1
>>>  |                                       |  (writers is 1)
>>>  |->join_transaction() fail              |
>>>  |  (return -EBUSY, writers is still 1)  |
>>>  |->wait_current_trans()                 |
>>>                                          |->set trans_no_join to 1
>>>                                          |  (close join transaction)
>>>                                          |->wait writers to be 1
>>>                                          |->continue committing
>>>                                                                         (Task3 does nothing)
>>>> Also, I think I don't fully understand Josef's example. What is
>>>> preventing from async flushing to complete?
>>>> Is task 3 waiting because trans_no_join is set?
>>>> Is task 3 the one that actually does the delalloc flush?
>>>
>>> See above.
>>>
>>> Thanks
>>> Miao
>>>
>>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>         } while (ret == -EBUSY);
>>>>>
>>>>>         if (ret < 0) {
>>>>> --
>>>>> 1.6.5.2
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas April 10, 2013, 6:45 p.m. UTC | #7
Hi Miao,
I attempted to fix the issue by not joining a transaction that has
trans->in_commit set. I did something similar to what
wait_current_trans() does, but I did:

smp_rmb();
if (cur_trans && cur_trans->in_commit) {
...
wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
...

I also had to change the order of setting in_commit and blocked in
btrfs_commit_transaction:
	trans->transaction->blocked = 1;
	trans->transaction->in_commit = 1;
	smp_wmb();
to make sure that if in_commit is set, then blocked cannot be 0,
because btrfs_commit_transaction haven't set it yet to 1.

However, with this fix I observe two issues:
# With large trees and heavy commits, join_transaction() is delayed
sometimes by 1-3 seconds. This delays the host IO by too much.
# With this fix, I think too many transactions happen. Basically with
this fix, once transaction->in_commit is set, then I insist to open a
new transaction and not to join the current one. It has some bad
influence on host response times pattern, but I cannot exactly tell
why is that.

Did you have other fix in mind?

Without the fix, I observe sometimes commits that take like 80
seconds, out of which like 50 seconds are spent in the do-while loop
of btrfs_commit_transaction.

Thanks,
Alex.



On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
<alex.btrfs@zadarastorage.com> wrote:
> Hi Miao,
>
> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>> Hi Miao,
>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>> in the way of a committing transaction. But it does not prevent from
>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>> following loop:
>>>
>>> do {
>>>     // attempt to do some useful stuff and/or sleep
>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>                (should_grow && cur_trans->num_joined != joined));
>>>
>>> What I see is basically that new writers join the transaction, while
>>> btrfs_commit_transaction() does this loop. I see
>>> cur_trans->num_writers decreasing, but then it increases, then
>>> decreases etc. This can go for several seconds during heavy IO load.
>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>> a transaction over and over, thus delaying transaction commit. The IO
>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>
>>> Do you observe such behavior? Do you believe it's problematic?
>>
>> I know this behavior, there is no problem with it, the latter code
>> will prevent from TRANS_JOIN.
>>
>> 1672         spin_lock(&root->fs_info->trans_lock);
>> 1673         root->fs_info->trans_no_join = 1;
>> 1674         spin_unlock(&root->fs_info->trans_lock);
>> 1675         wait_event(cur_trans->writer_wait,
>> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>>
> Yes, this code prevents anybody from joining, but before
> btrfs_commit_transaction() gets to this code, it may spend sometimes
> 10 seconds (in my tests) in the do-while loop, while new writers come
> and go. Basically, it is not deterministic when the do-while loop will
> exit, it depends on the IO pattern.
>
>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>> will happen because we need deal with the ordered operations which will
>> use TRANS_JOIN here.
>>
>> (I am dealing with the problem you said above by adding a new type of
>> TRANS_* now)
>
> Thanks.
> Alex.
>
>
>>
>> Thanks
>> Miao
>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>> Hi Miao,
>>>>> can you please explain your solution a bit more.
>>>>>
>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>>> Now btrfs_commit_transaction() does this
>>>>>>
>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>
>>>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>>>> workers waited for each other.
>>>>>> (See the following URL to get the detail
>>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>>
>>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>>> once we set ->in_commit.
>>>>>>
>>>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>>>> ---
>>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>> index bc2f2d1..71b7e2e 100644
>>>>>> --- a/fs/btrfs/transaction.c
>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>>  }
>>>>>>
>>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>>> +                                      int type)
>>>>>> +{
>>>>>> +       return !(trans->in_commit &&
>>>>>> +                type != TRANS_JOIN &&
>>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>>> +}
>>>>>> +
>>>>>>  /*
>>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>>   */
>>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>>                         return cur_trans->aborted;
>>>>>>                 }
>>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>>> +                       return -EBUSY;
>>>>>> +               }
>>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>>                 cur_trans->num_joined++;
>>>>>> @@ -360,8 +372,11 @@ again:
>>>>>>
>>>>>>         do {
>>>>>>                 ret = join_transaction(root, type);
>>>>>> -               if (ret == -EBUSY)
>>>>>> +               if (ret == -EBUSY) {
>>>>>>                         wait_current_trans(root);
>>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>>> +                               ret = -ENOENT;
>>>>>> +               }
>>>>>
>>>>> So I understand that instead of incrementing num_writes and joining
>>>>> the current transaction, you do not join and wait for the current
>>>>> transaction to unblock.
>>>>
>>>> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
>>>> join and just wait for the current transaction to unblock if ->in_commit
>>>> is set.
>>>>
>>>>> Which task in Josef's example
>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>>> task 1, task 2 or task 3 is the one that will not join the
>>>>> transaction, but instead wait?
>>>>
>>>> Task1 will not join the transaction, in this way, async inode flush
>>>> won't run, and then task3 won't do anything.
>>>>
>>>> Before applying the patch:
>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>>>> btrfs_start_transaction()
>>>>  |->may_wait_transaction()
>>>>  |  (return 0)
>>>>  |                                      btrfs_commit_transaction()
>>>>  |                                       |->set ->in_commit and
>>>>  |                                       |  blocked to 1
>>>>  |                                       |->wait writers to be 1
>>>>  |                                       |  (writers is 1)
>>>>  |->join_transaction()                   |
>>>>  |  (writers is 2)                       |
>>>>  |->btrfs_commit_transaction()           |
>>>>      |                                   |->set trans_no_join to 1
>>>>      |                                   |  (close join transaction)
>>>>      |->btrfs_run_ordered_operations     |
>>>>         (Those ordered operations        |
>>>>          are added when releasing        |
>>>>          file)                           |
>>>>          |->async inode flush()          |
>>>>          |->wait_flush_comlete()         |
>>>>                                          |                              work_loop()
>>>>                                          |                               |->run_work()
>>>>                                          |                                   |->btrfs_join_transaction()
>>>>                                          |                                       |->wait_current_trans()
>>>>                                          |->wait writers to be 1
>>>>
>>>> This three tasks waited for each other.
>>>>
>>>> After applying this patch:
>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>> (Task1)                                 (Task2)                         (Task3)
>>>> btrfs_start_transaction()
>>>>  |->may_wait_transaction()
>>>>  |  (return 0)
>>>>  |                                      btrfs_commit_transaction()
>>>>  |                                       |->set ->in_commit and
>>>>  |                                       |  blocked to 1
>>>>  |                                       |->wait writers to be 1
>>>>  |                                       |  (writers is 1)
>>>>  |->join_transaction() fail              |
>>>>  |  (return -EBUSY, writers is still 1)  |
>>>>  |->wait_current_trans()                 |
>>>>                                          |->set trans_no_join to 1
>>>>                                          |  (close join transaction)
>>>>                                          |->wait writers to be 1
>>>>                                          |->continue committing
>>>>                                                                         (Task3 does nothing)
>>>>> Also, I think I don't fully understand Josef's example. What is
>>>>> preventing from async flushing to complete?
>>>>> Is task 3 waiting because trans_no_join is set?
>>>>> Is task 3 the one that actually does the delalloc flush?
>>>>
>>>> See above.
>>>>
>>>> Thanks
>>>> Miao
>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>         } while (ret == -EBUSY);
>>>>>>
>>>>>>         if (ret < 0) {
>>>>>> --
>>>>>> 1.6.5.2
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie April 11, 2013, 2:19 a.m. UTC | #8
On wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote:
> Hi Miao,
> I attempted to fix the issue by not joining a transaction that has
> trans->in_commit set. I did something similar to what
> wait_current_trans() does, but I did:
> 
> smp_rmb();
> if (cur_trans && cur_trans->in_commit) {
> ...
> wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
> ...

But it will introduce deadlock if we need flush some dirty pages, for
example: run ordered operation.

> 
> I also had to change the order of setting in_commit and blocked in
> btrfs_commit_transaction:
> 	trans->transaction->blocked = 1;
> 	trans->transaction->in_commit = 1;
> 	smp_wmb();
> to make sure that if in_commit is set, then blocked cannot be 0,
> because btrfs_commit_transaction haven't set it yet to 1.

we need smp_wmb() between 
	trans->transaction->blocked = 1;
and
	trans->transaction->in_commit = 1;

Or the cpu may set blocked after in_commmit.

> However, with this fix I observe two issues:
> # With large trees and heavy commits, join_transaction() is delayed
> sometimes by 1-3 seconds. This delays the host IO by too much.
> # With this fix, I think too many transactions happen. Basically with
> this fix, once transaction->in_commit is set, then I insist to open a
> new transaction and not to join the current one. It has some bad
> influence on host response times pattern, but I cannot exactly tell
> why is that.
> 
> Did you have other fix in mind?
> 
> Without the fix, I observe sometimes commits that take like 80
> seconds, out of which like 50 seconds are spent in the do-while loop
> of btrfs_commit_transaction.

I'm making the patch to fix this problem, my fix is:
- don't flush the dirty page during the commit if we create a snapshot
- introduce a new counter to count the external writers(TRANS_USERSPACE/TRANS_START)
  and if this counter is zero, we will break the while loop.
- if flushoncommit is set, we start delalloc flush before the while loop, not in the
  loop, so we don't flush the dirty pages again and again.
- introduce a new transaction handle type named TRANS_JOIN_ENDIO, which is used in the endio
  process.
- introduce a new state for transaction commit, at this state, we block TRANS_JOIN, but
  don't block TRANS_JOIN_ENDIO.

Thanks
Miao

> 
> Thanks,
> Alex.
> 
> 
> 
> On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
> <alex.btrfs@zadarastorage.com> wrote:
>> Hi Miao,
>>
>> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>>> Hi Miao,
>>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>>> in the way of a committing transaction. But it does not prevent from
>>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>>> following loop:
>>>>
>>>> do {
>>>>     // attempt to do some useful stuff and/or sleep
>>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>>                (should_grow && cur_trans->num_joined != joined));
>>>>
>>>> What I see is basically that new writers join the transaction, while
>>>> btrfs_commit_transaction() does this loop. I see
>>>> cur_trans->num_writers decreasing, but then it increases, then
>>>> decreases etc. This can go for several seconds during heavy IO load.
>>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>>> a transaction over and over, thus delaying transaction commit. The IO
>>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>>
>>>> Do you observe such behavior? Do you believe it's problematic?
>>>
>>> I know this behavior, there is no problem with it, the latter code
>>> will prevent from TRANS_JOIN.
>>>
>>> 1672         spin_lock(&root->fs_info->trans_lock);
>>> 1673         root->fs_info->trans_no_join = 1;
>>> 1674         spin_unlock(&root->fs_info->trans_lock);
>>> 1675         wait_event(cur_trans->writer_wait,
>>> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>>>
>> Yes, this code prevents anybody from joining, but before
>> btrfs_commit_transaction() gets to this code, it may spend sometimes
>> 10 seconds (in my tests) in the do-while loop, while new writers come
>> and go. Basically, it is not deterministic when the do-while loop will
>> exit, it depends on the IO pattern.
>>
>>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>>> will happen because we need deal with the ordered operations which will
>>> use TRANS_JOIN here.
>>>
>>> (I am dealing with the problem you said above by adding a new type of
>>> TRANS_* now)
>>
>> Thanks.
>> Alex.
>>
>>
>>>
>>> Thanks
>>> Miao
>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>>
>>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>>> Hi Miao,
>>>>>> can you please explain your solution a bit more.
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>>>> Now btrfs_commit_transaction() does this
>>>>>>>
>>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>>
>>>>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>>>>> workers waited for each other.
>>>>>>> (See the following URL to get the detail
>>>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>>>
>>>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>>>> once we set ->in_commit.
>>>>>>>
>>>>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>>>>> ---
>>>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>>> index bc2f2d1..71b7e2e 100644
>>>>>>> --- a/fs/btrfs/transaction.c
>>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>>>  }
>>>>>>>
>>>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>>>> +                                      int type)
>>>>>>> +{
>>>>>>> +       return !(trans->in_commit &&
>>>>>>> +                type != TRANS_JOIN &&
>>>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>>>   */
>>>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>>>                         return cur_trans->aborted;
>>>>>>>                 }
>>>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>>>> +                       return -EBUSY;
>>>>>>> +               }
>>>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>>>                 cur_trans->num_joined++;
>>>>>>> @@ -360,8 +372,11 @@ again:
>>>>>>>
>>>>>>>         do {
>>>>>>>                 ret = join_transaction(root, type);
>>>>>>> -               if (ret == -EBUSY)
>>>>>>> +               if (ret == -EBUSY) {
>>>>>>>                         wait_current_trans(root);
>>>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>>>> +                               ret = -ENOENT;
>>>>>>> +               }
>>>>>>
>>>>>> So I understand that instead of incrementing num_writes and joining
>>>>>> the current transaction, you do not join and wait for the current
>>>>>> transaction to unblock.
>>>>>
>>>>> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
>>>>> join and just wait for the current transaction to unblock if ->in_commit
>>>>> is set.
>>>>>
>>>>>> Which task in Josef's example
>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>>>> task 1, task 2 or task 3 is the one that will not join the
>>>>>> transaction, but instead wait?
>>>>>
>>>>> Task1 will not join the transaction, in this way, async inode flush
>>>>> won't run, and then task3 won't do anything.
>>>>>
>>>>> Before applying the patch:
>>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>>>>> btrfs_start_transaction()
>>>>>  |->may_wait_transaction()
>>>>>  |  (return 0)
>>>>>  |                                      btrfs_commit_transaction()
>>>>>  |                                       |->set ->in_commit and
>>>>>  |                                       |  blocked to 1
>>>>>  |                                       |->wait writers to be 1
>>>>>  |                                       |  (writers is 1)
>>>>>  |->join_transaction()                   |
>>>>>  |  (writers is 2)                       |
>>>>>  |->btrfs_commit_transaction()           |
>>>>>      |                                   |->set trans_no_join to 1
>>>>>      |                                   |  (close join transaction)
>>>>>      |->btrfs_run_ordered_operations     |
>>>>>         (Those ordered operations        |
>>>>>          are added when releasing        |
>>>>>          file)                           |
>>>>>          |->async inode flush()          |
>>>>>          |->wait_flush_comlete()         |
>>>>>                                          |                              work_loop()
>>>>>                                          |                               |->run_work()
>>>>>                                          |                                   |->btrfs_join_transaction()
>>>>>                                          |                                       |->wait_current_trans()
>>>>>                                          |->wait writers to be 1
>>>>>
>>>>> This three tasks waited for each other.
>>>>>
>>>>> After applying this patch:
>>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>>> (Task1)                                 (Task2)                         (Task3)
>>>>> btrfs_start_transaction()
>>>>>  |->may_wait_transaction()
>>>>>  |  (return 0)
>>>>>  |                                      btrfs_commit_transaction()
>>>>>  |                                       |->set ->in_commit and
>>>>>  |                                       |  blocked to 1
>>>>>  |                                       |->wait writers to be 1
>>>>>  |                                       |  (writers is 1)
>>>>>  |->join_transaction() fail              |
>>>>>  |  (return -EBUSY, writers is still 1)  |
>>>>>  |->wait_current_trans()                 |
>>>>>                                          |->set trans_no_join to 1
>>>>>                                          |  (close join transaction)
>>>>>                                          |->wait writers to be 1
>>>>>                                          |->continue committing
>>>>>                                                                         (Task3 does nothing)
>>>>>> Also, I think I don't fully understand Josef's example. What is
>>>>>> preventing from async flushing to complete?
>>>>>> Is task 3 waiting because trans_no_join is set?
>>>>>> Is task 3 the one that actually does the delalloc flush?
>>>>>
>>>>> See above.
>>>>>
>>>>> Thanks
>>>>> Miao
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Alex.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>         } while (ret == -EBUSY);
>>>>>>>
>>>>>>>         if (ret < 0) {
>>>>>>> --
>>>>>>> 1.6.5.2
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas June 12, 2013, 8:11 p.m. UTC | #9
Hi Miao,

On Thu, May 9, 2013 at 10:57 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> Hi, Alex
>
> Could you try the following patchset?
>
>   git://github.com/miaoxie/linux-btrfs.git trans-commit-improve
>
> I think it can avoid the problem you said below.
>
> Note: this patchset is against chris's for-linus branch.

I reviewed the code starting from:
69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
the transaction commit
until
2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()

It looks very good. Let me check if I understand the fix correctly:
# When transaction starts to commit, we want to wait only for external
writers (those that did ATTACH/START/USERSPACE)
# We guarantee at this point that no new external writers will hop on
the committing transaction, by setting ->blocked state, so we only
wait for existing extwriters to detach from transaction
# We do not care at this point for TRANS_JOIN etc, we let them hop on
if they want
# When all external writers have detached, we flush their delalloc and
then we prevent all the others to join (TRANS_JOIN etc)

# Previously, we had the do-while loop, that intended to do the same,
but it used num_writers, which counts both external writers and also
TRANS_JOIN. So the loop was racy because new joins prevented it from
completing.

Is my understanding correct?

I have some questions:
# Why was the do-while loop needed? Can we just delete the do-while
loop as it was before, call flush_all_pending stuffs(),  then set
trans_no_join and wait for all writers to detach? Is there some
correctness problem here?
Or we need to wait for external writers to detach before calling
flush_all_pending_stuffs() one last time?

# Why TRANS_ATTACH is considered external writer?

# Can I apply this fix to 3.8.x kernel (manually, of course)? Or some
additional things are needed that are missing in this kernel?

Thanks,
Alex.






>
> Thanks
> Miao
>
> On Wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote:
>> Hi Miao,
>> I attempted to fix the issue by not joining a transaction that has
>> trans->in_commit set. I did something similar to what
>> wait_current_trans() does, but I did:
>>
>> smp_rmb();
>> if (cur_trans && cur_trans->in_commit) {
>> ...
>> wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
>> ...
>>
>> I also had to change the order of setting in_commit and blocked in
>> btrfs_commit_transaction:
>>       trans->transaction->blocked = 1;
>>       trans->transaction->in_commit = 1;
>>       smp_wmb();
>> to make sure that if in_commit is set, then blocked cannot be 0,
>> because btrfs_commit_transaction haven't set it yet to 1.
>>
>> However, with this fix I observe two issues:
>> # With large trees and heavy commits, join_transaction() is delayed
>> sometimes by 1-3 seconds. This delays the host IO by too much.
>> # With this fix, I think too many transactions happen. Basically with
>> this fix, once transaction->in_commit is set, then I insist to open a
>> new transaction and not to join the current one. It has some bad
>> influence on host response times pattern, but I cannot exactly tell
>> why is that.
>>
>> Did you have other fix in mind?
>>
>> Without the fix, I observe sometimes commits that take like 80
>> seconds, out of which like 50 seconds are spent in the do-while loop
>> of btrfs_commit_transaction.
>>
>> Thanks,
>> Alex.
>>
>>
>>
>> On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
>> <alex.btrfs@zadarastorage.com> wrote:
>>> Hi Miao,
>>>
>>> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>>>> Hi Miao,
>>>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>>>> in the way of a committing transaction. But it does not prevent from
>>>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>>>> following loop:
>>>>>
>>>>> do {
>>>>>     // attempt to do some useful stuff and/or sleep
>>>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>>>                (should_grow && cur_trans->num_joined != joined));
>>>>>
>>>>> What I see is basically that new writers join the transaction, while
>>>>> btrfs_commit_transaction() does this loop. I see
>>>>> cur_trans->num_writers decreasing, but then it increases, then
>>>>> decreases etc. This can go for several seconds during heavy IO load.
>>>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>>>> a transaction over and over, thus delaying transaction commit. The IO
>>>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>>>
>>>>> Do you observe such behavior? Do you believe it's problematic?
>>>>
>>>> I know this behavior, there is no problem with it, the latter code
>>>> will prevent from TRANS_JOIN.
>>>>
>>>> 1672         spin_lock(&root->fs_info->trans_lock);
>>>> 1673         root->fs_info->trans_no_join = 1;
>>>> 1674         spin_unlock(&root->fs_info->trans_lock);
>>>> 1675         wait_event(cur_trans->writer_wait,
>>>> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>>>>
>>> Yes, this code prevents anybody from joining, but before
>>> btrfs_commit_transaction() gets to this code, it may spend sometimes
>>> 10 seconds (in my tests) in the do-while loop, while new writers come
>>> and go. Basically, it is not deterministic when the do-while loop will
>>> exit, it depends on the IO pattern.
>>>
>>>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>>>> will happen because we need deal with the ordered operations which will
>>>> use TRANS_JOIN here.
>>>>
>>>> (I am dealing with the problem you said above by adding a new type of
>>>> TRANS_* now)
>>>
>>> Thanks.
>>> Alex.
>>>
>>>
>>>>
>>>> Thanks
>>>> Miao
>>>>
>>>>> Thanks,
>>>>> Alex.
>>>>>
>>>>>
>>>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>>>> Hi Miao,
>>>>>>> can you please explain your solution a bit more.
>>>>>>>
>>>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>>>>> Now btrfs_commit_transaction() does this
>>>>>>>>
>>>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>>>
>>>>>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>>>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>>>>>> workers waited for each other.
>>>>>>>> (See the following URL to get the detail
>>>>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>>>>
>>>>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>>>>> once we set ->in_commit.
>>>>>>>>
>>>>>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>>>>>> ---
>>>>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>>>> index bc2f2d1..71b7e2e 100644
>>>>>>>> --- a/fs/btrfs/transaction.c
>>>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>>>>> +                                      int type)
>>>>>>>> +{
>>>>>>>> +       return !(trans->in_commit &&
>>>>>>>> +                type != TRANS_JOIN &&
>>>>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>>>>   */
>>>>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>>>>                         return cur_trans->aborted;
>>>>>>>>                 }
>>>>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>>>>> +                       return -EBUSY;
>>>>>>>> +               }
>>>>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>>>>                 cur_trans->num_joined++;
>>>>>>>> @@ -360,8 +372,11 @@ again:
>>>>>>>>
>>>>>>>>         do {
>>>>>>>>                 ret = join_transaction(root, type);
>>>>>>>> -               if (ret == -EBUSY)
>>>>>>>> +               if (ret == -EBUSY) {
>>>>>>>>                         wait_current_trans(root);
>>>>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>>>>> +                               ret = -ENOENT;
>>>>>>>> +               }
>>>>>>>
>>>>>>> So I understand that instead of incrementing num_writes and joining
>>>>>>> the current transaction, you do not join and wait for the current
>>>>>>> transaction to unblock.
>>>>>>
>>>>>> More specifically?TRANS_START?TRANS_USERSPACE and TRANS_ATTACH can not
>>>>>> join and just wait for the current transaction to unblock if ->in_commit
>>>>>> is set.
>>>>>>
>>>>>>> Which task in Josef's example
>>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>>>>> task 1, task 2 or task 3 is the one that will not join the
>>>>>>> transaction, but instead wait?
>>>>>>
>>>>>> Task1 will not join the transaction, in this way, async inode flush
>>>>>> won't run, and then task3 won't do anything.
>>>>>>
>>>>>> Before applying the patch:
>>>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>>>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>>>>>> btrfs_start_transaction()
>>>>>>  |->may_wait_transaction()
>>>>>>  |  (return 0)
>>>>>>  |                                      btrfs_commit_transaction()
>>>>>>  |                                       |->set ->in_commit and
>>>>>>  |                                       |  blocked to 1
>>>>>>  |                                       |->wait writers to be 1
>>>>>>  |                                       |  (writers is 1)
>>>>>>  |->join_transaction()                   |
>>>>>>  |  (writers is 2)                       |
>>>>>>  |->btrfs_commit_transaction()           |
>>>>>>      |                                   |->set trans_no_join to 1
>>>>>>      |                                   |  (close join transaction)
>>>>>>      |->btrfs_run_ordered_operations     |
>>>>>>         (Those ordered operations        |
>>>>>>          are added when releasing        |
>>>>>>          file)                           |
>>>>>>          |->async inode flush()          |
>>>>>>          |->wait_flush_comlete()         |
>>>>>>                                          |                              work_loop()
>>>>>>                                          |                               |->run_work()
>>>>>>                                          |                                   |->btrfs_join_transaction()
>>>>>>                                          |                                       |->wait_current_trans()
>>>>>>                                          |->wait writers to be 1
>>>>>>
>>>>>> This three tasks waited for each other.
>>>>>>
>>>>>> After applying this patch:
>>>>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>>>>> (Task1)                                 (Task2)                         (Task3)
>>>>>> btrfs_start_transaction()
>>>>>>  |->may_wait_transaction()
>>>>>>  |  (return 0)
>>>>>>  |                                      btrfs_commit_transaction()
>>>>>>  |                                       |->set ->in_commit and
>>>>>>  |                                       |  blocked to 1
>>>>>>  |                                       |->wait writers to be 1
>>>>>>  |                                       |  (writers is 1)
>>>>>>  |->join_transaction() fail              |
>>>>>>  |  (return -EBUSY, writers is still 1)  |
>>>>>>  |->wait_current_trans()                 |
>>>>>>                                          |->set trans_no_join to 1
>>>>>>                                          |  (close join transaction)
>>>>>>                                          |->wait writers to be 1
>>>>>>                                          |->continue committing
>>>>>>                                                                         (Task3 does nothing)
>>>>>>> Also, I think I don't fully understand Josef's example. What is
>>>>>>> preventing from async flushing to complete?
>>>>>>> Is task 3 waiting because trans_no_join is set?
>>>>>>> Is task 3 the one that actually does the delalloc flush?
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>> Thanks
>>>>>> Miao
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>         } while (ret == -EBUSY);
>>>>>>>>
>>>>>>>>         if (ret < 0) {
>>>>>>>> --
>>>>>>>> 1.6.5.2
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie June 13, 2013, 3:08 a.m. UTC | #10
On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
> I reviewed the code starting from:
> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
> the transaction commit
> until
> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
> 
> It looks very good. Let me check if I understand the fix correctly:
> # When transaction starts to commit, we want to wait only for external
> writers (those that did ATTACH/START/USERSPACE)
> # We guarantee at this point that no new external writers will hop on
> the committing transaction, by setting ->blocked state, so we only
> wait for existing extwriters to detach from transaction
> # We do not care at this point for TRANS_JOIN etc, we let them hop on
> if they want
> # When all external writers have detached, we flush their delalloc and
> then we prevent all the others to join (TRANS_JOIN etc)
> 
> # Previously, we had the do-while loop, that intended to do the same,
> but it used num_writers, which counts both external writers and also
> TRANS_JOIN. So the loop was racy because new joins prevented it from
> completing.
> 
> Is my understanding correct?

Yes, you are right.

> I have some questions:
> # Why was the do-while loop needed? Can we just delete the do-while
> loop as it was before, call flush_all_pending stuffs(),  then set
> trans_no_join and wait for all writers to detach? Is there some
> correctness problem here?
> Or we need to wait for external writers to detach before calling
> flush_all_pending_stuffs() one last time?

The external writers will introduce pending works, we need flush them
after they detach, otherwise we will forget to deal with them at the current
transaction just like the following case:

	Task1				Task2
	start_transaction
					commit_transaction
					  flush_all_pending_stuffs
        add pending works
        end_transaction
					  ...


> # Why TRANS_ATTACH is considered external writer?

- at most cases, it is done by the users' operations.
- if in_commit is set, we shouldn't start it, or the deadlock will happen.
  it is the same as TRANS_START/TRANS_USERSPACE.

> # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some
> additional things are needed that are missing in this kernel?

Yes, you can rebase it against 3.8.x kernel freely.

Thanks
Miao
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas June 16, 2013, 10:38 a.m. UTC | #11
Hi Miao,

On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>> I reviewed the code starting from:
>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>> the transaction commit
>> until
>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>
>> It looks very good. Let me check if I understand the fix correctly:
>> # When transaction starts to commit, we want to wait only for external
>> writers (those that did ATTACH/START/USERSPACE)
>> # We guarantee at this point that no new external writers will hop on
>> the committing transaction, by setting ->blocked state, so we only
>> wait for existing extwriters to detach from transaction

I have a doubt about this point with your new code. Example:
Task1 - external writer
Task2 - transaction kthread

Task1                                                                   Task2
|start_transaction(TRANS_START)                           |
|-wait_current_trans(blocked=0, so it doesn't wait)     |
|-join_transaction()                                                  |
|--lock(trans_lock)                                                   |
|--can_join_transaction() YES                                  |
|
      |-btrfs_commit_transaction()
|
      |--blocked=1
|
      |--in_commit=1
|
      |--wait_event(extwriter== 0);
|
      |--btrfs_flush_all_pending_stuffs()
|                                                                            |
|--extwriter_counter_inc()                                         |
|--unlock(trans_lock)                                               |
|
      | lock(trans_lock)
|
      | trans_no_join=1

Basically, the "blocked/in_commit" check is not synchronized with
joining a transaction. After checking "blocked", the external writer
may proceed and join the transaction. Right before joining, it calls
can_join_transaction(). But this function checks in_commit flag under
fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
under trans_lock, but under commit_lock, so checking this flag is not
synchronized.

Or maybe I am wrong, because btrfs_commit_transaction() locks and
unlocks trans_lock to check for previous transaction, so by accident
there is no problem, and above scenario cannot happen?


>> # We do not care at this point for TRANS_JOIN etc, we let them hop on
>> if they want
>> # When all external writers have detached, we flush their delalloc and
>> then we prevent all the others to join (TRANS_JOIN etc)
>>
>> # Previously, we had the do-while loop, that intended to do the same,
>> but it used num_writers, which counts both external writers and also
>> TRANS_JOIN. So the loop was racy because new joins prevented it from
>> completing.
>>
>> Is my understanding correct?
>
> Yes, you are right.
>
>> I have some questions:
>> # Why was the do-while loop needed? Can we just delete the do-while
>> loop as it was before, call flush_all_pending stuffs(),  then set
>> trans_no_join and wait for all writers to detach? Is there some
>> correctness problem here?
>> Or we need to wait for external writers to detach before calling
>> flush_all_pending_stuffs() one last time?
>
> The external writers will introduce pending works, we need flush them
> after they detach, otherwise we will forget to deal with them at the current
> transaction just like the following case:
>
>         Task1                           Task2
>         start_transaction
>                                         commit_transaction
>                                           flush_all_pending_stuffs
>         add pending works
>         end_transaction
>                                           ...
>
>
>> # Why TRANS_ATTACH is considered external writer?
>
> - at most cases, it is done by the users' operations.
> - if in_commit is set, we shouldn't start it, or the deadlock will happen.
>   it is the same as TRANS_START/TRANS_USERSPACE.
>
>> # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some
>> additional things are needed that are missing in this kernel?
>
> Yes, you can rebase it against 3.8.x kernel freely.
>
> Thanks
> Miao

Thanks,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie June 17, 2013, 1:51 a.m. UTC | #12
On 	sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
> Hi Miao,
> 
> On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>>> I reviewed the code starting from:
>>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>>> the transaction commit
>>> until
>>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>>
>>> It looks very good. Let me check if I understand the fix correctly:
>>> # When transaction starts to commit, we want to wait only for external
>>> writers (those that did ATTACH/START/USERSPACE)
>>> # We guarantee at this point that no new external writers will hop on
>>> the committing transaction, by setting ->blocked state, so we only
>>> wait for existing extwriters to detach from transaction
> 
> I have a doubt about this point with your new code. Example:
> Task1 - external writer
> Task2 - transaction kthread
> 
> Task1                                                                   Task2
> |start_transaction(TRANS_START)                           |
> |-wait_current_trans(blocked=0, so it doesn't wait)     |
> |-join_transaction()                                                  |
> |--lock(trans_lock)                                                   |
> |--can_join_transaction() YES                                  |
> |
>       |-btrfs_commit_transaction()
> |
>       |--blocked=1
> |
>       |--in_commit=1
> |
>       |--wait_event(extwriter== 0);
> |
>       |--btrfs_flush_all_pending_stuffs()
> |                                                                            |
> |--extwriter_counter_inc()                                         |
> |--unlock(trans_lock)                                               |
> |
>       | lock(trans_lock)
> |
>       | trans_no_join=1
> 
> Basically, the "blocked/in_commit" check is not synchronized with
> joining a transaction. After checking "blocked", the external writer
> may proceed and join the transaction. Right before joining, it calls
> can_join_transaction(). But this function checks in_commit flag under
> fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
> under trans_lock, but under commit_lock, so checking this flag is not
> synchronized.
> 
> Or maybe I am wrong, because btrfs_commit_transaction() locks and
> unlocks trans_lock to check for previous transaction, so by accident
> there is no problem, and above scenario cannot happen?

Your analysis at the last section is right, so the right process is:

Task1							Task2
|start_transaction(TRANS_START)				|
|-wait_current_trans(blocked=0, so it doesn't wait)	|
|-join_transaction()					|
|--lock(trans_lock)					|
|--can_join_transaction() YES				|
|							|-btrfs_commit_transaction()
|							|--blocked=1
|							|--in_commit=1
|--extwriter_counter_inc()				|
|--unlock(trans_lock)					|
|							|--lock(trans_lock)
|							|--...
|							|--unlock(trans_lock)
|							|--...
|							|--wait_event(extwriter== 0);
|							|--btrfs_flush_all_pending_stuffs()

The problem you worried can not happen.

Anyway, it is not good that the "blocked/in_commit" check is not synchronized with
joining a transaction. So I modified the relative code in this patchset.

Miao
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas June 26, 2013, 5:53 p.m. UTC | #13
Hi Miao,

On Mon, Jun 17, 2013 at 4:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On      sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
>> Hi Miao,
>>
>> On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>>>> I reviewed the code starting from:
>>>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>>>> the transaction commit
>>>> until
>>>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>>>
>>>> It looks very good. Let me check if I understand the fix correctly:
>>>> # When transaction starts to commit, we want to wait only for external
>>>> writers (those that did ATTACH/START/USERSPACE)
>>>> # We guarantee at this point that no new external writers will hop on
>>>> the committing transaction, by setting ->blocked state, so we only
>>>> wait for existing extwriters to detach from transaction
>>
>> I have a doubt about this point with your new code. Example:
>> Task1 - external writer
>> Task2 - transaction kthread
>>
>> Task1                                                                   Task2
>> |start_transaction(TRANS_START)                           |
>> |-wait_current_trans(blocked=0, so it doesn't wait)     |
>> |-join_transaction()                                                  |
>> |--lock(trans_lock)                                                   |
>> |--can_join_transaction() YES                                  |
>> |
>>       |-btrfs_commit_transaction()
>> |
>>       |--blocked=1
>> |
>>       |--in_commit=1
>> |
>>       |--wait_event(extwriter== 0);
>> |
>>       |--btrfs_flush_all_pending_stuffs()
>> |                                                                            |
>> |--extwriter_counter_inc()                                         |
>> |--unlock(trans_lock)                                               |
>> |
>>       | lock(trans_lock)
>> |
>>       | trans_no_join=1
>>
>> Basically, the "blocked/in_commit" check is not synchronized with
>> joining a transaction. After checking "blocked", the external writer
>> may proceed and join the transaction. Right before joining, it calls
>> can_join_transaction(). But this function checks in_commit flag under
>> fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
>> under trans_lock, but under commit_lock, so checking this flag is not
>> synchronized.
>>
>> Or maybe I am wrong, because btrfs_commit_transaction() locks and
>> unlocks trans_lock to check for previous transaction, so by accident
>> there is no problem, and above scenario cannot happen?
>
> Your analysis at the last section is right, so the right process is:
>
> Task1                                                   Task2
> |start_transaction(TRANS_START)                         |
> |-wait_current_trans(blocked=0, so it doesn't wait)     |
> |-join_transaction()                                    |
> |--lock(trans_lock)                                     |
> |--can_join_transaction() YES                           |
> |                                                       |-btrfs_commit_transaction()
> |                                                       |--blocked=1
> |                                                       |--in_commit=1
> |--extwriter_counter_inc()                              |
> |--unlock(trans_lock)                                   |
> |                                                       |--lock(trans_lock)
> |                                                       |--...
> |                                                       |--unlock(trans_lock)
> |                                                       |--...
> |                                                       |--wait_event(extwriter== 0);
> |                                                       |--btrfs_flush_all_pending_stuffs()
>
> The problem you worried can not happen.
>
> Anyway, it is not good that the "blocked/in_commit" check is not synchronized with
> joining a transaction. So I modified the relative code in this patchset.
>

The four patches that we applied related to extwriters issue work very
good. They definitely solve the non-deterministic behavior while
waiting for the writers to detach. Thanks for addressing this issue.
One note is that the new behavior is perhaps less "friendly" to the
transaction join flow. With your fix, the committer unconditionally
sets "trans_no_join" and waits for old writers to detach. At this
point, new joins will block. While previously, the committer was
"finding a convenient spot" in the join pattern, when all writers have
detached (although it was non-deterministic when this will happen). So
perhaps some compromise can be done - like wait for 10sec until all
writers detach, and if not, just go ahead and set trans_no_join.

Thanks for your help!
Alex.


> Miao
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie July 4, 2013, 2:28 a.m. UTC | #14
On Wed, 26 Jun 2013 20:53:00 +0300, Alex Lyakas wrote:
> Hi Miao,
> 
> On Mon, Jun 17, 2013 at 4:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> On      sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
>>> Hi Miao,
>>>
>>> On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>>>>> I reviewed the code starting from:
>>>>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>>>>> the transaction commit
>>>>> until
>>>>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>>>>
>>>>> It looks very good. Let me check if I understand the fix correctly:
>>>>> # When transaction starts to commit, we want to wait only for external
>>>>> writers (those that did ATTACH/START/USERSPACE)
>>>>> # We guarantee at this point that no new external writers will hop on
>>>>> the committing transaction, by setting ->blocked state, so we only
>>>>> wait for existing extwriters to detach from transaction
>>>
>>> I have a doubt about this point with your new code. Example:
>>> Task1 - external writer
>>> Task2 - transaction kthread
>>>
>>> Task1                                                                   Task2
>>> |start_transaction(TRANS_START)                           |
>>> |-wait_current_trans(blocked=0, so it doesn't wait)     |
>>> |-join_transaction()                                                  |
>>> |--lock(trans_lock)                                                   |
>>> |--can_join_transaction() YES                                  |
>>> |
>>>       |-btrfs_commit_transaction()
>>> |
>>>       |--blocked=1
>>> |
>>>       |--in_commit=1
>>> |
>>>       |--wait_event(extwriter== 0);
>>> |
>>>       |--btrfs_flush_all_pending_stuffs()
>>> |                                                                            |
>>> |--extwriter_counter_inc()                                         |
>>> |--unlock(trans_lock)                                               |
>>> |
>>>       | lock(trans_lock)
>>> |
>>>       | trans_no_join=1
>>>
>>> Basically, the "blocked/in_commit" check is not synchronized with
>>> joining a transaction. After checking "blocked", the external writer
>>> may proceed and join the transaction. Right before joining, it calls
>>> can_join_transaction(). But this function checks in_commit flag under
>>> fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
>>> under trans_lock, but under commit_lock, so checking this flag is not
>>> synchronized.
>>>
>>> Or maybe I am wrong, because btrfs_commit_transaction() locks and
>>> unlocks trans_lock to check for previous transaction, so by accident
>>> there is no problem, and above scenario cannot happen?
>>
>> Your analysis at the last section is right, so the right process is:
>>
>> Task1                                                   Task2
>> |start_transaction(TRANS_START)                         |
>> |-wait_current_trans(blocked=0, so it doesn't wait)     |
>> |-join_transaction()                                    |
>> |--lock(trans_lock)                                     |
>> |--can_join_transaction() YES                           |
>> |                                                       |-btrfs_commit_transaction()
>> |                                                       |--blocked=1
>> |                                                       |--in_commit=1
>> |--extwriter_counter_inc()                              |
>> |--unlock(trans_lock)                                   |
>> |                                                       |--lock(trans_lock)
>> |                                                       |--...
>> |                                                       |--unlock(trans_lock)
>> |                                                       |--...
>> |                                                       |--wait_event(extwriter== 0);
>> |                                                       |--btrfs_flush_all_pending_stuffs()
>>
>> The problem you worried can not happen.
>>
>> Anyway, it is not good that the "blocked/in_commit" check is not synchronized with
>> joining a transaction. So I modified the relative code in this patchset.
>>
> 
> The four patches that we applied related to extwriters issue work very
> good. They definitely solve the non-deterministic behavior while
> waiting for the writers to detach. Thanks for addressing this issue.
> One note is that the new behavior is perhaps less "friendly" to the
> transaction join flow. With your fix, the committer unconditionally
> sets "trans_no_join" and waits for old writers to detach. At this
> point, new joins will block. While previously, the committer was
> "finding a convenient spot" in the join pattern, when all writers have
> detached (although it was non-deterministic when this will happen). So
> perhaps some compromise can be done - like wait for 10sec until all
> writers detach, and if not, just go ahead and set trans_no_join.

I don't like the compromise because it will make the user task wait for a long time.
I think the transaction commit process should be done as quickly as possible.

Thanks
Miao
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bc2f2d1..71b7e2e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -51,6 +51,14 @@  static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
 }
 
+static inline int can_join_transaction(struct btrfs_transaction *trans,
+				       int type)
+{
+	return !(trans->in_commit &&
+		 type != TRANS_JOIN &&
+		 type != TRANS_JOIN_NOLOCK);
+}
+
 /*
  * either allocate a new transaction or hop into the existing one
  */
@@ -86,6 +94,10 @@  loop:
 			spin_unlock(&fs_info->trans_lock);
 			return cur_trans->aborted;
 		}
+		if (!can_join_transaction(cur_trans, type)) {
+			spin_unlock(&fs_info->trans_lock);
+			return -EBUSY;
+		}
 		atomic_inc(&cur_trans->use_count);
 		atomic_inc(&cur_trans->num_writers);
 		cur_trans->num_joined++;
@@ -360,8 +372,11 @@  again:
 
 	do {
 		ret = join_transaction(root, type);
-		if (ret == -EBUSY)
+		if (ret == -EBUSY) {
 			wait_current_trans(root);
+			if (unlikely(type == TRANS_ATTACH))
+				ret = -ENOENT;
+		}
 	} while (ret == -EBUSY);
 
 	if (ret < 0) {