
[v2,1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag

Message ID 20200617175310.20912-2-longman@redhat.com (mailing list archive)
State Deferred, archived
Series sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs

Commit Message

Waiman Long June 17, 2020, 5:53 p.m. UTC
There are cases where calling kmalloc() can lead to a false positive
lockdep splat. One notable example, which can happen when freezing an
xfs filesystem, is as follows:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen. However, there is no easy way to modify
lockdep to handle this situation properly.

One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
in the appropriate kmalloc() calls. However, it would be cumbersome to
locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP, and it
is easy to miss some, especially when the code is updated in the future.

Another alternative is to have a per-process global state that indicates
the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
per-process flag is now added. After adding this new bit, there are
still 2 free bits left.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h    |  7 +++++++
 include/linux/sched/mm.h | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 5 deletions(-)
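
A per-task flag like the one described above is usually set for a bounded scope and then put back. The following is a minimal user-space sketch of that save/restore pattern, loosely modelled on the kernel's existing memalloc_nofs_save()/memalloc_nofs_restore() helpers. The names memalloc_nolockdep_save(), memalloc_nolockdep_restore() and struct task_model are hypothetical and are not part of this patch; the flag value is illustrative only.

```c
/* User-space model; the real PF_* definitions live in include/linux/sched.h. */
#define PF_MEMALLOC_NOLOCKDEP 0x01000000u /* illustrative value only */

struct task_model {
	unsigned int flags;	/* stand-in for current->flags */
};

/* Hypothetical helper: enter a no-lockdep allocation scope, return prior state. */
static inline unsigned int memalloc_nolockdep_save(struct task_model *tsk)
{
	unsigned int old = tsk->flags & PF_MEMALLOC_NOLOCKDEP;

	tsk->flags |= PF_MEMALLOC_NOLOCKDEP;
	return old;
}

/* Hypothetical helper: leave the scope, restoring the state seen at save time. */
static inline void memalloc_nolockdep_restore(struct task_model *tsk,
					      unsigned int old)
{
	tsk->flags = (tsk->flags & ~PF_MEMALLOC_NOLOCKDEP) | old;
}
```

Returning the previous state from the save helper lets scopes nest: an inner restore leaves the flag set if an outer scope had already set it.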

Comments

Dave Chinner June 18, 2020, 12:01 a.m. UTC | #1
On Wed, Jun 17, 2020 at 01:53:09PM -0400, Waiman Long wrote:
> There are cases where calling kmalloc() can lead to false positive
> lockdep splat. One notable example that can happen in the freezing of
> the xfs filesystem is as follows:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_internal);
>                                lock(fs_reclaim);
>                                lock(sb_internal);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
> This is a false positive as all the dirty pages are flushed out before
> the filesystem can be frozen. However, there is no easy way to modify
> lockdep to handle this situation properly.
> 
> One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
> in the appropriate kmalloc() calls.  However, it will be cumbersome to
> locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it
> is easy to miss some especially when the code is updated in the future.
> 
> Another alternative is to have a per-process global state that indicates
> the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
> individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
> per-process flag is now added. After adding this new bit, there are
> still 2 free bits left.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/sched.h    |  7 +++++++
>  include/linux/sched/mm.h | 15 ++++++++++-----
>  2 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b62e6aaf28f0..44247cbc9073 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1508,6 +1508,7 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
>  #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>  						 * I am cleaning dirty pages from some other bdi. */
> +#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */

Why is this considered a safe thing to do? Any context that sets
__PF_MEMALLOC_NOLOCKDEP will now behave differently in memory
reclaim as it will think that PF_LOCAL_THROTTLE is set when lockdep
is enabled.

>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>  #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
> @@ -1519,6 +1520,12 @@ extern struct pid *cad_pid;
>  #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
>  
> +#ifdef CONFIG_LOCKDEP
> +#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
> +#else
> +#define PF_MEMALLOC_NOLOCKDEP	0
> +#endif
> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 480a4d1b7dd8..4a076a148568 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk)
>   * Applies per-task gfp context to the given allocation flags.
>   * PF_MEMALLOC_NOIO implies GFP_NOIO
>   * PF_MEMALLOC_NOFS implies GFP_NOFS
> + * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
>   * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> -	if (unlikely(current->flags &
> -		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
> +	unsigned int pflags = current->flags;
> +
> +	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
> +			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {

That needs a PF_MEMALLOC_MASK.

And, really, if we are playing "re-use existing bits" games because
we've run out of process flags, all these memalloc flags should be
moved to a new field in the task, say current->memalloc_flags. You
could also move PF_SWAPWRITE, PF_LOCAL_THROTTLE, and PF_KSWAPD into
that field as well as they are all memory allocation context process
flags...

Cheers,

Dave.
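
The PF_MEMALLOC_MASK request above could look roughly like the sketch below. This is a user-space illustration, not the posted code: only the PF_MEMALLOC_NOIO value appears in the quoted diff, the other flag values are stand-ins chosen so that no bits overlap, and has_memalloc_context() is a hypothetical helper name.

```c
#include <stdbool.h>

/* Stand-in flag values; only PF_MEMALLOC_NOIO is taken from the quoted diff. */
#define PF_MEMALLOC_NOFS	0x00040000u
#define PF_MEMALLOC_NOIO	0x00080000u
#define PF_MEMALLOC_NOLOCKDEP	0x01000000u
#define PF_MEMALLOC_NOCMA	0x10000000u

/* One name for every flag that changes the allocation context. */
#define PF_MEMALLOC_MASK	(PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | \
				 PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP)

/* Hypothetical helper: true if any memalloc context flag is set. */
static inline bool has_memalloc_context(unsigned int pflags)
{
	return (pflags & PF_MEMALLOC_MASK) != 0;
}
```

With a mask like this, adding another memalloc flag later means touching one definition rather than every open-coded test.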
Waiman Long June 18, 2020, 1:32 a.m. UTC | #2
On 6/17/20 8:01 PM, Dave Chinner wrote:
> On Wed, Jun 17, 2020 at 01:53:09PM -0400, Waiman Long wrote:
>> There are cases where calling kmalloc() can lead to false positive
>> lockdep splat. One notable example that can happen in the freezing of
>> the xfs filesystem is as follows:
>>
>>   Possible unsafe locking scenario:
>>
>>         CPU0                    CPU1
>>         ----                    ----
>>    lock(sb_internal);
>>                                 lock(fs_reclaim);
>>                                 lock(sb_internal);
>>    lock(fs_reclaim);
>>
>>   *** DEADLOCK ***
>>
>> This is a false positive as all the dirty pages are flushed out before
>> the filesystem can be frozen. However, there is no easy way to modify
>> lockdep to handle this situation properly.
>>
>> One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
>> in the appropriate kmalloc() calls.  However, it will be cumbersome to
>> locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it
>> is easy to miss some especially when the code is updated in the future.
>>
>> Another alternative is to have a per-process global state that indicates
>> the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
>> individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
>> per-process flag is now added. After adding this new bit, there are
>> still 2 free bits left.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   include/linux/sched.h    |  7 +++++++
>>   include/linux/sched/mm.h | 15 ++++++++++-----
>>   2 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b62e6aaf28f0..44247cbc9073 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1508,6 +1508,7 @@ extern struct pid *cad_pid;
>>   #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
>>   #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>>   						 * I am cleaning dirty pages from some other bdi. */
>> +#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */
> Why is this considered a safe thing to do? Any context that sets
> __PF_MEMALLOC_NOLOCKDEP will now behave differently in memory
> reclaim as it will think that PF_LOCAL_THROTTLE is set when lockdep
> is enabled.

Oh, my mistake, it should be 0x01000000 which is not currently being
used. Thanks for catching that. I will repost a new version. I have no
intention to reuse any existing bit. As said in the commit log, there
are actually 2 more free bits left.
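
To make the collision concrete, here is a small user-space sketch. The two values 0x00100000 and 0x01000000 are the ones named in this thread; the macro names with the _V2 and _FIXED suffixes and the helper looks_local_throttled() are made up for the illustration.

```c
#include <stdbool.h>

#define PF_LOCAL_THROTTLE		0x00100000u
#define PF_MEMALLOC_NOLOCKDEP_V2	0x00100000u	/* value as posted in v2 */
#define PF_MEMALLOC_NOLOCKDEP_FIXED	0x01000000u	/* value named in this reply */

/* Hypothetical stand-in for any code that tests PF_LOCAL_THROTTLE. */
static inline bool looks_local_throttled(unsigned int pflags)
{
	return (pflags & PF_LOCAL_THROTTLE) != 0;
}
```

A task that sets only the v2 value is indistinguishable from one that set PF_LOCAL_THROTTLE, which is the behaviour change pointed out in the review. With the corrected value the two tests are independent.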


>
>>   #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>>   #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>>   #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
>> @@ -1519,6 +1520,12 @@ extern struct pid *cad_pid;
>>   #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
>>   #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
>>   
>> +#ifdef CONFIG_LOCKDEP
>> +#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
>> +#else
>> +#define PF_MEMALLOC_NOLOCKDEP	0
>> +#endif
>> +
>>   /*
>>    * Only the _current_ task can read/write to tsk->flags, but other
>>    * tasks can access tsk->flags in readonly mode for example
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 480a4d1b7dd8..4a076a148568 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk)
>>    * Applies per-task gfp context to the given allocation flags.
>>    * PF_MEMALLOC_NOIO implies GFP_NOIO
>>    * PF_MEMALLOC_NOFS implies GFP_NOFS
>> + * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
>>    * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>>    */
>>   static inline gfp_t current_gfp_context(gfp_t flags)
>>   {
>> -	if (unlikely(current->flags &
>> -		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
>> +	unsigned int pflags = current->flags;
>> +
>> +	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
>> +			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {
> That needs a PF_MEMALLOC_MASK.

Will add that in the next version.

Thanks,
Longman

Peter Zijlstra June 22, 2020, 7:16 p.m. UTC | #3
On Thu, Jun 18, 2020 at 10:01:10AM +1000, Dave Chinner wrote:

> And, really, if we are playing "re-use existing bits" games because
> we've run out of process flags, all these memalloc flags should be
> moved to a new field in the task, say current->memalloc_flags. You
> could also move PF_SWAPWRITE, PF_LOCAL_THROTTLE, and PF_KSWAPD into
> that field as well as they are all memory allocation context process
> flags...

FWIW

There are still 23 bits free after task_struct::in_memstall. That word has
'current only' semantics, just like PF.
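
One way to read this suggestion is to carry the state as a one-bit field next to in_memstall instead of as a PF_* flag. The sketch below is a user-space illustration only: struct task_model, the memalloc_nolockdep field, gfp_apply_nolockdep() and the GFP stand-in value are all assumptions, and nothing like this is in the posted patch.

```c
#define GFP_NOLOCKDEP_MODEL 0x800000u	/* stand-in for __GFP_NOLOCKDEP */

struct task_model {
	unsigned int flags;		/* stand-in for the PF_* word, untouched here */
	unsigned in_memstall:1;		/* existing bit named in the message */
	unsigned memalloc_nolockdep:1;	/* hypothetical new bit */
};

/* Hypothetical helper: fold the per-task bit into the allocation flags. */
static inline unsigned int gfp_apply_nolockdep(const struct task_model *tsk,
					       unsigned int gfp)
{
	if (tsk->memalloc_nolockdep)
		gfp |= GFP_NOLOCKDEP_MODEL;
	return gfp;
}
```

The trade-off in this sketch is that the bit no longer competes for space in the PF_* word, but it also cannot be tested together with the other memalloc flags through a single mask.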

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b62e6aaf28f0..44247cbc9073 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1508,6 +1508,7 @@  extern struct pid *cad_pid;
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
 						 * I am cleaning dirty pages from some other bdi. */
+#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
@@ -1519,6 +1520,12 @@  extern struct pid *cad_pid;
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
+#ifdef CONFIG_LOCKDEP
+#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
+#else
+#define PF_MEMALLOC_NOLOCKDEP	0
+#endif
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 480a4d1b7dd8..4a076a148568 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -177,22 +177,27 @@  static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
  * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags &
-		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+	unsigned int pflags = current->flags;
+
+	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
+			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
 		 */
-		if (current->flags & PF_MEMALLOC_NOIO)
+		if (pflags & PF_MEMALLOC_NOIO)
 			flags &= ~(__GFP_IO | __GFP_FS);
-		else if (current->flags & PF_MEMALLOC_NOFS)
+		else if (pflags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
+		if (pflags & PF_MEMALLOC_NOLOCKDEP)
+			flags |= __GFP_NOLOCKDEP;
 #ifdef CONFIG_CMA
-		if (current->flags & PF_MEMALLOC_NOCMA)
+		if (pflags & PF_MEMALLOC_NOCMA)
 			flags &= ~__GFP_MOVABLE;
 #endif
 	}
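
For readers who want to see the combined effect of the hunk above, here is a user-space model of the patched flag translation. It is a sketch under stated assumptions: all bit values are stand-ins (the NOLOCKDEP value is the corrected one from the replies), gfp_context_model() is a made-up name, the task flags are passed in rather than read from current, and the CONFIG_CMA conditional is dropped so the NOCMA branch always applies.

```c
typedef unsigned int gfp_t;

/* Stand-in values for illustration; the kernel headers are authoritative. */
#define PF_MEMALLOC_NOFS	0x00040000u
#define PF_MEMALLOC_NOIO	0x00080000u
#define PF_MEMALLOC_NOLOCKDEP	0x01000000u
#define PF_MEMALLOC_NOCMA	0x10000000u

#define GFP_MOVABLE_MODEL	0x08u
#define GFP_IO_MODEL		0x40u
#define GFP_FS_MODEL		0x80u
#define GFP_NOLOCKDEP_MODEL	0x800000u

/* Model of the patched logic, taking the task flags as a parameter. */
static inline gfp_t gfp_context_model(unsigned int pflags, gfp_t flags)
{
	/* NOIO takes precedence over NOFS because it clears both bits. */
	if (pflags & PF_MEMALLOC_NOIO)
		flags &= ~(GFP_IO_MODEL | GFP_FS_MODEL);
	else if (pflags & PF_MEMALLOC_NOFS)
		flags &= ~GFP_FS_MODEL;
	/* The new branch adds a bit instead of clearing one. */
	if (pflags & PF_MEMALLOC_NOLOCKDEP)
		flags |= GFP_NOLOCKDEP_MODEL;
	if (pflags & PF_MEMALLOC_NOCMA)
		flags &= ~GFP_MOVABLE_MODEL;
	return flags;
}
```

Note that the NOLOCKDEP branch is independent of the NOIO/NOFS branches, so a task can restrict reclaim and suppress lockdep tracking at the same time.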