
mm allocation failure and hang when running xfstests generic/269 on xfs

Message ID 20170302103520.GC1404@dhcp22.suse.cz (mailing list archive)
State Not Applicable, archived

Commit Message

Michal Hocko March 2, 2017, 10:35 a.m. UTC
On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
[...]
> So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> killed") implemented __GFP_KILLABLE flag and automatically applied that
> flag. As a result, those who are not ready to fail upon SIGKILL are
> confused. ;-)

You are right! The function is documented as being allowed to fail, but
the code doesn't really allow that. This seems like a bug to me. What do
you think about the following?
---
From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 2 Mar 2017 11:31:11 +0100
Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail

Even though kmem_zalloc_greedy is documented as being allowed to fail,
the current code doesn't really implement this properly and loops on the
smallest allowed size forever. This is a problem because vzalloc might
fail permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
task is killed") such a failure is much more probable than it used to
be. Fix this by bailing out if the minimum size request fails.

This was noticed via a hung generic/269 xfstest reported by Xiong Zhou.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c | 2 ++
 1 file changed, 2 insertions(+)
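
For context, with the two added lines applied the helper looks roughly
like the sketch below. The loop and the new break come from the hunk shown
under "Patch" at the end of this page; the declarations and the tail of
the function are reconstructed and may differ slightly from the actual
fs/xfs/kmem.c:

void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
{
	void		*ptr;
	size_t		kmsize = maxsize;

	while (!(ptr = vzalloc(kmsize))) {
		/* new: give up once even the minimum-sized request fails */
		if (kmsize == minsize)
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;		/* report the size actually obtained */
	return ptr;
}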

Comments

Michal Hocko March 2, 2017, 12:49 p.m. UTC | #1
On Thu 02-03-17 07:24:27, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > [...]
> > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > confused. ;-)
> > 
> > You are right! The function is documented it might fail but the code
> > doesn't really allow that. This seems like a bug to me. What do you
> > think about the following?
> > ---
> > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > 
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > task is killed") such a failure is much more probable than it used to
> > be. Fix this by bailing out if the minimum size request failed.
> > 
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > 
> > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  	size_t		kmsize = maxsize;
> >  
> >  	while (!(ptr = vzalloc(kmsize))) {
> > +		if (kmsize == minsize)
> > +			break;
> >  		if ((kmsize >>= 1) <= minsize)
> >  			kmsize = minsize;
> >  	}
> 
> More consistent with the rest of the kmem code might be to accept a
> flags argument and do something like this based on KM_MAYFAIL.

Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
the same reason it doesn't support GFP_NOFS. So I am not sure this is a
good idea.

> The one
> current caller looks like it would pass it, but I suppose we'd still
> need a mechanism to break out should a new caller not pass that flag.
> Would a fatal_signal_pending() check in the loop as well allow us to
> break out in the scenario that is reproduced here?

Yes, that check would work as well; I just thought that breaking out
when the minsize request fails was more logical. There might be other
reasons to fail the request, and looping here seems just wrong. But
whatever you or other xfs people prefer.
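
For illustration, the alternative Brian raises - checking for a pending
fatal signal inside the loop - would amount to something like the sketch
below (an illustration of the idea being discussed, not a patch posted in
this thread):

	while (!(ptr = vzalloc(kmsize))) {
		/* stop retrying once the allocating task has been killed */
		if (fatal_signal_pending(current))
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}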
Brian Foster March 2, 2017, 1 p.m. UTC | #2
On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > [...]
> > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > confused. ;-)
> > > 
> > > You are right! The function is documented it might fail but the code
> > > doesn't really allow that. This seems like a bug to me. What do you
> > > think about the following?
> > > ---
> > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > 
> > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > code doesn't really implement this properly and loops on the smallest
> > > allowed size for ever. This is a problem because vzalloc might fail
> > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > task is killed") such a failure is much more probable than it used to
> > > be. Fix this by bailing out if the minimum size request failed.
> > > 
> > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > 
> > > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  fs/xfs/kmem.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 339c696bbc01..ee95f5c6db45 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > >  	size_t		kmsize = maxsize;
> > >  
> > >  	while (!(ptr = vzalloc(kmsize))) {
> > > +		if (kmsize == minsize)
> > > +			break;
> > >  		if ((kmsize >>= 1) <= minsize)
> > >  			kmsize = minsize;
> > >  	}
> > 
> > More consistent with the rest of the kmem code might be to accept a
> > flags argument and do something like this based on KM_MAYFAIL.
> 
> Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> good idea.
> 

Not sure I follow...? I'm just suggesting to control the loop behavior
based on the KM_ flag, not to do or change anything wrt GFP_ flags.

> > The one
> > current caller looks like it would pass it, but I suppose we'd still
> > need a mechanism to break out should a new caller not pass that flag.
> > Would a fatal_signal_pending() check in the loop as well allow us to
> > break out in the scenario that is reproduced here?
> 
> Yes that check would work as well I just thought the break out when the
> minsize request fails to be more logical. There might be other reasons
> to fail the request and looping here seems just wrong. But whatever you
> or other xfs people prefer.

There may be higher-level reasons for why this code should or should not
loop, but that just seems like a separate issue to me. My thinking is more
that this appears to be how every kmem_*() function operates today, and it
seems a bit out of place to change the behavior of one of them to fix a bug.

Maybe I'm missing something though... are we subject to the same general
problem in any of the other kmem_*() functions that can currently loop
indefinitely?

Brian

> -- 
> Michal Hocko
> SUSE Labs
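
For illustration, the flags-based interface Brian is arguing for might
look roughly like the sketch below. The xfs_km_flags_t type and the
KM_MAYFAIL flag do exist in the xfs kmem code, but this signature does
not; it is only meant to spell out the consistency argument, and the
following replies explain why the non-failing branch cannot be made
reliable on top of vmalloc:

void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize,
		   xfs_km_flags_t flags)
{
	void	*ptr;
	size_t	kmsize = maxsize;

	while (!(ptr = vzalloc(kmsize))) {
		/* KM_MAYFAIL callers get NULL once the minimum size fails */
		if (kmsize == minsize && (flags & KM_MAYFAIL))
			break;
		/*
		 * everyone else keeps retrying at minsize -- the infinite
		 * loop this thread is concerned about
		 */
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;
	return ptr;
}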
Tetsuo Handa March 2, 2017, 1:07 p.m. UTC | #3
Brian Foster wrote:
> On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > [...]
> > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > confused. ;-)
> > > > 
> > > > You are right! The function is documented it might fail but the code
> > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > think about the following?
> > > > ---
> > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > > 
> > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > code doesn't really implement this properly and loops on the smallest
> > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > task is killed") such a failure is much more probable than it used to
> > > > be. Fix this by bailing out if the minimum size request failed.
> > > > 
> > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > > 
> > > > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > > > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  fs/xfs/kmem.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > --- a/fs/xfs/kmem.c
> > > > +++ b/fs/xfs/kmem.c
> > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > >  	size_t		kmsize = maxsize;
> > > >  
> > > >  	while (!(ptr = vzalloc(kmsize))) {
> > > > +		if (kmsize == minsize)
> > > > +			break;
> > > >  		if ((kmsize >>= 1) <= minsize)
> > > >  			kmsize = minsize;
> > > >  	}
> > > 
> > > More consistent with the rest of the kmem code might be to accept a
> > > flags argument and do something like this based on KM_MAYFAIL.
> > 
> > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > good idea.
> > 
> 
> Not sure I follow..? I'm just suggesting to control the loop behavior
> based on the KM_ flag, not to do or change anything wrt to GFP_ flags.

vmalloc() cannot support __GFP_NOFAIL. Therefore, kmem_zalloc_greedy()
cannot support !KM_MAYFAIL. We will change kmem_zalloc_greedy() not to
use vmalloc() when we need to support !KM_MAYFAIL. That won't be now.

> 
> > > The one
> > > current caller looks like it would pass it, but I suppose we'd still
> > > need a mechanism to break out should a new caller not pass that flag.
> > > Would a fatal_signal_pending() check in the loop as well allow us to
> > > break out in the scenario that is reproduced here?
> > 
> > Yes that check would work as well I just thought the break out when the
> > minsize request fails to be more logical. There might be other reasons
> > to fail the request and looping here seems just wrong. But whatever you
> > or other xfs people prefer.
> 
> There may be higher level reasons for why this code should or should not
> loop, that just seems like a separate issue to me. My thinking is more
> that this appears to be how every kmem_*() function operates today and
> it seems a bit out of place to change behavior of one to fix a bug.
> 
> Maybe I'm missing something though.. are we subject to the same general
> problem in any of the other kmem_*() functions that can currently loop
> indefinitely?
> 

The kmem code looks to me like a source of problems. For example,
kmem_alloc() called from RESCUER workqueue threads got stuck doing a
GFP_NOFS allocation. Due to __GFP_NOWARN, warn_alloc() cannot warn about
allocation stalls. Due to order <= PAGE_ALLOC_COSTLY_ORDER, the
"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)"
message cannot be printed because __alloc_pages_nodemask() does not
return. And because GFP_NOFS carries neither __GFP_HIGH nor __GFP_NOFAIL,
memory cannot be allocated for the RESCUER threads which are trying to
allocate memory in order to reclaim memory; the result is a silent lockup.
(I must stop here, for this is a different thread.)
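
The retry pattern referred to above is roughly the one below - a
paraphrase of the xfs kmem_alloc() helper of that era, not the verbatim
source. The point is that the deadlock warning can only be issued between
failed attempts, so it never appears when __alloc_pages_nodemask() itself
does not return:

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL | KM_NOSLEEP)))
			return ptr;
		/* only reached if the attempt above actually returned NULL */
		if (!(++retries % 100))
			xfs_err(NULL,
	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
				current->comm, current->pid,
				(unsigned int)size, __func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	} while (1);
}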

----------------------------------------
[ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
[ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
[ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
[ 1095.633665] Call Trace:
[ 1095.633668]  __schedule+0x336/0xe00
[ 1095.633671]  schedule+0x3d/0x90
[ 1095.633672]  schedule_timeout+0x20d/0x510
[ 1095.633675]  ? lock_timer_base+0xa0/0xa0
[ 1095.633678]  schedule_timeout_uninterruptible+0x2a/0x30
[ 1095.633680]  __alloc_pages_slowpath+0x2b5/0xd95
[ 1095.633687]  __alloc_pages_nodemask+0x3e4/0x460
[ 1095.633699]  alloc_pages_current+0x97/0x1b0
[ 1095.633702]  new_slab+0x4cb/0x6b0
[ 1095.633706]  ___slab_alloc+0x3a3/0x620
[ 1095.633728]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633730]  ? ___slab_alloc+0x5c6/0x620
[ 1095.633732]  ? cpuacct_charge+0x38/0x1e0
[ 1095.633767]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633770]  __slab_alloc+0x46/0x7d
[ 1095.633773]  __kmalloc+0x301/0x3b0
[ 1095.633802]  kmem_alloc+0x96/0x120 [xfs]
[ 1095.633804]  ? kfree+0x1fa/0x330
[ 1095.633842]  xfs_log_commit_cil+0x489/0x710 [xfs]
[ 1095.633864]  __xfs_trans_commit+0x83/0x260 [xfs]
[ 1095.633883]  xfs_trans_commit+0x10/0x20 [xfs]
[ 1095.633901]  __xfs_setfilesize+0xdb/0x240 [xfs]
[ 1095.633936]  xfs_setfilesize_ioend+0x89/0xb0 [xfs]
[ 1095.633954]  ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
[ 1095.633971]  xfs_end_io+0x81/0x110 [xfs]
[ 1095.633973]  process_one_work+0x22b/0x760
[ 1095.633975]  ? process_one_work+0x194/0x760
[ 1095.633997]  rescuer_thread+0x1f2/0x3d0
[ 1095.634002]  kthread+0x10f/0x150
[ 1095.634003]  ? worker_thread+0x4b0/0x4b0
[ 1095.634004]  ? kthread_create_on_node+0x70/0x70
[ 1095.634007]  ret_from_fork+0x31/0x40
[ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
[ 1095.634014] xfs-eofblocks/s R  running task    12032   456      2 0x00000000
[ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
[ 1095.634038] Call Trace:
[ 1095.634040]  ? _raw_spin_lock+0x3d/0x80
[ 1095.634042]  ? vmpressure+0xd0/0x120
[ 1095.634044]  ? vmpressure+0xd0/0x120
[ 1095.634047]  ? vmpressure_prio+0x21/0x30
[ 1095.634049]  ? do_try_to_free_pages+0x70/0x300
[ 1095.634052]  ? try_to_free_pages+0x131/0x3f0
[ 1095.634058]  ? __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.634065]  ? __alloc_pages_nodemask+0x3e4/0x460
[ 1095.634069]  ? alloc_pages_current+0x97/0x1b0
[ 1095.634111]  ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
[ 1095.634133]  ? xfs_buf_get_map+0x2be/0x480 [xfs]
[ 1095.634169]  ? xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1095.634204]  ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1095.634222]  ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
[ 1095.634239]  ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
[ 1095.634257]  ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
[ 1095.634296]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634299]  ? _raw_spin_unlock+0x27/0x40
[ 1095.634315]  ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
[ 1095.634348]  ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
[ 1095.634380]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634400]  ? __xfs_bunmapi+0x4db/0xda0 [xfs]
[ 1095.634421]  ? xfs_bunmapi+0x2b/0x40 [xfs]
[ 1095.634459]  ? xfs_itruncate_extents+0x1df/0x780 [xfs]
[ 1095.634502]  ? xfs_rename+0xc70/0x1080 [xfs]
[ 1095.634525]  ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
[ 1095.634546]  ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
[ 1095.634565]  ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
[ 1095.634582]  ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
[ 1095.634618]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634630]  ? radix_tree_next_chunk+0x10b/0x2d0
[ 1095.634635]  ? radix_tree_gang_lookup_tag+0xd7/0x150
[ 1095.634672]  ? xfs_perag_get_tag+0x11d/0x370 [xfs]
[ 1095.634690]  ? xfs_perag_get_tag+0x5/0x370 [xfs]
[ 1095.634709]  ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
[ 1095.634726]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634744]  ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
[ 1095.634759]  ? xfs_eofblocks_worker+0x27/0x40 [xfs]
[ 1095.634762]  ? process_one_work+0x22b/0x760
[ 1095.634763]  ? process_one_work+0x194/0x760
[ 1095.634784]  ? rescuer_thread+0x1f2/0x3d0
[ 1095.634788]  ? kthread+0x10f/0x150
[ 1095.634789]  ? worker_thread+0x4b0/0x4b0
[ 1095.634790]  ? kthread_create_on_node+0x70/0x70
[ 1095.634793]  ? ret_from_fork+0x31/0x40
----------------------------------------

> Brian
> 
> > -- 
> > Michal Hocko
> > SUSE Labs
Michal Hocko March 2, 2017, 1:27 p.m. UTC | #4
On Thu 02-03-17 08:00:09, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > [...]
> > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > confused. ;-)
> > > > 
> > > > You are right! The function is documented it might fail but the code
> > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > think about the following?
> > > > ---
> > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > > 
> > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > code doesn't really implement this properly and loops on the smallest
> > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > task is killed") such a failure is much more probable than it used to
> > > > be. Fix this by bailing out if the minimum size request failed.
> > > > 
> > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > > 
> > > > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > > > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  fs/xfs/kmem.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > --- a/fs/xfs/kmem.c
> > > > +++ b/fs/xfs/kmem.c
> > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > >  	size_t		kmsize = maxsize;
> > > >  
> > > >  	while (!(ptr = vzalloc(kmsize))) {
> > > > +		if (kmsize == minsize)
> > > > +			break;
> > > >  		if ((kmsize >>= 1) <= minsize)
> > > >  			kmsize = minsize;
> > > >  	}
> > > 
> > > More consistent with the rest of the kmem code might be to accept a
> > > flags argument and do something like this based on KM_MAYFAIL.
> > 
> > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > good idea.
> > 
> 
> Not sure I follow..? I'm just suggesting to control the loop behavior
> based on the KM_ flag, not to do or change anything wrt to GFP_ flags.

As Tetsuo already pointed out, vmalloc cannot really support never-fail
semantics with the current implementation, so those semantics would have
to be implemented in kmem_zalloc_greedy, and the only way to do that
would be to loop there. That is rather nasty, as you can see from the
reported issue, because the vmalloc failure might be permanent, so there
would be no way to make forward progress. Breaking out of the loop on
fatal_signal_pending would break the non-failing semantics.

Besides that, there doesn't really seem to be any demand for these
semantics in the first place, so why make this more complicated than
necessary?

I see your argument about being in sync with other kmem helpers, but
those are a bit different because the regular page/slab allocators allow
never-fail semantics (even though this is mostly ignored by those helpers,
which implement their own retries, but that is a different topic).
Brian Foster March 2, 2017, 1:41 p.m. UTC | #5
On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 08:00:09, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > > confused. ;-)
> > > > > 
> > > > > You are right! The function is documented it might fail but the code
> > > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > > think about the following?
> > > > > ---
> > > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > > > 
> > > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > > code doesn't really implement this properly and loops on the smallest
> > > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > > task is killed") such a failure is much more probable than it used to
> > > > > be. Fix this by bailing out if the minimum size request failed.
> > > > > 
> > > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > > > 
> > > > > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > > > > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > ---
> > > > >  fs/xfs/kmem.c | 2 ++
> > > > >  1 file changed, 2 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > > --- a/fs/xfs/kmem.c
> > > > > +++ b/fs/xfs/kmem.c
> > > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > > >  	size_t		kmsize = maxsize;
> > > > >  
> > > > >  	while (!(ptr = vzalloc(kmsize))) {
> > > > > +		if (kmsize == minsize)
> > > > > +			break;
> > > > >  		if ((kmsize >>= 1) <= minsize)
> > > > >  			kmsize = minsize;
> > > > >  	}
> > > > 
> > > > More consistent with the rest of the kmem code might be to accept a
> > > > flags argument and do something like this based on KM_MAYFAIL.
> > > 
> > > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > > good idea.
> > > 
> > 
> > Not sure I follow..? I'm just suggesting to control the loop behavior
> > based on the KM_ flag, not to do or change anything wrt to GFP_ flags.
> 
> As Tetsuo already pointed out, vmalloc cannot really support never-fail
> semantic with the current implementation so the semantic would have
> to be implemented in kmem_zalloc_greedy and the only way to do that
> would be to loop there and this is rather nasty as you can see from the
> reported issue because the vmalloc failure might be permanent so there
> won't be any way to make a forward progress. Breaking out of the loop
> on fatal_signal_pending pending would break the non-failing sementic.
> 

Sure..

> Besides that, there doesn't really seem to be any demand for this
> semantic in the first place so why to make this more complicated than
> necessary?
> 

That may very well be the case. I'm not necessarily against this...

> I see your argument about being in sync with other kmem helpers but
> those are bit different because regular page/slab allocators allow never
> fail semantic (even though this is mostly ignored by those helpers which
> implement their own retries but that is a different topic).
> 

... but what I'm trying to understand here is whether this failure
scenario is specific to vmalloc() or whether the other kmem_*()
functions are susceptible to the same problem. For example, suppose we
replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
KM_SLEEP) call. Could we hit the same problem if the process is killed?

Brian
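
Spelled out, the thought experiment is the hypothetical swap below; the
real xfs_bulkstat() call site and its sizes are not shown in this thread,
so the buffer name and size here are placeholders:

	/*
	 * hypothetical: a fixed-size, kmalloc-backed allocation instead of
	 * the greedy vzalloc-backed one; KM_SLEEP (i.e. no KM_MAYFAIL)
	 * makes kmem_zalloc() keep retrying internally until it succeeds
	 */
	buf = kmem_zalloc(PAGE_SIZE, KM_SLEEP);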

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko March 2, 2017, 1:50 p.m. UTC | #6
On Thu 02-03-17 08:41:58, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
[...]
> > I see your argument about being in sync with other kmem helpers but
> > those are bit different because regular page/slab allocators allow never
> > fail semantic (even though this is mostly ignored by those helpers which
> > implement their own retries but that is a different topic).
> > 
> 
> ... but what I'm trying to understand here is whether this failure
> scenario is specific to vmalloc() or whether the other kmem_*()
> functions are susceptible to the same problem. For example, suppose we
> replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> KM_SLEEP) call. Could we hit the same problem if the process is killed?

Well, kmem_zalloc uses kmalloc, which can also fail when we are out of
memory, but in that case we can expect the OOM killer to release some
memory, which would allow us to make forward progress on the next retry.
So essentially retrying around kmalloc is much safer in this regard. A
vmalloc failure might be permanent, either because there is no vmalloc
space to allocate from or, much more likely, due to the already mentioned
patch. So vmalloc is different, really.
Brian Foster March 2, 2017, 2:23 p.m. UTC | #7
On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> [...]
> > > I see your argument about being in sync with other kmem helpers but
> > > those are bit different because regular page/slab allocators allow never
> > > fail semantic (even though this is mostly ignored by those helpers which
> > > implement their own retries but that is a different topic).
> > > 
> > 
> > ... but what I'm trying to understand here is whether this failure
> > scenario is specific to vmalloc() or whether the other kmem_*()
> > functions are susceptible to the same problem. For example, suppose we
> > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> 
> Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> memory but in that case we can expect the OOM killer releasing some
> memory which would allow us to make a forward progress on the next
> retry. So essentially retrying around kmalloc is much more safe in this
> regard. Failing vmalloc might be permanent because there is no vmalloc
> space to allocate from or much more likely due to already mentioned
> patch. So vmalloc is different, really.

Right... that's why I'm asking. So it's technically possible but highly
unlikely due to the different failure characteristics. That seems
reasonable to me, then.

To be clear, do we understand what causes the vzalloc() failure to be
effectively permanent in this specific reproducer? I know you mention
above that we could be out of vmalloc space, but that doesn't clarify
whether there are other potential failure paths, or what this has to do
with the fact that the process was killed. Does the pending signal cause
the subsequent failures, or are you saying that there is some other root
cause of the failure, that this process would effectively be spinning
here anyway, and that we're just noticing it because it's trying to exit?

Brian

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko March 2, 2017, 2:34 p.m. UTC | #8
On Thu 02-03-17 09:23:15, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > [...]
> > > > I see your argument about being in sync with other kmem helpers but
> > > > those are bit different because regular page/slab allocators allow never
> > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > implement their own retries but that is a different topic).
> > > > 
> > > 
> > > ... but what I'm trying to understand here is whether this failure
> > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > functions are susceptible to the same problem. For example, suppose we
> > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > 
> > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > memory but in that case we can expect the OOM killer releasing some
> > memory which would allow us to make a forward progress on the next
> > retry. So essentially retrying around kmalloc is much more safe in this
> > regard. Failing vmalloc might be permanent because there is no vmalloc
> > space to allocate from or much more likely due to already mentioned
> > patch. So vmalloc is different, really.
> 
> Right.. that's why I'm asking. So it's technically possible but highly
> unlikely due to the different failure characteristics. That seems
> reasonable to me, then. 
> 
> To be clear, do we understand what causes the vzalloc() failure to be
> effectively permanent in this specific reproducer? I know you mention
> above that we could be out of vmalloc space, but that doesn't clarify
> whether there are other potential failure paths or then what this has to
> do with the fact that the process was killed. Does the pending signal
> cause the subsequent failures or are you saying that there is some other
> root cause of the failure, this process would effectively be spinning
> here anyways, and we're just noticing it because it's trying to exit?

In this particular case it is fatal_signal_pending that causes the
permanent failure. This check has been added to prevent complete
depletion of the memory reserves on OOM, because a killed task has a
free ticket to the reserves and vmalloc requests can be really large.
In this case there was no OOM killer going on, but fsstress had SIGKILL
pending for another reason, most probably as a result of the group exit
when all threads are killed (see zap_process). I could have turned
fatal_signal_pending into tsk_is_oom_victim, which would be less likely
to hit, but in principle fatal_signal_pending should be better because
we do want to bail out as soon as possible when the process is exiting.

What I really wanted to say is that there are other possible permanent
failure paths in vmalloc AFAICS. They are much less probable but they
still exist.

Does that make more sense now?
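
For reference, the behaviour described here boils down to a check of
roughly this shape in vmalloc's page-population loop - a conceptual
paraphrase of 5d17a73a2ebe ("vmalloc: back off when the current task is
killed"), not the exact upstream hunk:

	/* inside __vmalloc_area_node(), while allocating backing pages */
	for (i = 0; i < area->nr_pages; i++) {
		struct page *page;

		/* a killed task must not keep draining memory reserves */
		if (fatal_signal_pending(current))
			goto fail;

		page = alloc_page(gfp_mask);
		if (!page)
			goto fail;
		area->pages[i] = page;
	}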
Brian Foster March 2, 2017, 2:51 p.m. UTC | #9
On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > I see your argument about being in sync with other kmem helpers but
> > > > > those are bit different because regular page/slab allocators allow never
> > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > implement their own retries but that is a different topic).
> > > > > 
> > > > 
> > > > ... but what I'm trying to understand here is whether this failure
> > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > functions are susceptible to the same problem. For example, suppose we
> > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > 
> > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > memory but in that case we can expect the OOM killer releasing some
> > > memory which would allow us to make a forward progress on the next
> > > retry. So essentially retrying around kmalloc is much more safe in this
> > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > space to allocate from or much more likely due to already mentioned
> > > patch. So vmalloc is different, really.
> > 
> > Right.. that's why I'm asking. So it's technically possible but highly
> > unlikely due to the different failure characteristics. That seems
> > reasonable to me, then. 
> > 
> > To be clear, do we understand what causes the vzalloc() failure to be
> > effectively permanent in this specific reproducer? I know you mention
> > above that we could be out of vmalloc space, but that doesn't clarify
> > whether there are other potential failure paths or then what this has to
> > do with the fact that the process was killed. Does the pending signal
> > cause the subsequent failures or are you saying that there is some other
> > root cause of the failure, this process would effectively be spinning
> > here anyways, and we're just noticing it because it's trying to exit?
> 
> In this particular case it is fatal_signal_pending that causes the
> permanent failure. This check has been added to prevent from complete
> memory reserves depletion on OOM when a killed task has a free ticket to
> reserves and vmalloc requests can be really large. In this case there
> was no OOM killer going on but fsstress has SIGKILL pending for other
> reason. Most probably as a result of the group_exit when all threads
> are killed (see zap_process). I could have turn fatal_signal_pending
> into tsk_is_oom_victim which would be less likely to hit but in
> principle fatal_signal_pending should be better because we do want to
> bail out when the process is existing as soon as possible.
> 
> What I really wanted to say is that there are other possible permanent
> failure paths in vmalloc AFAICS. They are much less probable but they
> still exist.
> 
> Does that make more sense now?

Yes, thanks. That explains why this crops up now where it hasn't in the
past. Please include that background in the commit log description.

Also, that kind of makes me think that a fatal_signal_pending() check is
still appropriate in the loop, even if we want to drop the infinite
retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
however many retries are left before we return and that's also more
explicit for the next person who goes to change this code in the future.

Otherwise, I'm fine with breaking the infinite retry loop at the same
time. It looks like Christoph added this function originally so this
should probably require his ack as well..

Brian

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko March 2, 2017, 3:14 p.m. UTC | #10
On Thu 02-03-17 09:51:31, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > implement their own retries but that is a different topic).
> > > > > > 
> > > > > 
> > > > > ... but what I'm trying to understand here is whether this failure
> > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > > 
> > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > memory but in that case we can expect the OOM killer releasing some
> > > > memory which would allow us to make a forward progress on the next
> > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > space to allocate from or much more likely due to already mentioned
> > > > patch. So vmalloc is different, really.
> > > 
> > > Right.. that's why I'm asking. So it's technically possible but highly
> > > unlikely due to the different failure characteristics. That seems
> > > reasonable to me, then. 
> > > 
> > > To be clear, do we understand what causes the vzalloc() failure to be
> > > effectively permanent in this specific reproducer? I know you mention
> > > above that we could be out of vmalloc space, but that doesn't clarify
> > > whether there are other potential failure paths or then what this has to
> > > do with the fact that the process was killed. Does the pending signal
> > > cause the subsequent failures or are you saying that there is some other
> > > root cause of the failure, this process would effectively be spinning
> > > here anyways, and we're just noticing it because it's trying to exit?
> > 
> > In this particular case it is fatal_signal_pending that causes the
> > permanent failure. This check has been added to prevent from complete
> > memory reserves depletion on OOM when a killed task has a free ticket to
> > reserves and vmalloc requests can be really large. In this case there
> > was no OOM killer going on but fsstress has SIGKILL pending for other
> > reason. Most probably as a result of the group_exit when all threads
> > are killed (see zap_process). I could have turn fatal_signal_pending
> > into tsk_is_oom_victim which would be less likely to hit but in
> > principle fatal_signal_pending should be better because we do want to
> > bail out when the process is existing as soon as possible.
> > 
> > What I really wanted to say is that there are other possible permanent
> > failure paths in vmalloc AFAICS. They are much less probable but they
> > still exist.
> > 
> > Does that make more sense now?
> 
> Yes, thanks. That explains why this crops up now where it hasn't in the
> past. Please include that background in the commit log description.

OK, does this sound better? I am open to any suggestions to improve
this, of course:

: xfs: allow kmem_zalloc_greedy to fail
: 
: Even though kmem_zalloc_greedy is documented as being allowed to fail,
: the current code doesn't really implement this properly and loops on the
: smallest allowed size forever. This is a problem because vzalloc might
: fail permanently - we might run out of vmalloc space, or, since
: 5d17a73a2ebe ("vmalloc: back off when the current task is killed"), when
: the current task is killed. The latter makes the failure scenario much
: more probable than it used to be. Fix this by bailing out if the minimum
: size request fails.
: 
: This was noticed via a hung generic/269 xfstest reported by Xiong Zhou.
: 
: fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
: fsstress cpuset=/ mems_allowed=0-1
: CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
: Call Trace:
:  dump_stack+0x63/0x87
:  warn_alloc+0x114/0x1c0
:  ? alloc_pages_current+0x88/0x120
:  __vmalloc_node_range+0x250/0x2a0
:  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
:  ? free_hot_cold_page+0x21f/0x280
:  vzalloc+0x54/0x60
:  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
:  kmem_zalloc_greedy+0x2b/0x40 [xfs]
:  xfs_bulkstat+0x11b/0x730 [xfs]
:  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
:  ? selinux_capable+0x20/0x30
:  ? security_capable+0x48/0x60
:  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
:  xfs_file_ioctl+0x9dd/0xad0 [xfs]
:  ? do_filp_open+0xa5/0x100
:  do_vfs_ioctl+0xa7/0x5e0
:  SyS_ioctl+0x79/0x90
:  do_syscall_64+0x67/0x180
:  entry_SYSCALL64_slow_path+0x25/0x25
: 
: fsstress keeps looping inside kmem_zalloc_greedy without any way out
: because vmalloc keeps failing due to fatal_signal_pending.
: 
: Reported-by: Xiong Zhou <xzhou@redhat.com>
: Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
: Signed-off-by: Michal Hocko <mhocko@suse.com>

> Also, that kind of makes me think that a fatal_signal_pending() check is
> still appropriate in the loop, even if we want to drop the infinite
> retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> however many retries are left before we return and that's also more
> explicit for the next person who goes to change this code in the future.

I am not objecting to adding fatal_signal_pending as well; I just thought
that, from the logic POV, breaking after reaching the minimum size is just
the right thing to do. We can optimize further by checking
fatal_signal_pending and reducing retries when we know they don't make
much sense, but that should be done on top as an optimization IMHO.

> Otherwise, I'm fine with breaking the infinite retry loop at the same
> time. It looks like Christoph added this function originally so this
> should probably require his ack as well..

What do you think, Christoph?
Brian Foster March 2, 2017, 3:30 p.m. UTC | #11
On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:51:31, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > > implement their own retries but that is a different topic).
> > > > > > > 
> > > > > > 
> > > > > > ... but what I'm trying to understand here is whether this failure
> > > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > > > 
> > > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > > memory but in that case we can expect the OOM killer releasing some
> > > > > memory which would allow us to make a forward progress on the next
> > > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > > space to allocate from or much more likely due to already mentioned
> > > > > patch. So vmalloc is different, really.
> > > > 
> > > > Right.. that's why I'm asking. So it's technically possible but highly
> > > > unlikely due to the different failure characteristics. That seems
> > > > reasonable to me, then. 
> > > > 
> > > > To be clear, do we understand what causes the vzalloc() failure to be
> > > > effectively permanent in this specific reproducer? I know you mention
> > > > above that we could be out of vmalloc space, but that doesn't clarify
> > > > whether there are other potential failure paths or then what this has to
> > > > do with the fact that the process was killed. Does the pending signal
> > > > cause the subsequent failures or are you saying that there is some other
> > > > root cause of the failure, this process would effectively be spinning
> > > > here anyways, and we're just noticing it because it's trying to exit?
> > > 
> > > In this particular case it is fatal_signal_pending that causes the
> > > permanent failure. This check has been added to prevent from complete
> > > memory reserves depletion on OOM when a killed task has a free ticket to
> > > reserves and vmalloc requests can be really large. In this case there
> > > was no OOM killer going on but fsstress has SIGKILL pending for other
> > > reason. Most probably as a result of the group_exit when all threads
> > > are killed (see zap_process). I could have turn fatal_signal_pending
> > > into tsk_is_oom_victim which would be less likely to hit but in
> > > principle fatal_signal_pending should be better because we do want to
> > > bail out when the process is existing as soon as possible.
> > > 
> > > What I really wanted to say is that there are other possible permanent
> > > failure paths in vmalloc AFAICS. They are much less probable but they
> > > still exist.
> > > 
> > > Does that make more sense now?
> > 
> > Yes, thanks. That explains why this crops up now where it hasn't in the
> > past. Please include that background in the commit log description.
> 
> OK, does this sound better. I am open to any suggestions to improve this
> of course
> 

Yeah..

> : xfs: allow kmem_zalloc_greedy to fail
> : 
> : Even though kmem_zalloc_greedy is documented it might fail the current
> : code doesn't really implement this properly and loops on the smallest
> : allowed size for ever. This is a problem because vzalloc might fail
> : permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> : ("vmalloc: back off when the current task is killed") when the current
> : task is killed. The later one makes the failure scenario much more
> : probable than it used to be. Fix this by bailing out if the minimum size
                               ^
	" because it makes vmalloc() failures permanent for tasks with fatal
	  signals pending."

> : request failed.
> : 
> : This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> : 
> : fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> : fsstress cpuset=/ mems_allowed=0-1
> : CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> : Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> : Call Trace:
> :  dump_stack+0x63/0x87
> :  warn_alloc+0x114/0x1c0
> :  ? alloc_pages_current+0x88/0x120
> :  __vmalloc_node_range+0x250/0x2a0
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  ? free_hot_cold_page+0x21f/0x280
> :  vzalloc+0x54/0x60
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  xfs_bulkstat+0x11b/0x730 [xfs]
> :  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> :  ? selinux_capable+0x20/0x30
> :  ? security_capable+0x48/0x60
> :  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> :  xfs_file_ioctl+0x9dd/0xad0 [xfs]
> :  ? do_filp_open+0xa5/0x100
> :  do_vfs_ioctl+0xa7/0x5e0
> :  SyS_ioctl+0x79/0x90
> :  do_syscall_64+0x67/0x180
> :  entry_SYSCALL64_slow_path+0x25/0x25
> : 
> : fsstress keeps looping inside kmem_zalloc_greedy without any way out
> : because vmalloc keeps failing due to fatal_signal_pending.
> : 
> : Reported-by: Xiong Zhou <xzhou@redhat.com>
> : Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> : Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> > Also, that kind of makes me think that a fatal_signal_pending() check is
> > still appropriate in the loop, even if we want to drop the infinite
> > retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> > however many retries are left before we return and that's also more
> > explicit for the next person who goes to change this code in the future.
> 
> I am not objecting to adding fatal_signal_pending as well I just thought
> that from the logic POV breaking after reaching the minimum size is just
> the right thing to do. We can optimize further by checking
> fatal_signal_pending and reducing retries when we know it doesn't make
> much sense but that should be done on top as an optimization IMHO.
> 

I don't think of it as an optimization to not invoke calls (a
non-deterministic number of times) that we know are going to fail, but
ultimately I don't care too much how it's framed or if it's done in
separate patches or whatnot. As long as they are posted at the same
time. ;)

Brian

> > Otherwise, I'm fine with breaking the infinite retry loop at the same
> > time. It looks like Christoph added this function originally so this
> > should probably require his ack as well..
> 
> What do you think Christoph?
> -- 
> Michal Hocko
> SUSE Labs
Christoph Hellwig March 2, 2017, 3:47 p.m. UTC | #12
On Thu, Mar 02, 2017 at 09:51:31AM -0500, Brian Foster wrote:
> Otherwise, I'm fine with breaking the infinite retry loop at the same
> time. It looks like Christoph added this function originally so this
> should probably require his ack as well..

I just moved the code around, but I'll take a look as well.
Michal Hocko March 2, 2017, 3:47 p.m. UTC | #13
On Thu 02-03-17 10:30:02, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
[...]
> > I am not objecting to adding fatal_signal_pending as well I just thought
> > that from the logic POV breaking after reaching the minimum size is just
> > the right thing to do. We can optimize further by checking
> > fatal_signal_pending and reducing retries when we know it doesn't make
> > much sense but that should be done on top as an optimization IMHO.
> > 
> 
> I don't think of it as an optimization to not invoke calls (a
> non-deterministic number of times) that we know are going to fail, but

The point is that vmalloc's failure modes are an implementation detail
which might change in the future. The fix should really be independent of
the current implementation; that is why I think the fatal_signal_pending
check is just an optimization.

> ultimately I don't care too much how it's framed or if it's done in
> separate patches or whatnot. As long as they are posted at the same
> time. ;)

Done

Patch

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..ee95f5c6db45 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -34,6 +34,8 @@  kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
 	size_t		kmsize = maxsize;
 
 	while (!(ptr = vzalloc(kmsize))) {
+		if (kmsize == minsize)
+			break;
 		if ((kmsize >>= 1) <= minsize)
 			kmsize = minsize;
 	}
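
For completeness, combining the break applied above with the
fatal_signal_pending() check discussed in the thread would make the loop
look roughly like this (illustrative only; the follow-up patch agreed on
in the thread may differ):

	while (!(ptr = vzalloc(kmsize))) {
		if (fatal_signal_pending(current) || kmsize == minsize)
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}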