[1/2] xfs: allow kmem_zalloc_greedy to fail

Message ID 20170302154541.16155-1-mhocko@kernel.org (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Michal Hocko March 2, 2017, 3:45 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Even though kmem_zalloc_greedy is documented as an allocation that may
fail, the current code doesn't really implement that and instead loops
on the smallest allowed size forever. This is a problem because vzalloc
might fail permanently - we might run out of vmalloc space, or, since
5d17a73a2ebe ("vmalloc: back off when the current task is killed"), the
current task might have been killed. The latter makes the failure
scenario much more likely than it used to be because it makes vmalloc()
failures permanent for tasks with fatal signals pending. Fix this by
bailing out if even the minimum size request fails.

This was noticed by Xiong Zhou as a hang in the generic/269 xfstest.

fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
fsstress cpuset=/ mems_allowed=0-1
CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
Call Trace:
 dump_stack+0x63/0x87
 warn_alloc+0x114/0x1c0
 ? alloc_pages_current+0x88/0x120
 __vmalloc_node_range+0x250/0x2a0
 ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
 ? free_hot_cold_page+0x21f/0x280
 vzalloc+0x54/0x60
 ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
 kmem_zalloc_greedy+0x2b/0x40 [xfs]
 xfs_bulkstat+0x11b/0x730 [xfs]
 ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
 ? selinux_capable+0x20/0x30
 ? security_capable+0x48/0x60
 xfs_ioc_bulkstat+0xe4/0x190 [xfs]
 xfs_file_ioctl+0x9dd/0xad0 [xfs]
 ? do_filp_open+0xa5/0x100
 do_vfs_ioctl+0xa7/0x5e0
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x67/0x180
 entry_SYSCALL64_slow_path+0x25/0x25

fsstress keeps looping inside kmem_zalloc_greedy without any way out
because vmalloc keeps failing due to fatal_signal_pending.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c | 2 ++
 1 file changed, 2 insertions(+)
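
For reference, the whole helper after this change looks roughly like the
following (a sketch reconstructed from the hunk in the patch below; the
trailing *size write-back is assumed from the unchanged remainder of
fs/xfs/kmem.c):

void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
{
	void		*ptr;
	size_t		kmsize = maxsize;

	/* Walk down from maxsize towards minsize on each failure... */
	while (!(ptr = vzalloc(kmsize))) {
		/* ...and give up once even minsize has failed. */
		if (kmsize == minsize)
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;
	return ptr;
}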

Comments

Christoph Hellwig March 2, 2017, 3:49 p.m. UTC | #1
Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Brian Foster March 2, 2017, 3:59 p.m. UTC | #2
On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> [...]
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>

Michal Hocko March 2, 2017, 4:16 p.m. UTC | #3
I've just realized that Darrick was not on the CC list. Let's add him.
I believe this patch should go in in the current cycle because
5d17a73a2ebe was merged in this merge window and it can be abused...

The other patch [1] is not that urgent.

[1] http://lkml.kernel.org/r/20170302154541.16155-2-mhocko@kernel.org

On Thu 02-03-17 16:45:40, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> [...]
Darrick J. Wong March 2, 2017, 4:44 p.m. UTC | #4
On Thu, Mar 02, 2017 at 05:16:06PM +0100, Michal Hocko wrote:
> I've just realized that Darrick was not on the CC list. Let's add him.
> I believe this patch should go in in the current cycle because
> 5d17a73a2ebe was merged in this merge window and it can be abused...
> 
> The other patch [1] is not that urgent.
> 
> [1] http://lkml.kernel.org/r/20170302154541.16155-2-mhocko@kernel.org

Both patches look ok to me.  I'll take both patches for rc2.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

(Annoyingly I missed the whole thread yesterday due to vger slowness, in
case anyone was wondering why I didn't reply.)

--D

Dave Chinner March 3, 2017, 10:54 p.m. UTC | #5
On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> [...]
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
>  	size_t		kmsize = maxsize;
>  
>  	while (!(ptr = vzalloc(kmsize))) {
> +		if (kmsize == minsize)
> +			break;
>  		if ((kmsize >>= 1) <= minsize)
>  			kmsize = minsize;
>  	}

Seems wrong to me - this function used to have lots of callers and
over time we've slowly removed them or replaced them with something
else. I'd suggest removing it completely, replacing the call sites
with kmem_zalloc_large().
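
(A minimal sketch of what that replacement might look like at the sole
remaining call site in xfs_bulkstat(); the fixed four-page size and the
KM_SLEEP flag are illustrative assumptions, not something proposed in
this thread:)

	irbuf = kmem_zalloc_large(PAGE_SIZE * 4, KM_SLEEP);
	if (!irbuf)
		return -ENOMEM;
	irbsize = PAGE_SIZE * 4;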

Cheers,

Dave.
Darrick J. Wong March 3, 2017, 11:19 p.m. UTC | #6
On Sat, Mar 04, 2017 at 09:54:44AM +1100, Dave Chinner wrote:
> On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > [...]
> 
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().

Heh.  I thought the reason why _greedy still exists (for its sole user
bulkstat) is that bulkstat had the flexibility to deal with receiving
0, 1, or 4 pages.  So yeah, we could just kill it.
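
(For reference, the greedy call in xfs_bulkstat() is roughly the
following - it asks for at most four pages and at least one, and the
NULL return is the "0 pages" case:)

	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
	if (!irbuf)
		return -ENOMEM;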

But thinking even more stingily about memory, are there applications
that care about being able to bulkstat 16384 inodes at once?  How badly
does bulkstat need to be able to bulk-process more than a page's worth
of inobt records, anyway?

--D

Dave Chinner March 4, 2017, 4:48 a.m. UTC | #7
On Fri, Mar 03, 2017 at 03:19:12PM -0800, Darrick J. Wong wrote:
> On Sat, Mar 04, 2017 at 09:54:44AM +1100, Dave Chinner wrote:
> > On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > [...]
> > 
> > Seems wrong to me - this function used to have lots of callers and
> > over time we've slowly removed them or replaced them with something
> > else. I'd suggest removing it completely, replacing the call sites
> > with kmem_zalloc_large().
> 
> Heh.  I thought the reason why _greedy still exists (for its sole user
> bulkstat) is that bulkstat had the flexibility to deal with receiving
> 0, 1, or 4 pages.  So yeah, we could just kill it.

irbuf is sized to minimise AGI locking, but if memory is low
it just uses what it can get. Keep in mind the number of inodes we
need to process is determined by the userspace buffer size, which
can easily be sized to hold tens of thousands of struct
xfs_bulkstat.

> But thinking even more stingily about memory, are there applications
> that care about being able to bulkstat 16384 inodes at once?

IIRC, xfsdump can bulkstat up to 64k inodes per call....

> How badly
> does bulkstat need to be able to bulk-process more than a page's worth
> of inobt records, anyway?

Benchmark it on a busy system doing lots of other AGI work (e.g. a
busy NFS server workload with a working set of tens of millions of
inodes so it doesn't fit in cache) and find out. That's generally
how I answer those sorts of questions...

Cheers,

Dave.
Michal Hocko March 6, 2017, 1:21 p.m. UTC | #8
On Sat 04-03-17 09:54:44, Dave Chinner wrote:
> On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> > ("vmalloc: back off when the current task is killed") when the current
> > task is killed. The later one makes the failure scenario much more
> > probable than it used to be because it makes vmalloc() failures
> > permanent for tasks with fatal signals pending.. Fix this by bailing out
> > if the minimum size request failed.
> > 
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > 
> > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > fsstress cpuset=/ mems_allowed=0-1
> > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> > Call Trace:
> >  dump_stack+0x63/0x87
> >  warn_alloc+0x114/0x1c0
> >  ? alloc_pages_current+0x88/0x120
> >  __vmalloc_node_range+0x250/0x2a0
> >  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  ? free_hot_cold_page+0x21f/0x280
> >  vzalloc+0x54/0x60
> >  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  xfs_bulkstat+0x11b/0x730 [xfs]
> >  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> >  ? selinux_capable+0x20/0x30
> >  ? security_capable+0x48/0x60
> >  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> >  xfs_file_ioctl+0x9dd/0xad0 [xfs]
> >  ? do_filp_open+0xa5/0x100
> >  do_vfs_ioctl+0xa7/0x5e0
> >  SyS_ioctl+0x79/0x90
> >  do_syscall_64+0x67/0x180
> >  entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> > 
> > Reported-by: Xiong Zhou <xzhou@redhat.com>
> > Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/xfs/kmem.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  	size_t		kmsize = maxsize;
> >  
> >  	while (!(ptr = vzalloc(kmsize))) {
> > +		if (kmsize == minsize)
> > +			break;
> >  		if ((kmsize >>= 1) <= minsize)
> >  			kmsize = minsize;
> >  	}
> 
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().

I do not really care how this gets fixed. Dropping kmem_zalloc_greedy
sounds like a way to go. I am not familiar enough with xfs_bulkstat to
make an educated guess about which allocation size to use, so I will
have to leave that to you guys if you prefer that route.

Thanks!

Patch

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..ee95f5c6db45 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -34,6 +34,8 @@  kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
 	size_t		kmsize = maxsize;
 
 	while (!(ptr = vzalloc(kmsize))) {
+		if (kmsize == minsize)
+			break;
 		if ((kmsize >>= 1) <= minsize)
 			kmsize = minsize;
 	}