Message ID | 20190910151748.914-1-idryomov@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc() | expand |
On Tue, Sep 10, 2019 at 05:17:48PM +0200, Ilya Dryomov wrote: > The vmalloc allocator doesn't fully respect the specified gfp mask: > while the actual pages are allocated as requested, the page table pages > are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called > with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may > result in a deadlock. > > There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic, > it's just something that seemed sensible at the time (ceph_kvmalloc() > predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce > long term fragmentation, it first tries to kmalloc non-disruptively. > > Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using > the scope API to avoid the deadlock. Note that kvmalloc() needs to be > passed GFP_KERNEL to enable the fallback. If you can please just stop using GFP_NOFS altogether and set PF_MEMALLOC_* for the actual contexts.
On Wed, Sep 11, 2019 at 8:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Sep 10, 2019 at 05:17:48PM +0200, Ilya Dryomov wrote: > > The vmalloc allocator doesn't fully respect the specified gfp mask: > > while the actual pages are allocated as requested, the page table pages > > are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called > > with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may > > result in a deadlock. > > > > There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic, > > it's just something that seemed sensible at the time (ceph_kvmalloc() > > predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce > > long term fragmentation, it first tries to kmalloc non-disruptively. > > > > Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using > > the scope API to avoid the deadlock. Note that kvmalloc() needs to be > > passed GFP_KERNEL to enable the fallback. > > If you can please just stop using GFP_NOFS altogether and set > PF_MEMALLOC_* for the actual contexts. Hi Christoph, ceph_kvmalloc() is indirectly called from dozens of places, everywhere a new RPC message is allocated. Some of them are used for client setup and don't need a scope (GFP_KERNEL is fine), but the vast majority do. I don't think wrapping each call is practical. As for getting rid of GFP_NOFS and GFP_NOIO entirely (i.e. dropping the gfp mask from all libceph APIs and using scopes instead), it's something that I have had in the back of my head for a while now because we cheat in a few places and hard-code GFP_NOIO as the lowest common denominator instead of properly propagating the gfp mask. It's more of a project though, and won't be backportable. Thanks, Ilya
On Tue, 2019-09-10 at 17:17 +0200, Ilya Dryomov wrote: > The vmalloc allocator doesn't fully respect the specified gfp mask: > while the actual pages are allocated as requested, the page table pages > are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called > with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may > result in a deadlock. > > There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic, > it's just something that seemed sensible at the time (ceph_kvmalloc() > predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce > long term fragmentation, it first tries to kmalloc non-disruptively. > > Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using > the scope API to avoid the deadlock. Note that kvmalloc() needs to be > passed GFP_KERNEL to enable the fallback. > > Signed-off-by: Ilya Dryomov <idryomov@gmail.com> > --- > net/ceph/ceph_common.c | 29 +++++++++++++++++++++++------ > 1 file changed, 23 insertions(+), 6 deletions(-) > > diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c > index c41789154cdb..970e74b46213 100644 > --- a/net/ceph/ceph_common.c > +++ b/net/ceph/ceph_common.c > @@ -13,6 +13,7 @@ > #include <linux/nsproxy.h> > #include <linux/fs_parser.h> > #include <linux/sched.h> > +#include <linux/sched/mm.h> > #include <linux/seq_file.h> > #include <linux/slab.h> > #include <linux/statfs.h> > @@ -185,18 +186,34 @@ int ceph_compare_options(struct ceph_options *new_opt, > } > EXPORT_SYMBOL(ceph_compare_options); > > +/* > + * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are > + * compatible with (a superset of) GFP_KERNEL. This is because while the > + * actual pages are allocated with the specified flags, the page table pages > + * are always allocated with GFP_KERNEL. map_vm_area() doesn't even take > + * flags because GFP_KERNEL is hard-coded in {p4d,pud,pmd,pte}_alloc(). > + * > + * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO. > + */ > void *ceph_kvmalloc(size_t size, gfp_t flags) > { > - if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { > - void *ptr = kmalloc(size, flags | __GFP_NOWARN); > - if (ptr) > - return ptr; > + void *p; > + > + if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) { > + p = kvmalloc(size, flags); > + } else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) { > + unsigned int nofs_flag = memalloc_nofs_save(); > + p = kvmalloc(size, GFP_KERNEL); > + memalloc_nofs_restore(nofs_flag); > + } else { > + unsigned int noio_flag = memalloc_noio_save(); > + p = kvmalloc(size, GFP_KERNEL); > + memalloc_noio_restore(noio_flag); > } > > - return __vmalloc(size, flags, PAGE_KERNEL); > + return p; > } > > - > static int parse_fsid(const char *str, struct ceph_fsid *fsid) > { > int i = 0; Reviewed-by: Jeff Layton <jlayton@kernel.org>
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index c41789154cdb..970e74b46213 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -13,6 +13,7 @@ #include <linux/nsproxy.h> #include <linux/fs_parser.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/seq_file.h> #include <linux/slab.h> #include <linux/statfs.h> @@ -185,18 +186,34 @@ int ceph_compare_options(struct ceph_options *new_opt, } EXPORT_SYMBOL(ceph_compare_options); +/* + * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are + * compatible with (a superset of) GFP_KERNEL. This is because while the + * actual pages are allocated with the specified flags, the page table pages + * are always allocated with GFP_KERNEL. map_vm_area() doesn't even take + * flags because GFP_KERNEL is hard-coded in {p4d,pud,pmd,pte}_alloc(). + * + * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO. + */ void *ceph_kvmalloc(size_t size, gfp_t flags) { - if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { - void *ptr = kmalloc(size, flags | __GFP_NOWARN); - if (ptr) - return ptr; + void *p; + + if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) { + p = kvmalloc(size, flags); + } else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) { + unsigned int nofs_flag = memalloc_nofs_save(); + p = kvmalloc(size, GFP_KERNEL); + memalloc_nofs_restore(nofs_flag); + } else { + unsigned int noio_flag = memalloc_noio_save(); + p = kvmalloc(size, GFP_KERNEL); + memalloc_noio_restore(noio_flag); } - return __vmalloc(size, flags, PAGE_KERNEL); + return p; } - static int parse_fsid(const char *str, struct ceph_fsid *fsid) { int i = 0;
The vmalloc allocator doesn't fully respect the specified gfp mask: while the actual pages are allocated as requested, the page table pages are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may result in a deadlock. There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic, it's just something that seemed sensible at the time (ceph_kvmalloc() predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce long term fragmentation, it first tries to kmalloc non-disruptively. Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using the scope API to avoid the deadlock. Note that kvmalloc() needs to be passed GFP_KERNEL to enable the fallback. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> --- net/ceph/ceph_common.c | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-)