Message ID | 20170206140718.16222-5-mhocko@kernel.org (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > +++ b/fs/xfs/xfs_buf.c > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > bp->b_addr = NULL; > } else { > int retried = 0; > - unsigned noio_flag; > + unsigned nofs_flag; > > /* > * vm_map_ram() will allocate auxillary structures (e.g. > * pagetables) with GFP_KERNEL, yet we are likely to be under > * GFP_NOFS context here. Hence we need to tell memory reclaim > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > * memory reclaim re-entering the filesystem here and > * potentially deadlocking. > */ This comment feels out of date ... how about: /* * vm_map_ram will allocate auxiliary structures (eg page * tables) with GFP_KERNEL. If that tries to reclaim memory * by calling back into this filesystem, we may deadlock. * Prevent that by setting the NOFS flag. */ > - noio_flag = memalloc_noio_save(); > + nofs_flag = memalloc_nofs_save(); > do { > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, > -1, PAGE_KERNEL); Also, I think it shows that this is the wrong place in XFS to be calling memalloc_nofs_save(). I'm not arguing against including this patch; it's a step towards where we want to be. I also don't know XFS well enough to know where to set that flag ;-) Presumably when we start a transaction ... ? -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > +++ b/fs/xfs/xfs_buf.c > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > bp->b_addr = NULL; > > } else { > > int retried = 0; > > - unsigned noio_flag; > > + unsigned nofs_flag; > > > > /* > > * vm_map_ram() will allocate auxillary structures (e.g. > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > * memory reclaim re-entering the filesystem here and > > * potentially deadlocking. > > */ > > This comment feels out of date ... how about: which part is out of date? > > /* > * vm_map_ram will allocate auxiliary structures (eg page > * tables) with GFP_KERNEL. If that tries to reclaim memory > * by calling back into this filesystem, we may deadlock. > * Prevent that by setting the NOFS flag. > */ dunno, the previous wording seems clear enough to me. Maybe little bit more chatty than yours but I am not sure this is worth changing. > > > - noio_flag = memalloc_noio_save(); > > + nofs_flag = memalloc_nofs_save(); > > do { > > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, > > -1, PAGE_KERNEL); > > Also, I think it shows that this is the wrong place in XFS to be calling > memalloc_nofs_save(). I'm not arguing against including this patch; > it's a step towards where we want to be. I also don't know XFS well > enough to know where to set that flag ;-) Presumably when we start a > transaction ... ? Yes that is what I would like to achieve longterm. And the reason why I didn't want to mimic this pattern in kvmalloc as some have suggested. It just takes much more time to get there from the past experience and we should really start somewhere.
On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > +++ b/fs/xfs/xfs_buf.c > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > bp->b_addr = NULL; > > > } else { > > > int retried = 0; > > > - unsigned noio_flag; > > > + unsigned nofs_flag; > > > > > > /* > > > * vm_map_ram() will allocate auxillary structures (e.g. > > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > > * memory reclaim re-entering the filesystem here and > > > * potentially deadlocking. > > > */ > > > > This comment feels out of date ... how about: > > which part is out of date? > > > > > /* > > * vm_map_ram will allocate auxiliary structures (eg page > > * tables) with GFP_KERNEL. If that tries to reclaim memory > > * by calling back into this filesystem, we may deadlock. > > * Prevent that by setting the NOFS flag. > > */ > > dunno, the previous wording seems clear enough to me. Maybe little bit > more chatty than yours but I am not sure this is worth changing. I prefer to keep the "...yet we are likely to be under GFP_NOFS..." wording of the old comment because it captures the uncertainty of whether or not we actually are already under NOFS. If someone actually has audited this code well enough to know for sure then yes let's change the comment, but I haven't gone that far. The way the kmem_zalloc_large code is structured suggests to me that callers don't have to be especially aware of the NOFS state -- they can just call the function and it'll take care of making it work. > > > > > - noio_flag = memalloc_noio_save(); > > > + nofs_flag = memalloc_nofs_save(); > > > do { > > > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, > > > -1, PAGE_KERNEL); > > > > Also, I think it shows that this is the wrong place in XFS to be calling > > memalloc_nofs_save(). I'm not arguing against including this patch; > > it's a step towards where we want to be. I also don't know XFS well > > enough to know where to set that flag ;-) Presumably when we start a > > transaction ... ? None of the current kmem_zalloc_large callers actually have a transaction, at least not at that point. > Yes that is what I would like to achieve longterm. And the reason why I > didn't want to mimic this pattern in kvmalloc as some have suggested. > It just takes much more time to get there from the past experience and > we should really start somewhere. --D > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > > +++ b/fs/xfs/xfs_buf.c > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > > bp->b_addr = NULL; > > > > } else { > > > > int retried = 0; > > > > - unsigned noio_flag; > > > > + unsigned nofs_flag; > > > > > > > > /* > > > > * vm_map_ram() will allocate auxillary structures (e.g. > > > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > > > * memory reclaim re-entering the filesystem here and > > > > * potentially deadlocking. > > > > */ > > > > > > This comment feels out of date ... how about: > > > > which part is out of date? > > > > > > > > /* > > > * vm_map_ram will allocate auxiliary structures (eg page > > > * tables) with GFP_KERNEL. If that tries to reclaim memory > > > * by calling back into this filesystem, we may deadlock. > > > * Prevent that by setting the NOFS flag. > > > */ > > > > dunno, the previous wording seems clear enough to me. Maybe little bit > > more chatty than yours but I am not sure this is worth changing. > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > wording of the old comment because it captures the uncertainty of > whether or not we actually are already under NOFS. If someone actually > has audited this code well enough to know for sure then yes let's change > the comment, but I haven't gone that far. I believe we can drop the memalloc_nofs_save then as well because either we are called from a potentially dangerous context and thus we are in the nofs scope we we do not need the protection at all.
On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote: > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > > > bp->b_addr = NULL; > > > > > } else { > > > > > int retried = 0; > > > > > - unsigned noio_flag; > > > > > + unsigned nofs_flag; > > > > > > > > > > /* > > > > > * vm_map_ram() will allocate auxillary structures (e.g. > > > > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > > > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > > > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > > > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > > > > * memory reclaim re-entering the filesystem here and > > > > > * potentially deadlocking. > > > > > */ > > > > > > > > This comment feels out of date ... how about: > > > > > > which part is out of date? > > > > > > > > > > > /* > > > > * vm_map_ram will allocate auxiliary structures (eg page > > > > * tables) with GFP_KERNEL. If that tries to reclaim memory > > > > * by calling back into this filesystem, we may deadlock. > > > > * Prevent that by setting the NOFS flag. > > > > */ > > > > > > dunno, the previous wording seems clear enough to me. Maybe little bit > > > more chatty than yours but I am not sure this is worth changing. > > > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > > wording of the old comment because it captures the uncertainty of > > whether or not we actually are already under NOFS. If someone actually > > has audited this code well enough to know for sure then yes let's change > > the comment, but I haven't gone that far. Ugh, /me hands himself another cup of coffee... Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this discussion. Probably because they have similar code snippets with very similar comments to two totally different parts of xfs. The _xfs_buf_map_pages can be called inside or outside of transaction context, so I think we still have to memalloc_nofs_save for that to avoid the lockdep complaints and deadlocks referenced in the commit that added all that (to _xfs_buf_map_pages) in the first place. ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram") My comments about kmem_zalloc_large still stand, even though the part of the patch you two were discussing was the _xfs_buf_map_pages. I probably should have clarified that I think both functions actually /are/ doing the right thing wrt calling (or not calling) memalloc_nofs_save(). > I believe we can drop the memalloc_nofs_save then as well because either > we are called from a potentially dangerous context and thus we are in > the nofs scope we we do not need the protection at all. Uh, now that I've muddied up the waters, which part are you referring to? --D > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 06-02-17 11:51:11, Darrick J. Wong wrote: > On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote: > > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: > > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > > > > bp->b_addr = NULL; > > > > > > } else { > > > > > > int retried = 0; > > > > > > - unsigned noio_flag; > > > > > > + unsigned nofs_flag; > > > > > > > > > > > > /* > > > > > > * vm_map_ram() will allocate auxillary structures (e.g. > > > > > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > > > > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > > > > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > > > > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > > > > > * memory reclaim re-entering the filesystem here and > > > > > > * potentially deadlocking. > > > > > > */ > > > > > > > > > > This comment feels out of date ... how about: > > > > > > > > which part is out of date? > > > > > > > > > > > > > > /* > > > > > * vm_map_ram will allocate auxiliary structures (eg page > > > > > * tables) with GFP_KERNEL. If that tries to reclaim memory > > > > > * by calling back into this filesystem, we may deadlock. > > > > > * Prevent that by setting the NOFS flag. > > > > > */ > > > > > > > > dunno, the previous wording seems clear enough to me. Maybe little bit > > > > more chatty than yours but I am not sure this is worth changing. > > > > > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > > > wording of the old comment because it captures the uncertainty of > > > whether or not we actually are already under NOFS. If someone actually > > > has audited this code well enough to know for sure then yes let's change > > > the comment, but I haven't gone that far. > > Ugh, /me hands himself another cup of coffee... > > Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this > discussion. Probably because they have similar code snippets with very > similar comments to two totally different parts of xfs. > > The _xfs_buf_map_pages can be called inside or outside of > transaction context, so I think we still have to memalloc_nofs_save for > that to avoid the lockdep complaints and deadlocks referenced in the > commit that added all that (to _xfs_buf_map_pages) in the first place. > ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram") Yes, and that memalloc_nofs_save would start with the transaction context so this (_xfs_buf_map_pages) call would be already covered so additional memalloc_nofs_save would be unnecessary. Right now I am not sure whether this is always the case so I have kept this "just to be sure" measure. Checking that would be in the next step when I would like to remove other KM_NOFS usage so that we would always rely on the scope inside the transaction or other potentially dangerous (e.g. from the stack usage POV and who knows what else) contexts. Does that make more sense now?
On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote: > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > > > bp->b_addr = NULL; > > > > > } else { > > > > > int retried = 0; > > > > > - unsigned noio_flag; > > > > > + unsigned nofs_flag; > > > > > > > > > > /* > > > > > * vm_map_ram() will allocate auxillary structures (e.g. > > > > > * pagetables) with GFP_KERNEL, yet we are likely to be under > > > > > * GFP_NOFS context here. Hence we need to tell memory reclaim > > > > > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > > > > > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent > > > > > * memory reclaim re-entering the filesystem here and > > > > > * potentially deadlocking. > > > > > */ > > > > > > > > This comment feels out of date ... how about: > > > > > > which part is out of date? > > > > > > > > > > > /* > > > > * vm_map_ram will allocate auxiliary structures (eg page > > > > * tables) with GFP_KERNEL. If that tries to reclaim memory > > > > * by calling back into this filesystem, we may deadlock. > > > > * Prevent that by setting the NOFS flag. > > > > */ > > > > > > dunno, the previous wording seems clear enough to me. Maybe little bit > > > more chatty than yours but I am not sure this is worth changing. > > > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > > wording of the old comment because it captures the uncertainty of > > whether or not we actually are already under NOFS. If someone actually > > has audited this code well enough to know for sure then yes let's change > > the comment, but I haven't gone that far. > > I believe we can drop the memalloc_nofs_save then as well because either > we are called from a potentially dangerous context and thus we are in > the nofs scope we we do not need the protection at all. No, absolutely not. "Belief" is not a sufficient justification for removing low level deadlock avoidance infrastructure. This code needs to remain in _xfs_buf_map_pages() until a full audit of the caller paths is done and we're 100% certain that there are no lurking deadlocks. For example, I'm pretty sure we can call into _xfs_buf_map_pages() outside of a transaction context but with an inode ILOCK held exclusively. If we then recurse into memory reclaim and try to run a transaction during reclaim, we have an inverted ILOCK vs transaction locking order. i.e. we are not allowed to call xfs_trans_reserve() with an ILOCK held as that can deadlock the log: log full, locked inode pins tail of log, inode cannot be flushed because ILOCK is held by caller waiting for log space to become available.... i.e. there are certain situations where holding a ILOCK is a deadlock vector. See xfs_lock_inodes() for an example of the lengths we go to avoid ILOCK based log deadlocks like this... Cheers, Dave.
On Tue 07-02-17 09:51:50, Dave Chinner wrote: > On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote: > > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: [...] > > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > > > wording of the old comment because it captures the uncertainty of > > > whether or not we actually are already under NOFS. If someone actually > > > has audited this code well enough to know for sure then yes let's change > > > the comment, but I haven't gone that far. > > > > I believe we can drop the memalloc_nofs_save then as well because either > > we are called from a potentially dangerous context and thus we are in > > the nofs scope we we do not need the protection at all. > > No, absolutely not. "Belief" is not a sufficient justification for > removing low level deadlock avoidance infrastructure. This code > needs to remain in _xfs_buf_map_pages() until a full audit of the > caller paths is done and we're 100% certain that there are no > lurking deadlocks. Exactly. I was actually refering to "If someone actually has audited this code" above... So I definitely do not want to justify anything based on the belief > For example, I'm pretty sure we can call into _xfs_buf_map_pages() > outside of a transaction context but with an inode ILOCK held > exclusively. If we then recurse into memory reclaim and try to run a > transaction during reclaim, we have an inverted ILOCK vs transaction > locking order. i.e. we are not allowed to call xfs_trans_reserve() > with an ILOCK held as that can deadlock the log: log full, locked > inode pins tail of log, inode cannot be flushed because ILOCK is > held by caller waiting for log space to become available.... > > i.e. there are certain situations where holding a ILOCK is a > deadlock vector. See xfs_lock_inodes() for an example of the lengths > we go to avoid ILOCK based log deadlocks like this... Thanks for the reference. This is really helpful!
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c index a76a05dae96b..0c9f94f41b6c 100644 --- a/fs/xfs/kmem.c +++ b/fs/xfs/kmem.c @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags) void * kmem_zalloc_large(size_t size, xfs_km_flags_t flags) { - unsigned noio_flag = 0; + unsigned nofs_flag = 0; void *ptr; gfp_t lflags; @@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags) * __vmalloc() will allocate data pages and auxillary structures (e.g. * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context * here. Hence we need to tell memory reclaim that we are in such a - * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering + * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering * the filesystem here and potentially deadlocking. */ - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) - noio_flag = memalloc_noio_save(); + if (flags & KM_NOFS) + nofs_flag = memalloc_nofs_save(); lflags = kmem_flags_convert(flags); ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) - memalloc_noio_restore(noio_flag); + if (flags & KM_NOFS) + memalloc_nofs_restore(nofs_flag); return ptr; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 8c7d01b75922..676a9ae75b9a 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -442,17 +442,17 @@ _xfs_buf_map_pages( bp->b_addr = NULL; } else { int retried = 0; - unsigned noio_flag; + unsigned nofs_flag; /* * vm_map_ram() will allocate auxillary structures (e.g. * pagetables) with GFP_KERNEL, yet we are likely to be under * GFP_NOFS context here. Hence we need to tell memory reclaim - * that we are in such a context via PF_MEMALLOC_NOIO to prevent + * that we are in such a context via PF_MEMALLOC_NOFS to prevent * memory reclaim re-entering the filesystem here and * potentially deadlocking. */ - noio_flag = memalloc_noio_save(); + nofs_flag = memalloc_nofs_save(); do { bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1, PAGE_KERNEL); @@ -460,7 +460,7 @@ _xfs_buf_map_pages( break; vm_unmap_aliases(); } while (retried++ <= 1); - memalloc_noio_restore(noio_flag); + memalloc_nofs_restore(nofs_flag); if (!bp->b_addr) return -ENOMEM;