[8/8] Revert "ext4: fix wrong gfp type under transaction"

Message ID 20170106141107.23953-9-mhocko@kernel.org (mailing list archive)
State Superseded, archived

Commit Message

Michal Hocko Jan. 6, 2017, 2:11 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
the transaction context uses memalloc_nofs_save and all allocations
within this context inherit GFP_NOFS automatically, there is no
reason to mark specific allocations explicitly.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code
to make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/acl.c     | 6 +++---
 fs/ext4/extents.c | 2 +-
 fs/ext4/resize.c  | 4 ++--
 fs/ext4/xattr.c   | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

Comments

Theodore Ts'o Jan. 17, 2017, 2:56 a.m. UTC | #1
On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> the transaction context uses memalloc_nofs_save and all allocations
> within this context inherit GFP_NOFS automatically, there is no
> reason to mark specific allocations explicitly.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> to make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Changes in the jbd2 layer aren't going to guarantee that
memalloc_nofs_save() will be executed if we are running ext4 without a
journal (aka in no journal mode).  And this is a *very* common
configuration; it's how ext4 is used inside Google in our production
servers.

So that means the earlier patches will probably need to be changed so
the NOFS scope is done in the ext4_journal_{start,stop} functions in
fs/ext4/ext4_jbd2.c.
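
To make the suggestion concrete, here is a minimal userspace sketch of a NOFS scope entered and left in journal start/stop wrappers. It is only a model: every model_* name and constant is hypothetical, and the handle-based storage of the saved state is an assumption for illustration. In the kernel the corresponding pair is memalloc_nofs_save() and memalloc_nofs_restore().

```c
#include <assert.h>

/* Hypothetical stand-ins for GFP bits; the values are illustrative only. */
#define MODEL_GFP_FS      0x1u  /* allocation may recurse into the filesystem */
#define MODEL_GFP_IO      0x2u  /* allocation may start I/O */
#define MODEL_GFP_KERNEL  (MODEL_GFP_FS | MODEL_GFP_IO)
#define MODEL_GFP_NOFS    (MODEL_GFP_IO)

/* Hypothetical per-task flag playing the role of PF_MEMALLOC_NOFS. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

struct model_task {
	unsigned int flags;
};

/* Assumption: the saved scope state lives in the handle. */
struct model_handle {
	unsigned int saved;
};

/* Enter the scope and return the previous state so that scopes can nest. */
static inline unsigned int model_nofs_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_MEMALLOC_NOFS;

	t->flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope by putting back exactly what save() returned. */
static inline void model_nofs_restore(struct model_task *t, unsigned int old)
{
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Inside a scope every request silently loses the FS bit. */
static inline unsigned int model_effective_gfp(const struct model_task *t,
					       unsigned int gfp)
{
	if (t->flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Wrappers in the spirit of ext4_journal_{start,stop}: the scope is entered
 * here regardless of whether a real journal sits underneath. */
static inline void model_journal_start(struct model_task *t,
				       struct model_handle *h)
{
	h->saved = model_nofs_save(t);
}

static inline void model_journal_stop(struct model_task *t,
				      struct model_handle *h)
{
	model_nofs_restore(t, h->saved);
}
```

With this model, a GFP_KERNEL-style request made between start and stop is treated as NOFS, and a nested start/stop pair leaves the outer scope intact. The sketch sidesteps the hard part of the real problem, namely where to keep the saved state when no handle exists.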

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Jan. 17, 2017, 8:24 a.m. UTC | #2
On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > the transaction context uses memalloc_nofs_save and all allocations
> > within this context inherit GFP_NOFS automatically, there is no
> > reason to mark specific allocations explicitly.
> > 
> > This patch should not introduce any functional change. The main point
> > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > to make the review of the remaining usage easier.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Changes in the jbd2 layer aren't going to guarantee that
> memalloc_nofs_save() will be executed if we are running ext4 without a
> journal (aka in no journal mode).  And this is a *very* common
> configuration; it's how ext4 is used inside Google in our production
> servers.

OK, I wasn't aware of that.

> So that means the earlier patches will probably need to be changed so
> the NOFS scope is done in the ext4_journal_{start,stop} functions in
> fs/ext4/ext4_jbd2.c.

I would definitely appreciate some help here. The call paths are rather
complex and I am not familiar enough with the code. One of the biggest
problems I have currently is that there doesn't seem to be an easy place
to store the old allocation context. The original patch had it inside
the journal handle. I was thinking about putting it into the superblock but
ext4_journal_stop doesn't seem to have access to the sb if there is no
handle. Now, if ext4_journal_start is never called from a nested context
then this is not a big deal but there are just too many callers to
check...
Michal Hocko Jan. 17, 2017, 3:18 p.m. UTC | #3
On Tue 17-01-17 09:24:25, Michal Hocko wrote:
> On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > > the transaction context uses memalloc_nofs_save and all allocations
> > > within this context inherit GFP_NOFS automatically, there is no
> > > reason to mark specific allocations explicitly.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > > to make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > Changes in the jbd2 layer aren't going to guarantee that
> > memalloc_nofs_save() will be executed if we are running ext4 without a
> > journal (aka in no journal mode).  And this is a *very* common
> > configuration; it's how ext4 is used inside Google in our production
> > servers.
> 
> OK, I wasn't aware of that.
> 
> > So that means the earlier patches will probably need to be changed so
> > the NOFS scope is done in the ext4_journal_{start,stop} functions in
> > fs/ext4/ext4_jbd2.c.
> 
> I would definitely appreciate some help here. The call paths are rather
> complex and I am not familiar enough with the code. One of the biggest
> problems I have currently is that there doesn't seem to be an easy place
> to store the old allocation context.

OK, so I've been staring into the code and AFAIU current->journal_info
can contain my stored information. I could either hijack part of the
word as the ref counting is only consuming low 12b. But that looks too
ugly to live. Or I can allocate some placeholder.
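
For illustration, the "hijack part of the word" option could look roughly like the userspace sketch below. It assumes, as stated above, that the reference count needs only the low 12 bits of the word, and it places a small context field directly above them. The field width of two bits and all names are hypothetical; this is not how ext4 lays out current->journal_info.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 12 bits hold the nesting count, the next 2 bits hold
 * the saved allocation context. Names and widths are illustrative only. */
#define REF_BITS   12u
#define REF_MASK   (((uintptr_t)1 << REF_BITS) - 1)
#define CTX_SHIFT  REF_BITS
#define CTX_BITS   2u
#define CTX_MASK   ((((uintptr_t)1 << CTX_BITS) - 1) << CTX_SHIFT)

/* Extract the nesting count. */
static inline uintptr_t word_ref(uintptr_t w)
{
	return w & REF_MASK;
}

/* Extract the saved context bits. */
static inline unsigned int word_ctx(uintptr_t w)
{
	return (unsigned int)((w & CTX_MASK) >> CTX_SHIFT);
}

/* Take a reference; the count must not overflow into the context field. */
static inline uintptr_t word_get(uintptr_t w)
{
	assert(word_ref(w) < REF_MASK);
	return w + 1;
}

/* Drop a reference; the count must not underflow. */
static inline uintptr_t word_put(uintptr_t w)
{
	assert(word_ref(w) > 0);
	return w - 1;
}

/* Store a new context value without disturbing the count. */
static inline uintptr_t word_set_ctx(uintptr_t w, unsigned int ctx)
{
	assert(ctx <= (CTX_MASK >> CTX_SHIFT));
	return (w & ~CTX_MASK) | ((uintptr_t)ctx << CTX_SHIFT);
}
```

The two asserts in word_get() and word_put() are what keep the fields from corrupting each other, which is part of why the packed-word variant is fragile compared with a separately allocated placeholder.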

But before going to play with that I am really wondering whether we need
all this with no journal at all. AFAIU what Jack told me it is the
journal lock(s) which is the biggest problem from the reclaim recursion
point of view. What would cause a deadlock in no journal mode?
Theodore Ts'o Jan. 17, 2017, 3:59 p.m. UTC | #4
On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> 
> OK, so I've been staring into the code and AFAIU current->journal_info
> can contain my stored information. I could either hijack part of the
> word as the ref counting is only consuming low 12b. But that looks too
> ugly to live. Or I can allocate some placeholder.

Yeah, I was looking at something similar.  Can you guarantee that the
context will only take one or two bits?  (Looks like it only needs one
bit ATM, even though at the moment you're storing the whole GFP mask,
correct?)

> But before going to play with that I am really wondering whether we need
> all this with no journal at all. AFAIU what Jack told me it is the
> journal lock(s) which is the biggest problem from the reclaim recursion
> point of view. What would cause a deadlock in no journal mode?

We still have the original problem for why we need GFP_NOFS even in
ext2.  If we are in a writeback path, and we need to allocate memory,
we don't want to recurse back into the file system's writeback path.
Certainly not for the same inode, and while we could make it work if
the mm was writing back another inode, or another superblock, there
are also stack depth considerations that would make this be a bad
idea.  So we do need to be able to assert GFP_NOFS even in no journal
mode, and for any file system including ext2, for that matter.

Because of the fact that we're going to have to play games with
current->journal_info, maybe this is something that I should take
responsibility for, and to go through the ext4 tree after the main
patch series go through?  Maybe you could use xfs and ext2 as sample
(simple) implementations?

My only ask is that the memalloc nofs context be a well defined N
bits, where N < 16, and I'll find some place to put them (probably
journal_info).

Thanks,

					- Ted
Michal Hocko Jan. 17, 2017, 4:16 p.m. UTC | #5
On Tue 17-01-17 10:59:16, Theodore Ts'o wrote:
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> > 
> > OK, so I've been staring into the code and AFAIU current->journal_info
> > can contain my stored information. I could either hijack part of the
> > word as the ref counting is only consuming low 12b. But that looks too
> > ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)

No, I am just storing PF_MEMALLOC_NO{FS,IO} but I assume future changes
might want to pull more state into the scope context.

> > But before going to play with that I am really wondering whether we need
> > all this with no journal at all. AFAIU what Jack told me it is the
> > journal lock(s) which is the biggest problem from the reclaim recursion
> > point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.

But we do not enter the writeback path from the direct reclaim. Or do
you mean something other than pageout()'s mapping->a_ops->writepage?
There is only try_to_release_page where we get back to the filesystems
but I do not see any NOFS protection in ext4_releasepage.

> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and to go through the ext4 tree after the main
> patch series go through?

What do you think about handling nojournal mode on top of
"[PATCH 5/8] jbd2: mark the transaction context with the scope
GFP_NOFS context" in a separate patch?

But anyway, I agree that we should go with the API sooner rather than
later.

>   Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I am pretty sure that we won't need more than a bit or two in the
foreseeable future (I can think of GFP_NOWAIT being one candidate).
Jan Kara Jan. 17, 2017, 5:29 p.m. UTC | #6
On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > But before going to play with that I am really wondering whether we need
> > > all this with no journal at all. AFAIU what Jack told me it is the
> > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > point of view. What would cause a deadlock in no journal mode?
> > 
> > We still have the original problem for why we need GFP_NOFS even in
> > ext2.  If we are in a writeback path, and we need to allocate memory,
> > we don't want to recurse back into the file system's writeback path.
> 
> But we do not enter the writeback path from the direct reclaim. Or do
> you mean something other than pageout()'s mapping->a_ops->writepage?
> There is only try_to_release_page where we get back to the filesystems
> but I do not see any NOFS protection in ext4_releasepage.

Maybe to expand a bit: These days, direct reclaim can call the ->releasepage()
callback, the ->evict_inode() callback (and only for inodes with i_nlink > 0),
and shrinkers. That's it. So the recursion possibilities are rather more limited
than they used to be several years ago and we likely do not need as much
GFP_NOFS protection as we used to.

								Honza
Andreas Dilger Jan. 17, 2017, 9:04 p.m. UTC | #7
On Jan 17, 2017, at 8:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
>> 
>> OK, so I've been staring into the code and AFAIU current->journal_info
>> can contain my stored information. I could either hijack part of the
>> word as the ref counting is only consuming low 12b. But that looks too
>> ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)
> 
>> But before going to play with that I am really wondering whether we need
>> all this with no journal at all. AFAIU what Jack told me it is the
>> journal lock(s) which is the biggest problem from the reclaim recursion
>> point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.
> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and to go through the ext4 tree after the main
> patch series go through?  Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I think Dave was suggesting that the NOFS context allow a pointer to
an arbitrary struct, so that it is possible to dereference this in
the filesystem itself to determine if the recursion is safe or not.
That way, ext2 could store an inode pointer (if that is what it cares
about) and verify that writeback is not recursing on the same inode,
and XFS can store something different.  It would also need to store
some additional info (e.g. fstype or superblock pointer) so that it
can determine how to interpret the NOFS context pointer.

I think it makes sense to add a couple of void * pointers to the task
struct along with journal_info and leave it up to the filesystem to
determine how to use them.
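
A rough userspace sketch of that idea follows, under loud assumptions: the context is a small struct carrying an owner tag (here a superblock pointer) plus an owner-defined private pointer, and the policy shown is a hypothetical ext2-like one that only rejects recursion into the same inode. All model_* names are invented for illustration and none of this is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a superblock and an inode. */
struct model_sb {
	const char *fs_name;
};

struct model_inode {
	struct model_sb *sb;
	unsigned long ino;
};

/* The context a filesystem would hang off the task: the owner tag tells
 * the reader how to interpret the private pointer. */
struct model_nofs_ctx {
	const struct model_sb *owner;  /* filesystem that set the context */
	const void *private;           /* owner-defined; an inode in this sketch */
};

/* Hypothetical ext2-like policy: recursion is refused for the inode that
 * set the context, and also whenever the context belongs to another
 * filesystem, because the private pointer cannot be interpreted then. */
static inline bool model_recursion_safe(const struct model_nofs_ctx *ctx,
					const struct model_inode *target)
{
	assert(target != NULL);

	if (ctx == NULL)
		return true;            /* no scope active */
	if (ctx->owner != target->sb)
		return false;           /* foreign context: be conservative */
	return ctx->private != (const void *)target;
}
```

The conservative answer for a foreign context is a design choice of this sketch; given the stack depth concerns raised earlier in the thread, a real policy might well refuse more cases than this one does.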

Cheers, Andreas
Michal Hocko Jan. 18, 2017, 8:29 a.m. UTC | #8
On Tue 17-01-17 14:04:03, Andreas Dilger wrote:
> On Jan 17, 2017, at 8:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > 
> > On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> >> 
> >> OK, so I've been staring into the code and AFAIU current->journal_info
> >> can contain my stored information. I could either hijack part of the
> >> word as the ref counting is only consuming low 12b. But that looks too
> >> ugly to live. Or I can allocate some placeholder.
> > 
> > Yeah, I was looking at something similar.  Can you guarantee that the
> > context will only take one or two bits?  (Looks like it only needs one
> > bit ATM, even though at the moment you're storing the whole GFP mask,
> > correct?)
> > 
> >> But before going to play with that I am really wondering whether we need
> >> all this with no journal at all. AFAIU what Jack told me it is the
> >> journal lock(s) which is the biggest problem from the reclaim recursion
> >> point of view. What would cause a deadlock in no journal mode?
> > 
> > We still have the original problem for why we need GFP_NOFS even in
> > ext2.  If we are in a writeback path, and we need to allocate memory,
> > we don't want to recurse back into the file system's writeback path.
> > Certainly not for the same inode, and while we could make it work if
> > the mm was writing back another inode, or another superblock, there
> > are also stack depth considerations that would make this be a bad
> > idea.  So we do need to be able to assert GFP_NOFS even in no journal
> > mode, and for any file system including ext2, for that matter.
> > 
> > Because of the fact that we're going to have to play games with
> > current->journal_info, maybe this is something that I should take
> > responsibility for, and to go through the ext4 tree after the main
> > patch series go through?  Maybe you could use xfs and ext2 as sample
> > (simple) implementations?
> > 
> > My only ask is that the memalloc nofs context be a well defined N
> > bits, where N < 16, and I'll find some place to put them (probably
> > journal_info).
> 
> I think Dave was suggesting that the NOFS context allow a pointer to
> an arbitrary struct, so that it is possible to dereference this in
> the filesystem itself to determine if the recursion is safe or not.

Yes, but can we start with a simpler approach first? Even this approach
takes quite some time to be used.

Patch

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index fd389935ecd1..9e98092c2a4b 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -32,7 +32,7 @@  ext4_acl_from_disk(const void *value, size_t size)
 		return ERR_PTR(-EINVAL);
 	if (count == 0)
 		return NULL;
-	acl = posix_acl_alloc(count, GFP_NOFS);
+	acl = posix_acl_alloc(count, GFP_KERNEL);
 	if (!acl)
 		return ERR_PTR(-ENOMEM);
 	for (n = 0; n < count; n++) {
@@ -94,7 +94,7 @@  ext4_acl_to_disk(const struct posix_acl *acl, size_t *size)
 
 	*size = ext4_acl_size(acl->a_count);
 	ext_acl = kmalloc(sizeof(ext4_acl_header) + acl->a_count *
-			sizeof(ext4_acl_entry), GFP_NOFS);
+			sizeof(ext4_acl_entry), GFP_KERNEL);
 	if (!ext_acl)
 		return ERR_PTR(-ENOMEM);
 	ext_acl->a_version = cpu_to_le32(EXT4_ACL_VERSION);
@@ -159,7 +159,7 @@  ext4_get_acl(struct inode *inode, int type)
 	}
 	retval = ext4_xattr_get(inode, name_index, "", NULL, 0);
 	if (retval > 0) {
-		value = kmalloc(retval, GFP_NOFS);
+		value = kmalloc(retval, GFP_KERNEL);
 		if (!value)
 			return ERR_PTR(-ENOMEM);
 		retval = ext4_xattr_get(inode, name_index, "", value, retval);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9867b9e5ad8f..0371e7aa7bea 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2933,7 +2933,7 @@  int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 				le16_to_cpu(path[k].p_hdr->eh_entries)+1;
 	} else {
 		path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
-			       GFP_NOFS);
+			       GFP_KERNEL);
 		if (path == NULL) {
 			ext4_journal_stop(handle);
 			return -ENOMEM;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index cf681004b196..e121f4e048b8 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -816,7 +816,7 @@  static int add_new_gdb(handle_t *handle, struct inode *inode,
 
 	n_group_desc = ext4_kvmalloc((gdb_num + 1) *
 				     sizeof(struct buffer_head *),
-				     GFP_NOFS);
+				     GFP_KERNEL);
 	if (!n_group_desc) {
 		err = -ENOMEM;
 		ext4_warning(sb, "not enough memory for %lu groups",
@@ -943,7 +943,7 @@  static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
 	int res, i;
 	int err;
 
-	primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_NOFS);
+	primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
 	if (!primary)
 		return -ENOMEM;
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 5a94fa52b74f..172317462238 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -875,7 +875,7 @@  ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 			unlock_buffer(bs->bh);
 			ea_bdebug(bs->bh, "cloning");
-			s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+			s->base = kmalloc(bs->bh->b_size, GFP_KERNEL);
 			error = -ENOMEM;
 			if (s->base == NULL)
 				goto cleanup;
@@ -887,7 +887,7 @@  ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kzalloc(sb->s_blocksize, GFP_NOFS);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)