diff mbox

[RFC] ext4: increase the protection of drop nlink and ext4 inode destroy

Message ID 1482755657-28791-1-git-send-email-yi.zhang@huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

Zhang Yi Dec. 26, 2016, 12:34 p.m. UTC
Because of the disk and hardware issue, the ext4 filesystem have 
many errors, the inode->i_nlink of ext4 becomes zero abnormally 
but the dentry is still positive, it will cause memory corruption 
after the following process:

 1) Due to the inode->i_nlink is 0, this inode will be added into
the orhpan list,
 2) ext4_rename() cover this inode, and drop_nlink() will reverse
the inode->i_nlink to 0xFFFFFFFF,
 3) iput() add this inode to LRU,
 4) evict() will call destroy_inode() to destroy this inode but
skip removing it from the orphan list,
 5) after this, the inode's memory address space will be used by
other module, when the ext4 filesystem change the orphan list, it will
trample other module's data and then may cause oops.

Although we cannot avoid hardware and disk errors, we can control the
softwore error in the ext4 module, do not affect other modules and
increase the difficulty of locating problems.

This patch avoid inode->i_nlink reverse and remove the inode form the
orphan list when destroy it if the list is not empty.
Signed-off-by: yi zhang <yi.zhang@huawei.com>
---
 fs/ext4/super.c | 1 +
 fs/inode.c      | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Andreas Dilger Dec. 26, 2016, 6:32 p.m. UTC | #1
On Dec 26, 2016, at 5:34 AM, yi zhang <yi.zhang@huawei.com> wrote:
> 
> Because of the disk and hardware issue, the ext4 filesystem have
> many errors, the inode->i_nlink of ext4 becomes zero abnormally
> but the dentry is still positive, it will cause memory corruption
> after the following process:
> 
> 1) Due to the inode->i_nlink is 0, this inode will be added into
> the orhpan list,
> 2) ext4_rename() cover this inode, and drop_nlink() will reverse
> the inode->i_nlink to 0xFFFFFFFF,
> 3) iput() add this inode to LRU,
> 4) evict() will call destroy_inode() to destroy this inode but
> skip removing it from the orphan list,
> 5) after this, the inode's memory address space will be used by
> other module, when the ext4 filesystem change the orphan list, it will
> trample other module's data and then may cause oops.
> 
> Although we cannot avoid hardware and disk errors, we can control the
> softwore error in the ext4 module, do not affect other modules and
> increase the difficulty of locating problems.
> 
> This patch avoid inode->i_nlink reverse and remove the inode form the

(typo) s/form/from/

> orphan list when destroy it if the list is not empty.
> Signed-off-by: yi zhang <yi.zhang@huawei.com>
> ---
> fs/ext4/super.c | 1 +
> fs/inode.c      | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 52b0530..617327e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -975,6 +975,7 @@ static void ext4_destroy_inode(struct inode *inode)
> 				EXT4_I(inode), sizeof(struct ext4_inode_info),
> 				true);
> 		dump_stack();
> +		ext4_orphan_del(NULL, inode);
> 	}
> 	call_rcu(&inode->i_rcu, ext4_i_callback);
> }
> diff --git a/fs/inode.c b/fs/inode.c
> index 88110fd..079d383 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -279,7 +279,10 @@ static void destroy_inode(struct inode *inode)
> */
> void drop_nlink(struct inode *inode)
> {
> -	WARN_ON(inode->i_nlink == 0);
> +	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
> +		" is already 0", inode->i_ino))

(style) the string should be kept on a single line instead of being
split, especially since it can fit easily.

(defect) this needs to have a newline.

	if (WARN(inode->i_nlink == 0,
		 "inode %lu nlink is already 0\n", inode->i_ino))

Cheers, Andreas

> +		return;
> +
> 	inode->__i_nlink--;
> 	if (!inode->i_nlink)
> 		atomic_long_inc(&inode->i_sb->s_remove_count);
> --
> 2.5.0
> 


Cheers, Andreas
Valdis Klētnieks Dec. 31, 2016, 10:59 p.m. UTC | #2
On Mon, 26 Dec 2016 20:34:17 +0800, yi zhang said:
> Because of the disk and hardware issue, the ext4 filesystem have
> many errors, the inode->i_nlink of ext4 becomes zero abnormally
> but the dentry is still positive, it will cause memory corruption
> after the following process:
>
>  1) Due to the inode->i_nlink is 0, this inode will be added into
> the orhpan list,

> +	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
> +		" is already 0", inode->i_ino))

Can we get the filesystem? Or at least the device major/minor? If a system
has multiple large ext4 filesystems, it would be helpful to know which
one is having the problem.
Zhang Yi Jan. 4, 2017, 8:29 a.m. UTC | #3
On 2017/1/1 6:59, Valdis.Kletnieks@vt.edu said:
> On Mon, 26 Dec 2016 20:34:17 +0800, yi zhang said:
>> Because of the disk and hardware issue, the ext4 filesystem have
>> many errors, the inode->i_nlink of ext4 becomes zero abnormally
>> but the dentry is still positive, it will cause memory corruption
>> after the following process:
>>
>>  1) Due to the inode->i_nlink is 0, this inode will be added into
>> the orhpan list,
> 
>> +	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
>> +		" is already 0", inode->i_ino))
> 
> Can we get the filesystem? Or at least the device major/minor? If a system
> has multiple large ext4 filesystems, it would be helpful to know which
> one is having the problem.
> 

        if (WARN(inode->i_nlink == 0,
-               "inode %lu nlink is already 0", inode->i_ino))
+               "inode %lu nlink is already 0, dev=%u:%u",
+               inode->i_ino, MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev)))
                return;

We can modify as above, it's enough to know which filesystem is having the
problem, what do you think?

yi zhang

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 4, 2017, 9:54 p.m. UTC | #4
On Wed, Jan 04, 2017 at 04:29:33PM +0800, zhangyi (F) wrote:
> On 2017/1/1 6:59, Valdis.Kletnieks@vt.edu said:
> > On Mon, 26 Dec 2016 20:34:17 +0800, yi zhang said:
> >> Because of the disk and hardware issue, the ext4 filesystem have
> >> many errors, the inode->i_nlink of ext4 becomes zero abnormally
> >> but the dentry is still positive, it will cause memory corruption
> >> after the following process:
> >>
> >>  1) Due to the inode->i_nlink is 0, this inode will be added into
> >> the orhpan list,
> > 
> >> +	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
> >> +		" is already 0", inode->i_ino))
> > 
> > Can we get the filesystem? Or at least the device major/minor? If a system
> > has multiple large ext4 filesystems, it would be helpful to know which
> > one is having the problem.
> > 
> 
>         if (WARN(inode->i_nlink == 0,
> -               "inode %lu nlink is already 0", inode->i_ino))
> +               "inode %lu nlink is already 0, dev=%u:%u",
> +               inode->i_ino, MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev)))
>                 return;
> 
> We can modify as above, it's enough to know which filesystem is having the
> problem, what do you think?

Why not:

if (inode->i_nlink == 0) {
	ext4_warning_inode(inode, "nlink is already 0");
	return;
}

?

--D

> 
> yi zhang
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Jan. 4, 2017, 10 p.m. UTC | #5
On Jan 4, 2017, at 2:54 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Wed, Jan 04, 2017 at 04:29:33PM +0800, zhangyi (F) wrote:
>> On 2017/1/1 6:59, Valdis.Kletnieks@vt.edu said:
>>> On Mon, 26 Dec 2016 20:34:17 +0800, yi zhang said:
>>>> Because of the disk and hardware issue, the ext4 filesystem have
>>>> many errors, the inode->i_nlink of ext4 becomes zero abnormally
>>>> but the dentry is still positive, it will cause memory corruption
>>>> after the following process:
>>>> 
>>>> 1) Due to the inode->i_nlink is 0, this inode will be added into
>>>> the orhpan list,
>>> 
>>>> +	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
>>>> +		" is already 0", inode->i_ino))
>>> 
>>> Can we get the filesystem? Or at least the device major/minor? If a system
>>> has multiple large ext4 filesystems, it would be helpful to know which
>>> one is having the problem.
>>> 
>> 
>>        if (WARN(inode->i_nlink == 0,
>> -               "inode %lu nlink is already 0", inode->i_ino))
>> +               "inode %lu nlink is already 0, dev=%u:%u",
>> +               inode->i_ino, MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev)))
>>                return;
>> 
>> We can modify as above, it's enough to know which filesystem is having the
>> problem, what do you think?
> 
> Why not:
> 
> if (inode->i_nlink == 0) {
> 	ext4_warning_inode(inode, "nlink is already 0");
> 	return;
> }

I like this better as well.

Cheers, Andreas
Theodore Ts'o Jan. 4, 2017, 11:35 p.m. UTC | #6
On Wed, Jan 04, 2017 at 01:54:24PM -0800, Darrick J. Wong wrote:
> 
> if (inode->i_nlink == 0) {
> 	ext4_warning_inode(inode, "nlink is already 0");
> 	return;
> }

We can't do that because the place where Zhangyi is proposing to
change is in fs/inode.c:drop_nlink(), so we can't add a call to
ext4_error() or ext4_warning().

So how exactly how did we get into this state?  When we read the inode
into memory, if i_nlink is zero, we declare the file system as
corrupted immediately.

So I assume this is happening the on-disk i_links_count (which is read
into inode->i_nlink) was too low.  So I think the way we should be
handling this is in unlink and rename, before we let i_nlink drop to
zero, we need to check to see if there are other dcache entries
pointing at the inode.  If so, we need to call ext4_error(), and in
the errors=continue case, return EFSCORRUPTED (aka EUCLEAN).

    		    	  	 	      - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zhang Yi Jan. 11, 2017, 9:07 a.m. UTC | #7
on 2017/1/5 7:35, Theodore Ts'o wrote:
> On Wed, Jan 04, 2017 at 01:54:24PM -0800, Darrick J. Wong wrote:
>>
>> if (inode->i_nlink == 0) {
>> 	ext4_warning_inode(inode, "nlink is already 0");
>> 	return;
>> }
> 
> We can't do that because the place where Zhangyi is proposing to
> change is in fs/inode.c:drop_nlink(), so we can't add a call to
> ext4_error() or ext4_warning().
> 
> So how exactly how did we get into this state?  When we read the inode
> into memory, if i_nlink is zero, we declare the file system as
> corrupted immediately.
> 
> So I assume this is happening the on-disk i_links_count (which is read
> into inode->i_nlink) was too low.  So I think the way we should be
> handling this is in unlink and rename, before we let i_nlink drop to
> zero, we need to check to see if there are other dcache entries
> pointing at the inode.  If so, we need to call ext4_error(), and in
> the errors=continue case, return EFSCORRUPTED (aka EUCLEAN).
> 
>     		    	  	 	      - Ted
> 

Hi Theodore:

The i_nlink underflow and memory corruption problem on ext4fs remains inconclusive.

You suggest we can check dcache entries when i_nlink drop to zero in unlink and
rename. But I think it may still have some problems, assume the following situation:

(1) The file we want to unlink have many hard links, but only one dcache entry in memory.
(2) open this file, but it's inode->i_nlink read from disk was 1 (too low).
(3) some one call rename and drop it's i_nlink to zero.
(4) it's inode is still in use and do not destroy (not closed), at the same time,
    some others open it's hard link and create a dcache entry.
(5) call rename again and it's i_nlink will still underflow and cause memory corruption.

For simplicity, I think we can add underflow protection in ext4_rename or
drop_nlink as V2 and V3 patch wrote. What do you think?

yi zhang

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 11, 2017, 3:34 p.m. UTC | #8
On Wed, Jan 11, 2017 at 05:07:29PM +0800, zhangyi (F) wrote:
> 
> (1) The file we want to unlink have many hard links, but only one dcache entry in memory.
> (2) open this file, but it's inode->i_nlink read from disk was 1 (too low).
> (3) some one call rename and drop it's i_nlink to zero.
> (4) it's inode is still in use and do not destroy (not closed), at the same time,
>     some others open it's hard link and create a dcache entry.
> (5) call rename again and it's i_nlink will still underflow and cause memory corruption.

Do you have reproducers that make it easy to reproduce situations like
this?  (It shouldn't be hard to write, but if you have them already
will save me some effort.  :-)

If we ever get passed an inode to ext4_file_open() where i_nlink is
zero, we can declare the file system is corrupt by calling
ext4_error() to report the problem.

Similarly, whenever we are passed a dentry pointing to an inode for
link, unlink, rename, and other methods in the inode_operations
structure, by definition the file system is corrupt, and again we
should report this using ext4_error().

So I don't think we should think of this as adding "underflow
protection"; instead we should think of it as adding more aggressive
detection of file system inconsistencies.  If there is dentry which is
valid, and pointing at an inode where n_links is zero, something has
gone seriously wrong.  So we should call ext4_error() to report the
file system inconsistency, and then return EFSCORRUPTED (aka EUCLEAN).

Since we would be doing this in a number of places, we should probably
add an inline function:

static inline int ext4_validate_dentry(struct dentry *dentry);

which returns 0 if the dentry is valid, and calls ext4_error_inode()
and returns -EFSCORRUPTED if the dentry points to an inode with a zero
i_nlink.

(Note: it's valid for i_nlinks to be zero if the system call started
with a file descriptor, such as read(2) or write(2) operating on a
file which is still deleted but has open file descriptors.  But if the
user has passed a pathname to the system call, such as in the case of
open(), rename(), unlink(), rmdir(), etc, then the dentry had better
be pointing at an inode with a non-zero i_nlink call.  We need to be a
bit careful if the method function could be called by both a pathname
and file descriptor variant of the system call --- for example
fsetxattr(2) and setxattr(2); we won't be able to use
ext4_validate_dentry() for those inode_operations calls.)

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zhang Yi Jan. 12, 2017, 8 a.m. UTC | #9
on 2017/1/11 23:34, Theodore Ts'o wrote:
> On Wed, Jan 11, 2017 at 05:07:29PM +0800, zhangyi (F) wrote:
>>
>> (1) The file we want to unlink have many hard links, but only one dcache entry in memory.
>> (2) open this file, but it's inode->i_nlink read from disk was 1 (too low).
>> (3) some one call rename and drop it's i_nlink to zero.
>> (4) it's inode is still in use and do not destroy (not closed), at the same time,
>>     some others open it's hard link and create a dcache entry.
>> (5) call rename again and it's i_nlink will still underflow and cause memory corruption.
> 
> Do you have reproducers that make it easy to reproduce situations like
> this?  (It shouldn't be hard to write, but if you have them already
> will save me some effort.  :-)
> 
> If we ever get passed an inode to ext4_file_open() where i_nlink is
> zero, we can declare the file system is corrupt by calling
> ext4_error() to report the problem.
> 
> Similarly, whenever we are passed a dentry pointing to an inode for
> link, unlink, rename, and other methods in the inode_operations
> structure, by definition the file system is corrupt, and again we
> should report this using ext4_error().
> 
> So I don't think we should think of this as adding "underflow
> protection"; instead we should think of it as adding more aggressive
> detection of file system inconsistencies.  If there is dentry which is
> valid, and pointing at an inode where n_links is zero, something has
> gone seriously wrong.  So we should call ext4_error() to report the
> file system inconsistency, and then return EFSCORRUPTED (aka EUCLEAN).
> 
> Since we would be doing this in a number of places, we should probably
> add an inline function:
> 
> static inline int ext4_validate_dentry(struct dentry *dentry);
> 
> which returns 0 if the dentry is valid, and calls ext4_error_inode()
> and returns -EFSCORRUPTED if the dentry points to an inode with a zero
> i_nlink.
> 

Thanks for your advice, it can detect file system inconsistency and reprot
error more effective. :-)

At the same time, I think other file systems may have the same problem, do
you think we should put these detections on the VFS layer? Thus other file
systems no need to do the same things, but the disadvantage is that we can
not call ext4_error to report ext4 inconsistency.

yi zhang

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 12, 2017, 5:03 p.m. UTC | #10
On Thu, Jan 12, 2017 at 04:00:16PM +0800, zhangyi (F) wrote:
> 
> At the same time, I think other file systems may have the same problem, do
> you think we should put these detections on the VFS layer? Thus other file
> systems no need to do the same things, but the disadvantage is that we can
> not call ext4_error to report ext4 inconsistency.

There are file systems which don't have inodes per-se where the
i_nlinks could be a something which is simulated by the file system.
So it's not *necessarily* an on-disk inconsistency.

We'll have to see if Al and other file system developers are
agreeable, but one thing that we could do is to do the detection in
the VFS layer (which it is actually easier to do), and if they find an
issue, they can just pass a report via a callback function found in
the struct_operations structure.  If there isn't such a function
defined, or the function returns 0, the VFS could just do nothing; if
it returns an error code, then that would get reflected back up to
userspace, plus whatever other action the file system sees fit to do.

	   		       	      	  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Jan. 13, 2017, 3:42 a.m. UTC | #11
On Thu, Jan 12, 2017 at 12:03:28PM -0500, Theodore Ts'o wrote:
> On Thu, Jan 12, 2017 at 04:00:16PM +0800, zhangyi (F) wrote:
> > 
> > At the same time, I think other file systems may have the same problem, do
> > you think we should put these detections on the VFS layer? Thus other file
> > systems no need to do the same things, but the disadvantage is that we can
> > not call ext4_error to report ext4 inconsistency.
> 
> There are file systems which don't have inodes per-se where the
> i_nlinks could be a something which is simulated by the file system.
> So it's not *necessarily* an on-disk inconsistency.
> 
> We'll have to see if Al and other file system developers are
> agreeable, but one thing that we could do is to do the detection in
> the VFS layer (which it is actually easier to do), and if they find an
> issue, they can just pass a report via a callback function found in
> the struct_operations structure.  If there isn't such a function
> defined, or the function returns 0, the VFS could just do nothing; if
> it returns an error code, then that would get reflected back up to
> userspace, plus whatever other action the file system sees fit to do.

	Detection of what?  Zero ->i_nlink on inode of dentry that passes e.g.
may_delete()?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 13, 2017, 2:26 p.m. UTC | #12
On Fri, Jan 13, 2017 at 03:42:19AM +0000, Al Viro wrote:
> On Thu, Jan 12, 2017 at 12:03:28PM -0500, Theodore Ts'o wrote:
> > On Thu, Jan 12, 2017 at 04:00:16PM +0800, zhangyi (F) wrote:
> > > 
> > > At the same time, I think other file systems may have the same problem, do
> > > you think we should put these detections on the VFS layer? Thus other file
> > > systems no need to do the same things, but the disadvantage is that we can
> > > not call ext4_error to report ext4 inconsistency.
> > 
> > There are file systems which don't have inodes per-se where the
> > i_nlinks could be a something which is simulated by the file system.
> > So it's not *necessarily* an on-disk inconsistency.
> > 
> > We'll have to see if Al and other file system developers are
> > agreeable, but one thing that we could do is to do the detection in
> > the VFS layer (which it is actually easier to do), and if they find an
> > issue, they can just pass a report via a callback function found in
> > the struct_operations structure.  If there isn't such a function
> > defined, or the function returns 0, the VFS could just do nothing; if
> > it returns an error code, then that would get reflected back up to
> > userspace, plus whatever other action the file system sees fit to do.
> 
> 	Detection of what?  Zero ->i_nlink on inode of dentry that passes e.g.
> may_delete()?

Or other impossible cases where there is a valid dentry pointing at an
inode with zero i_nlink.  I am fairly sure this should **never**
happen in the case of unlink(2), rmdir(2), or by the time we call
file_ops->open(), and if it does, it indicates that the underlying
on-disk file system (at least for ext4) is corrupt.

Am I missing a case?  And do you have an opinion about whether we
should be trying to do this check at the VFS layer versus inside ext4?

Thanks,

					- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zhang Yi Jan. 16, 2017, 3:24 a.m. UTC | #13
on 2017/1/11 23:34, Theodore Ts'o wrote:
> On Wed, Jan 11, 2017 at 05:07:29PM +0800, zhangyi (F) wrote:
>>
>> (1) The file we want to unlink have many hard links, but only one dcache entry in memory.
>> (2) open this file, but it's inode->i_nlink read from disk was 1 (too low).
>> (3) some one call rename and drop it's i_nlink to zero.
>> (4) it's inode is still in use and do not destroy (not closed), at the same time,
>>     some others open it's hard link and create a dcache entry.
>> (5) call rename again and it's i_nlink will still underflow and cause memory corruption.
> 
> Do you have reproducers that make it easy to reproduce situations like
> this?  (It shouldn't be hard to write, but if you have them already
> will save me some effort.  :-)
> 

I make a reproducer, we can do the following steps to reproduce this probrem easily:
1) mount a ext4 file system, and create 3 files and 1 hard link,

    #mount /dev/sdax /mnt
    #cd /mnt
    #touch old_file1 old_file2 new_file
    #ln new_file new_link1

2) umount the file system and use the debugfs to change new_file's
   links_count value to 1, which is used to simulate the fs inconsistency,

   #umount /mnt
   #debugfs /dev/sdax -w
	set_inode_field new_file links_count 1

3) mount the fs again, and then execute the following program (Note:
   do not execute the ls cmd, it will create the second dcache entry),

   #define RENAME_OLD_FILE_1  "old_file1"
   #define RENAME_OLD_FILE_2  "old_file2"
   #define RENAME_NEW_FILE    "new_file"
   #define NEW_FILE_LINK_1    "new_link1"

   int main(int argc, char *argv[])
   {
        int fd = 0;
        int err = 0;

        fd = open(RENAME_NEW_FILE, O_RDONLY);
        if (fd < 0) {
                printf("open error:%d\n", errno);
                return -1;
        }

        err = rename(RENAME_OLD_FILE_1, RENAME_NEW_FILE);
        if (err < 0) {
                printf("rename error:%d\n", errno);
                close(fd);
                return -1;
        }

        err = rename(RENAME_OLD_FILE_2, NEW_FILE_LINK_1);
        if (err < 0) {
                printf("rename error:%d\n", errno);
                close(fd);
                return -1;
        }

        close(fd);
        return 0;
   }

4) after this, the new_file's inode->i_nlink is underflowed and add to orphan list,
   kernel dump like this:

    ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 1814 at fs/inode.c:282 drop_nlink+0x3e/0x50
   ...
   Call Trace:
   dump_stack+0x63/0x86
   __warn+0xcb/0xf0
   warn_slowpath_null+0x1d/0x20
   drop_nlink+0x3e/0x50
   ext4_rename+0x532/0x8c0
   ext4_rename2+0x1d/0x30
   vfs_rename+0x728/0x940
    ? __lookup_hash+0x20/0xa0
    SyS_rename+0x3ba/0x3e0
    entry_SYSCALL_64_fastpath+0x1a/0xa9
   ...
    ---[ end trace b157dacbc891e6e8 ]---

5) then, we trigger mem shrink, this inode will be destroyed but it is still
   on the orphan list,

   #echo 3 > /proc/sys/vm/drop_caches

   kernrl dump:

   EXT4-fs (sdb1): Inode 16 (ffff98f4b3285c20): orphan list check failed!
   ...
   ffff98f4b3285d30: fa87e800 ffff98f4 b3285e80 ffff98f4  .........^(.....
   ffff98f4b3285d40: b20829d8 ffff98f4 00000010 00000000  .)..............
   ffff98f4b3285d50: ffffffff 00000000 00000000 00000000  ................
   ...
   Call Trace:
    dump_stack+0x63/0x86
    ext4_destroy_inode+0xa0/0xb0
    destroy_inode+0x3b/0x60
    evict+0x130/0x1c0
    dispose_list+0x4d/0x70
    prune_icache_sb+0x5a/0x80
    super_cache_scan+0x14b/0x1a0
    shrink_slab.part.40+0x1f5/0x420
    shrink_slab+0x29/0x30
    drop_slab_node+0x31/0x60
    drop_slab+0x3f/0x70
    drop_caches_sysctl_handler+0x71/0xc0
    proc_sys_call_handler+0xea/0x110
    proc_sys_write+0x14/0x20
    __vfs_write+0x37/0x160
    ? selinux_file_permission+0xd7/0x110
    ? security_file_permission+0x3b/0xc0
    vfs_write+0xb5/0x1a0
    SyS_write+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa9
   ...
   bash (1594): drop_caches: 3

6) Some time later, if we change the orphan list, it will cause memory corruption.

Thanks.

zhangyi

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 52b0530..617327e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -975,6 +975,7 @@  static void ext4_destroy_inode(struct inode *inode)
 				EXT4_I(inode), sizeof(struct ext4_inode_info),
 				true);
 		dump_stack();
+		ext4_orphan_del(NULL, inode);
 	}
 	call_rcu(&inode->i_rcu, ext4_i_callback);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd..079d383 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -279,7 +279,10 @@  static void destroy_inode(struct inode *inode)
  */
 void drop_nlink(struct inode *inode)
 {
-	WARN_ON(inode->i_nlink == 0);
+	if (WARN(inode->i_nlink == 0, "inode %lu nlink"
+		" is already 0", inode->i_ino))
+		return;
+
 	inode->__i_nlink--;
 	if (!inode->i_nlink)
 		atomic_long_inc(&inode->i_sb->s_remove_count);