btrfs: add delayed_iput list head to btrfs inode

Message ID 511180E4.2020600@redhat.com (mailing list archive)
State New, archived
Commit Message

Eric Sandeen Feb. 5, 2013, 10 p.m. UTC
Following the lead from Jeff Mahoney's comment in the code:

/* JDM: If this is fs-wide, why can't we add a pointer to
 * btrfs_inode instead and avoid the allocation? */

Remove the NOFAIL kmalloc in btrfs_add_delayed_iput(), and just
use a list head in the btrfs inode.

This does grow the btrfs inode by 16 bytes, but doesn't change
slab cache utilization on my machine.  Rearranging the btrfs
inode could get back 8 bytes or so if people are worried about it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
---
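For reference, the 16-byte figure is what an embedded struct list_head costs on a 64-bit build: two pointers of 8 bytes each. A minimal userspace sketch, assuming the usual two-pointer layout; the helper name list_head_size() is made up for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the usual two-pointer layout of struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Two pointers, so 16 bytes where pointers are 8 bytes wide. */
_Static_assert(sizeof(struct list_head) == 2 * sizeof(void *),
	       "list_head is expected to be exactly two pointers");

/* Hypothetical helper: bytes an embedded list_head adds before padding. */
static size_t list_head_size(void)
{
	return sizeof(struct list_head);
}
```

How much the containing struct actually grows still depends on alignment and on neighbouring fields, which is why rearranging the struct might recover some of it.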


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Zach Brown Feb. 5, 2013, 11:14 p.m. UTC | #1
> +	struct btrfs_inode *b_inode = BTRFS_I(inode);
> +	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
>  
>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>  		return;
>  
> -	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
> -	delayed->inode = inode;
> -
>  	spin_lock(&fs_info->delayed_iput_lock);
> -	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
> +	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
>  	spin_unlock(&fs_info->delayed_iput_lock);
>  }

Hmm.  I'm not great with inode life cycles, but isn't this only safe if
someone else can't get an i_count reference while this is in flight?  It
looks like the final iput does the unhashing, and so on, so couldn't an
iget/iput race with this and try to add the inode's list_head twice?

- z
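
The double-add concern above can be illustrated outside the kernel. The sketch below is a userspace reimplementation of the list primitives (the same pointer updates as list_add_tail(), without the kernel's debug checks), and node_self_linked_after_double_add() is a hypothetical name. It does not model the iget/iput race itself, only what happens to the list if the same embedded list_head is queued a second time before it has been removed:

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert 'entry' just before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = entry;
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
}

/* Returns 1 if queueing the same node twice leaves it linked to itself. */
static int node_self_linked_after_double_add(void)
{
	struct list_head head, node;

	INIT_LIST_HEAD(&head);
	INIT_LIST_HEAD(&node);

	list_add_tail(&node, &head);
	/* After one add the list is well formed: head <-> node <-> head. */
	assert(head.next == &node && node.next == &head);

	list_add_tail(&node, &head);	/* same list_head queued again */

	return node.next == &node && node.prev == &node && head.next == &node;
}
```

After the second add the node points at itself while the head still points at the node, so a walk starting at the head never gets back to the head, and removing the node does not make the head look empty.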
Liu Bo Feb. 6, 2013, 2:08 a.m. UTC | #2
On Tue, Feb 05, 2013 at 03:14:05PM -0800, Zach Brown wrote:
> > +	struct btrfs_inode *b_inode = BTRFS_I(inode);
> > +	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
> >  
> >  	if (atomic_add_unless(&inode->i_count, -1, 1))
> >  		return;
> >  
> > -	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
> > -	delayed->inode = inode;
> > -
> >  	spin_lock(&fs_info->delayed_iput_lock);
> > -	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
> > +	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
> >  	spin_unlock(&fs_info->delayed_iput_lock);
> >  }
> 
> Hmm.  I'm not great with inode life cycles, but isn't this only safe if
> someone else can't get an i_count reference while this is in flight?  It
> looks like the final iput does the unhashing, and so on, so couldn't an
> iget/iput race with this and try to add the inode's list_head twice?

Yeah, same concern here.  Basically this will result in inodes still being
in use on unmount.

Actually I did a similar one, here is some discussion:

https://patchwork.kernel.org/patch/1824711/

thanks,
liubo
Eric Sandeen Feb. 6, 2013, 2:14 p.m. UTC | #3
On Feb 5, 2013, at 8:11 PM, Liu Bo <bo.li.liu@oracle.com> wrote:

> On Tue, Feb 05, 2013 at 03:14:05PM -0800, Zach Brown wrote:
>>> +    struct btrfs_inode *b_inode = BTRFS_I(inode);
>>> +    struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
>>> 
>>>    if (atomic_add_unless(&inode->i_count, -1, 1))
>>>        return;
>>> 
>>> -    delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
>>> -    delayed->inode = inode;
>>> -
>>>    spin_lock(&fs_info->delayed_iput_lock);
>>> -    list_add_tail(&delayed->list, &fs_info->delayed_iputs);
>>> +    list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
>>>    spin_unlock(&fs_info->delayed_iput_lock);
>>> }
>> 
>> Hmm.  I'm not great with inode life cycles, but isn't this only safe if
>> someone else can't get an i_count reference while this is in flight?  It
>> looks like the final iput does the unhashing, and so on, so couldn't an
>> iget/iput race with this and try to add the inode's list_head twice?
> 
> Yeah, same concern here.  Basically this will result in inodes still being
> in use on unmount.
> 
> Actually I did a similar one, here is some discussion:
> 
> https://patchwork.kernel.org/patch/1824711/
> 
Ok, thanks all.  We should remove Jeff's comment then, it sure sounded like a good idea...

Eric

> thanks,
> liubo
Eric Sandeen Feb. 6, 2013, 3:53 p.m. UTC | #4
On 2/5/13 8:08 PM, Liu Bo wrote:
> On Tue, Feb 05, 2013 at 03:14:05PM -0800, Zach Brown wrote:
>>> +	struct btrfs_inode *b_inode = BTRFS_I(inode);
>>> +	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
>>>  
>>>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>>>  		return;
>>>  
>>> -	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
>>> -	delayed->inode = inode;
>>> -
>>>  	spin_lock(&fs_info->delayed_iput_lock);
>>> -	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
>>> +	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
>>>  	spin_unlock(&fs_info->delayed_iput_lock);
>>>  }
>>
>> Hmm.  I'm not great with inode life cycles, but isn't this only safe if
>> someone else can't get an i_count reference while this is in flight?  It
>> looks like the final iput does the unhashing, and so on, so couldn't an
>> iget/iput race with this and try to add the inode's list_head twice?
> 
> Yeah, same concern here.  Basically this will result in inodes still being
> in use on unmount.
> 
> Actually I did a similar one, here is some discussion:
> 
> https://patchwork.kernel.org/patch/1824711/

I read it, thanks.  Did you try the counter approach?

-Eric

> thanks,
> liubo
Liu Bo Feb. 6, 2013, 4:02 p.m. UTC | #5
On Wed, Feb 06, 2013 at 09:53:05AM -0600, Eric Sandeen wrote:
> On 2/5/13 8:08 PM, Liu Bo wrote:
> > On Tue, Feb 05, 2013 at 03:14:05PM -0800, Zach Brown wrote:
> >>> +	struct btrfs_inode *b_inode = BTRFS_I(inode);
> >>> +	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
> >>>  
> >>>  	if (atomic_add_unless(&inode->i_count, -1, 1))
> >>>  		return;
> >>>  
> >>> -	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
> >>> -	delayed->inode = inode;
> >>> -
> >>>  	spin_lock(&fs_info->delayed_iput_lock);
> >>> -	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
> >>> +	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
> >>>  	spin_unlock(&fs_info->delayed_iput_lock);
> >>>  }
> >>
> >> Hmm.  I'm not great with inode life cycles, but isn't this only safe if
> >> someone else can't get an i_count reference while this is in flight?  It
> >> looks like the final iput does the unhashing, and so on, so couldn't an
> >> iget/iput race with this and try to add the inode's list_head twice?
> > 
> > Yeah, same concern here.  Basically this will result in inodes still being
> > in use on unmount.
> > 
> > Actually I did a similar one, here is some discussion:
> > 
> > https://patchwork.kernel.org/patch/1824711/
> 
> I read it, thanks.  Did you try the counter approach?

Yes, it'll bring a tradeoff situation.

With counter, we need to lock the list all the time instead of
doing a splice on the list and unlocking it.  I think splice would be
faster, so I didn't go further (I MIGHT be wrong on this).

thanks,
liubo
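
The tradeoff described above can be sketched in userspace C. This is one plausible reading of the counter approach, not code from either patch: the per-inode `pending` field, the `fake_lock` type (which only counts acquisitions and is not a real lock), `demo_inode`, `demo_fs`, and `drain_locks()` are all assumptions made for illustration. With a splice, the lock is taken once and the private list is walked unlocked; with a counter, the count and the list membership have to change together, so the lock is taken for every iput:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;
	head->prev = entry;
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
}

static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	INIT_LIST_HEAD(entry);
}

/* Move every entry of 'from' to the front of 'to'; 'from' ends up empty. */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
	if (list_empty(from))
		return;
	from->next->prev = to;
	from->prev->next = to->next;
	to->next->prev = from->prev;
	to->next = from->next;
	INIT_LIST_HEAD(from);
}

/* Stand-in for delayed_iput_lock: single-threaded, only counts acquisitions. */
struct fake_lock { unsigned long acquisitions; };
static void fake_lock_acquire(struct fake_lock *l) { l->acquisitions++; }
static void fake_lock_release(struct fake_lock *l) { (void)l; }

/* 'pending' is a hypothetical per-inode count of queued iputs. */
struct demo_inode { struct list_head delayed_iput; unsigned int pending, puts; };
struct demo_fs { struct fake_lock lock; struct list_head delayed_iputs; };

#define inode_of(p) \
	((struct demo_inode *)((char *)(p) - offsetof(struct demo_inode, delayed_iput)))

/* Splice: one lock round trip, then the private list is walked unlocked. */
static void run_delayed_splice(struct demo_fs *fs)
{
	struct list_head local;

	INIT_LIST_HEAD(&local);
	fake_lock_acquire(&fs->lock);
	list_splice_init(&fs->delayed_iputs, &local);
	fake_lock_release(&fs->lock);
	while (!list_empty(&local)) {
		struct demo_inode *di = inode_of(local.next);
		list_del_init(&di->delayed_iput);
		di->puts++;			/* stand-in for iput() */
	}
}

/* Counter: count and list membership change together, so lock per iput. */
static void run_delayed_counter(struct demo_fs *fs)
{
	for (;;) {
		struct demo_inode *di;

		fake_lock_acquire(&fs->lock);
		if (list_empty(&fs->delayed_iputs)) {
			fake_lock_release(&fs->lock);
			return;
		}
		di = inode_of(fs->delayed_iputs.next);
		if (--di->pending == 0)
			list_del_init(&di->delayed_iput);
		fake_lock_release(&fs->lock);
		di->puts++;			/* stand-in for iput() */
	}
}

/* Lock acquisitions needed to drain three queued inodes. */
static unsigned long drain_locks(int use_counter)
{
	struct demo_fs fs = { { 0 }, { NULL, NULL } };
	struct demo_inode in[3];
	unsigned int i, puts = 0;

	INIT_LIST_HEAD(&fs.delayed_iputs);
	for (i = 0; i < 3; i++) {
		in[i].pending = 1;
		in[i].puts = 0;
		list_add_tail(&in[i].delayed_iput, &fs.delayed_iputs);
	}
	if (use_counter)
		run_delayed_counter(&fs);
	else
		run_delayed_splice(&fs);
	for (i = 0; i < 3; i++)
		puts += in[i].puts;
	assert(puts == 3);
	return fs.lock.acquisitions;
}
```

In this toy setup drain_locks(0) takes the lock once and drain_locks(1) takes it four times (three iputs plus the final empty check). Whether that difference matters in practice is exactly the open question raised above.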
Jeff Mahoney Feb. 12, 2013, 7:34 a.m. UTC | #6
On 2/6/13 11:02 AM, Liu Bo wrote:
> On Wed, Feb 06, 2013 at 09:53:05AM -0600, Eric Sandeen wrote:
>> On 2/5/13 8:08 PM, Liu Bo wrote:
>>> On Tue, Feb 05, 2013 at 03:14:05PM -0800, Zach Brown wrote:
>>>>> +	struct btrfs_inode *b_inode = BTRFS_I(inode);
>>>>> +	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
>>>>>  
>>>>>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>>>>>  		return;
>>>>>  
>>>>> -	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
>>>>> -	delayed->inode = inode;
>>>>> -
>>>>>  	spin_lock(&fs_info->delayed_iput_lock);
>>>>> -	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
>>>>> +	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
>>>>>  	spin_unlock(&fs_info->delayed_iput_lock);
>>>>>  }
>>>> 
>>>> Hmm.  I'm not great with inode life cycles, but isn't this
>>>> only safe if someone else can't get an i_count reference
>>>> while this is in flight?  It looks like the final iput does
>>>> the unhashing, and so on, so couldn't an iget/iput race with
>>>> this and try to add the inode's list_head twice?
>>> 
>>> Yeah, same concern here.  Basically this will result in inodes
>>> still being in use on unmount.
>>> 
>>> Actually I did a similar one, here is some discussion:
>>> 
>>> https://patchwork.kernel.org/patch/1824711/
>> 
>> I read it, thanks.  Did you try the counter approach?
> 
> Yes, it'll bring a tradeoff situation.
> 
> With counter, we need to lock the list all the time instead of 
> doing a splice on the list and unlocking it.  I think splice would
> be faster, so I didn't go further (I MIGHT be wrong on this).

Thanks for looking into this. I left this note to myself during the
development of the error handling patches while on a tangent to try to
eliminate NOFAIL allocs. It's not the alloc/free that's the issue
(though eliminating these can probably only help), it's that NOFAIL
allocs essentially become locks when memory pressure is high enough
that the NOFAIL functionality gets invoked. OTOH, bailing out of that
path when we encounter an allocation failure is impossible.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
Patch

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 2a8c242..3024006 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -86,6 +86,8 @@  struct btrfs_inode {
 	 */
 	struct list_head ordered_operations;
 
+	struct list_head delayed_iput;
+
 	/* node for the red-black tree that links inodes in subvolume root */
 	struct rb_node rb_node;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cc93b23..cac7f43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2119,34 +2119,24 @@  zeroit:
 	return -EIO;
 }
 
-struct delayed_iput {
-	struct list_head list;
-	struct inode *inode;
-};
-
-/* JDM: If this is fs-wide, why can't we add a pointer to
- * btrfs_inode instead and avoid the allocation? */
 void btrfs_add_delayed_iput(struct inode *inode)
 {
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	struct delayed_iput *delayed;
+	struct btrfs_inode *b_inode = BTRFS_I(inode);
+	struct btrfs_fs_info *fs_info = b_inode->root->fs_info;
 
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
-	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
-	delayed->inode = inode;
-
 	spin_lock(&fs_info->delayed_iput_lock);
-	list_add_tail(&delayed->list, &fs_info->delayed_iputs);
+	list_add_tail(&b_inode->delayed_iput, &fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_root *root)
 {
 	LIST_HEAD(list);
+	struct btrfs_inode *b_inode;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct delayed_iput *delayed;
 	int empty;
 
 	spin_lock(&fs_info->delayed_iput_lock);
@@ -2160,10 +2150,9 @@  void btrfs_run_delayed_iputs(struct btrfs_root *root)
 	spin_unlock(&fs_info->delayed_iput_lock);
 
 	while (!list_empty(&list)) {
-		delayed = list_entry(list.next, struct delayed_iput, list);
-		list_del(&delayed->list);
-		iput(delayed->inode);
-		kfree(delayed);
+		b_inode = list_entry(list.next, struct btrfs_inode, delayed_iput);
+		list_del(&b_inode->delayed_iput);
+		iput(&b_inode->vfs_inode);
 	}
 }
 
@@ -7142,6 +7131,7 @@  struct inode *btrfs_alloc_inode(struct super_block *sb)
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
+	INIT_LIST_HEAD(&ei->delayed_iput);
 	RB_CLEAR_NODE(&ei->rb_node);
 
 	return inode;