Message ID | 1448054554-24138-1-git-send-email-idryomov@gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Nov 20, 2015 at 10:22:34PM +0100, Ilya Dryomov wrote:
> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") an inode, at some point in its
> lifetime, gets attached to a wb (struct bdi_writeback). Detaching
> happens on evict, in inode_detach_wb() called from __destroy_inode(),
> and involves updating wb.
>
> However, detaching an internal bdev inode from its wb in
> __destroy_inode() is too late. Its bdi and by extension root wb are
> embedded into struct request_queue, which has different lifetime rules
> and can be freed long before the final bdput() is called (can be from
> __fput() of a corresponding /dev inode, through dput() - evict() -
> bd_forget()). bdevs hold onto the underlying disk/queue pair only
> while opened; as soon as bdev is closed all bets are off. In fact,
> disk/queue can be gone before __blkdev_put() even returns:
>
> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> 1500 {
> ...
> 1518     if (bdev->bd_contains == bdev) {
> 1519         if (disk->fops->release)
> 1520             disk->fops->release(disk, mode);
>
>              [ Driver puts its references to disk/queue ]
>
> 1521     }
> 1522     if (!bdev->bd_openers) {
> 1523         struct module *owner = disk->fops->owner;
> 1524
> 1525         disk_put_part(bdev->bd_part);
> 1526         bdev->bd_part = NULL;
> 1527         bdev->bd_disk = NULL;
> 1528         if (bdev != bdev->bd_contains)
> 1529             victim = bdev->bd_contains;
> 1530         bdev->bd_contains = NULL;
> 1531
> 1532         put_disk(disk);
>
>              [ We put ours, the queue is gone
>                The last bdput() would result in a write to invalid memory ]
>
> 1533         module_put(owner);
> ...
> 1539 }
>
> Since bdev inodes are special anyway, detach them in __blkdev_put()
> after clearing inode's dirty bits, turning the problematic
> inode_detach_wb() in __destroy_inode() into a noop.
>
> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
> gendisk hold a reference to its queue"), so the old ->release comment
> is removed in favor of the new inode_detach_wb() comment.
>
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
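The object relationships the patch describes look roughly like the
simplified sketch below. The structs are trimmed to the relevant
fields and are not the real (much larger) kernel definitions, though
the field names follow the 4.3-era code: the bdi is embedded in the
request_queue, the root wb is embedded in the bdi, and the inode only
caches a pointer into that chain.

/*
 * Simplified sketch of the relationships described above; trimmed to
 * the relevant fields, not the real kernel layouts.
 */
struct bdi_writeback {
	unsigned long state;			/* stand-in for dirty lists, counters, ... */
};

struct backing_dev_info {
	struct bdi_writeback wb;		/* root wb, embedded in the bdi */
};

struct request_queue {
	struct backing_dev_info backing_dev_info;	/* embedded: freed with the queue */
};

struct inode {
	struct bdi_writeback *i_wb;		/* cached by cgroup writeback; for a bdev
						 * inode it points into the queue above and
						 * dangles once the driver drops the
						 * disk/queue pair */
};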
The right fix is to never point from the inode to a BDI. I fixed this
with a lot of effort, and the BDI writeback series put it back a little
later. We need to revert that damage (not necessarily literally, but
the effect).
Hello, Christoph.

On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> The right fix is to never point from the inode to a BDI. I fixed this
> with a lot of effort, and the BDI writeback series put it back a little
> later. We need to revert that damage (not necessarily literally, but
> the effect).

Can you please explain a bit more why an inode pointing to its
associated bdi_writeback is a bad idea? The only alternative would be
recording a key and looking up each time. We sure can do that but
I don't get why we would want to.

Thanks.
On Mon, Nov 23, 2015 at 10:35:52AM -0500, Tejun Heo wrote:
> Hello, Christoph.
>
> On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> > The right fix is to never point from the inode to a BDI. I fixed this
> > with a lot of effort, and the BDI writeback series put it back a little
> > later. We need to revert that damage (not necessarily literally, but
> > the effect).
>
> Can you please explain a bit more why an inode pointing to its
> associated bdi_writeback is a bad idea? The only alternative would be
> recording a key and looking up each time. We sure can do that but
> I don't get why we would want to.

Because the writeback context really is a per-super-block concept
(except for the magic block device inodes). Having another pointer in
the inode (or address_space back then) led to all kinds of confusing
bugs and lifetime issues, never mind massively bloating the inode for
no reason. But then again bloating the inode has been rather en vogue
lately anyway.
Hello, Christoph.

On Tue, Nov 24, 2015 at 08:54:57AM +0100, Christoph Hellwig wrote:
> > Can you please explain a bit more why an inode pointing to its
> > associated bdi_writeback is a bad idea? The only alternative would be
> > recording a key and looking up each time. We sure can do that but
> > I don't get why we would want to.
>
> Because the writeback context really is a per-super-block concept
> (except for the magic block device inodes). Having another pointer in

With cgroup writeback, inodes belonging to the same superblock can
belong to different writeback domains, so it is no longer strictly a
per-super-block concept.

> the inode (or address_space back then) led to all kinds of confusing
> bugs and lifetime issues, never mind massively bloating the inode for

The bdev case is different because a bdev's lifetime doesn't coincide
with the actual object it's representing, but the root cause there is
lower-layer objects being destroyed while higher-layer objects
depending on them are still around. Severing the inode -> lower-layer
object dependency may be able to solve some issues, but those
approaches are usually pretty brittle. The right thing to do is
letting objects get torn down in order by maintaining proper refs.

> no reason. But then again bloating the inode has been rather en vogue
> lately anyway.

How is it for no reason when there are clearly purposes for them? If
the extra pointer is problematic, we can try to split sb's so that we
can encode writeback domain information by pointing to
per-writeback-domain split sb's, but I'm not sure the added complexity
would be justifiable.

Thanks.
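To make the trade-off being discussed concrete, here is an
illustrative-only contrast of the two approaches. The cached-pointer
variant reflects what cgroup writeback does today via the inode's
->i_wb field; the lookup variant is purely hypothetical, and
lookup_wb_by_key() and ->i_wb_key are invented names for this sketch,
not real kernel APIs.

/*
 * Illustrative contrast only -- lookup_wb_by_key() and ->i_wb_key are
 * invented for this sketch; the kernel only has the cached ->i_wb.
 */
static struct bdi_writeback *wb_via_cached_pointer(struct inode *inode)
{
	/*
	 * What cgroup writeback does: O(1) on every writeback operation,
	 * but the pointer must be severed (inode_detach_wb()) before the
	 * object it points into is freed.
	 */
	return inode->i_wb;
}

static struct bdi_writeback *wb_via_lookup(struct inode *inode)
{
	/*
	 * The alternative mentioned above: keep only a key in the inode
	 * and resolve it on each use -- no pointer to keep from dangling,
	 * at the cost of a lookup on hot writeback paths.
	 */
	return lookup_wb_by_key(inode->i_sb, inode->i_wb_key);
}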
* Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:

> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") an inode, at some point in its
> lifetime, gets attached to a wb (struct bdi_writeback). Detaching
> happens on evict, in inode_detach_wb() called from __destroy_inode(),
> and involves updating wb.
>
> [...]
>
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---

Feel free to add
Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

I was facing a bad memory access problem while creating thousands of
containers. With this patch I am able to create more than 10k
containers without hitting the problem.

I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>
>> [...]
>>
>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>> ---
>
> Feel free to add
> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> I was facing a bad memory access problem while creating thousands of
> containers. With this patch I am able to create more than 10k
> containers without hitting the problem.
>
> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149

Great! Christoph's concern is with ->i_wb as a whole, not this
particular patch - Al, this one is marked for stable, can we get it
merged into -rc4? Or should it go through Jens' tree, as cgroup
writeback patches did?

Thanks,

Ilya
On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>> [...]
>> Feel free to add
>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> [...]
>
> Great! Christoph's concern is with ->i_wb as a whole, not this
> particular patch - Al, this one is marked for stable, can we get it
> merged into -rc4? Or should it go through Jens' tree, as cgroup
> writeback patches did?

Ping?

Thanks,

Ilya
On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
> On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
>> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>> [...]
>>> Feel free to add
>>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> [...]
>>
>> Great! Christoph's concern is with ->i_wb as a whole, not this
>> particular patch - Al, this one is marked for stable, can we get it
>> merged into -rc4? Or should it go through Jens' tree, as cgroup
>> writeback patches did?
>
> Ping?

Was holding off, but I think we should get the simple fix in for 4.4.
I've applied it for this series.
On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
>> [...]
>>
>> Ping?
>
> Was holding off, but I think we should get the simple fix in for 4.4.
> I've applied it for this series.

I think you missed Raghavendra's Tested-by (I'm looking at
https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

Thanks,

Ilya
On 12/04/2015 10:55 AM, Ilya Dryomov wrote:
> On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> [...]
>> Was holding off, but I think we should get the simple fix in for 4.4.
>> I've applied it for this series.
>
> I think you missed Raghavendra's Tested-by (I'm looking at
> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

I did, fixed.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..6f8dd407c533 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1509,11 +1509,14 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		WARN_ON_ONCE(bdev->bd_holders);
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
+
+		bdev_write_inode(bdev);
 		/*
-		 * ->release can cause the queue to disappear, so flush all
-		 * dirty data before.
+		 * Detaching bdev inode from its wb in __destroy_inode()
+		 * is too late: the queue which embeds its bdi (along with
+		 * root wb) can be gone as soon as we put_disk() below.
 		 */
-		bdev_write_inode(bdev);
+		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
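The fix hinges on ordering within the last-close path: the inode must
still have its wb while it is being written out, and must have dropped
it before anything that can free the queue runs. The annotated sketch
below restates that ordering; it is a condensed paraphrase of the hunk
above, not the literal kernel code, and last_close_teardown() is an
invented name for illustration.

/*
 * Condensed paraphrase of the ordering the patch establishes in the
 * !bd_openers branch of __blkdev_put(); not the literal kernel code.
 */
static void last_close_teardown(struct block_device *bdev)
{
	sync_blockdev(bdev);		/* flush dirty pages to the device */
	kill_bdev(bdev);		/* invalidate the bdev page cache */

	/* Write out the bdev inode itself; this may still use ->i_wb. */
	bdev_write_inode(bdev);

	/*
	 * Drop the wb reference now, while the queue that embeds the bdi
	 * (and its root wb) is still around; the inode_detach_wb() later
	 * in __destroy_inode() then becomes a noop.
	 */
	inode_detach_wb(bdev->bd_inode);

	/*
	 * Only after this point may ->release() and put_disk() free the
	 * disk/queue pair.
	 */
}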
Since 52ebea749aae ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") an inode, at some point in its
lifetime, gets attached to a wb (struct bdi_writeback). Detaching
happens on evict, in inode_detach_wb() called from __destroy_inode(),
and involves updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late. Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget()). bdevs hold onto the underlying disk/queue pair only
while opened; as soon as bdev is closed all bets are off. In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518     if (bdev->bd_contains == bdev) {
1519         if (disk->fops->release)
1520             disk->fops->release(disk, mode);

             [ Driver puts its references to disk/queue ]

1521     }
1522     if (!bdev->bd_openers) {
1523         struct module *owner = disk->fops->owner;
1524
1525         disk_put_part(bdev->bd_part);
1526         bdev->bd_part = NULL;
1527         bdev->bd_disk = NULL;
1528         if (bdev != bdev->bd_contains)
1529             victim = bdev->bd_contains;
1530         bdev->bd_contains = NULL;
1531
1532         put_disk(disk);

             [ We put ours, the queue is gone
               The last bdput() would result in a write to invalid memory ]

1533         module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 fs/block_dev.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)