sd: use mempool for discard special page

Message ID 6da7da45-ad80-65df-f1f0-81a2d488459c@kernel.dk (mailing list archive)
State Accepted
Series sd: use mempool for discard special page

Commit Message

Jens Axboe Dec. 12, 2018, 1:46 p.m. UTC
When boxes are run near (or to) OOM, we have a problem with the discard
page allocation in sd. If we fail allocating the special page, we return
busy, and it'll get retried. But since ordering is honored for dispatch
requests, we can keep retrying this same IO and failing. Behind that IO
could be requests that want to free memory, but they never get the
chance. This means you get repeated spews of traces like this:

[1201401.625972] Call Trace:
[1201401.631748]  dump_stack+0x4d/0x65
[1201401.639445]  warn_alloc+0xec/0x190
[1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
[1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
[1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
[1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
[1201401.689424]  alloc_pages_current+0x8c/0x110
[1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
[1201401.709987]  sd_init_command+0x49c/0xb70
[1201401.719029]  scsi_setup_cmnd+0x9c/0x160
[1201401.727877]  scsi_queue_rq+0x4d9/0x610
[1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
[1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
[1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
[1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
[1201401.777886]  process_one_work+0x14b/0x400
[1201401.787119]  worker_thread+0x4b/0x470
[1201401.795586]  kthread+0x110/0x150
[1201401.803089]  ? rescuer_thread+0x320/0x320
[1201401.812322]  ? kthread_park+0x90/0x90
[1201401.820787]  ? do_syscall_64+0x53/0x150
[1201401.829635]  ret_from_fork+0x29/0x40

Ensure that the discard page allocation has a mempool backing, so we
know we can make progress.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

We actually hit this in production, it's not a theoretical issue.
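
The shape of the fix is the usual mempool pattern, condensed here from the
full patch at the bottom of this page (SD_MEMPOOL_SIZE is sd.c's existing
mempool size constant):

	/* a small reserve pool guarantees forward progress under OOM */
	static mempool_t *sd_page_pool;

	/* init_sd() */
	sd_page_pool = mempool_create_page_pool(SD_MEMPOOL_SIZE, 0);

	/* discard/write-same setup, replacing alloc_page(GFP_ATOMIC | __GFP_ZERO) */
	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
	if (!rq->special_vec.bv_page)
		return BLK_STS_RESOURCE;
	clear_highpage(rq->special_vec.bv_page);	/* pool pages are not pre-zeroed */

	/* sd_uninit_command(), replacing __free_page() */
	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
		mempool_free(rq->special_vec.bv_page, sd_page_pool);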

Comments

Christoph Hellwig Dec. 12, 2018, 2:21 p.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Note that NVMe will need the same treatment.
Jens Axboe Dec. 12, 2018, 2:35 p.m. UTC | #2
On 12/12/18 7:21 AM, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Note that NVMe will need the same treatment.

Indeed, I'll prep one for that too.
Jens Axboe Dec. 12, 2018, 3:22 p.m. UTC | #3
On 12/12/18 7:35 AM, Jens Axboe wrote:
> On 12/12/18 7:21 AM, Christoph Hellwig wrote:
>> Looks good:
>>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>
>> Note that NVMe will need the same treatment.
> 
> Indeed, I'll prep one for that too.

That one is a little worse, since we only need a full page if we
use all 256 segments. I don't want to make the fast case of
16 bytes single segment allocs get a full page, so we have to
track if we used kmalloc() or mempool_alloc() for this particular
range.

I guess I could abuse ->end_io for that, set it if we end up
punting to mempool. I'll do that.
Christoph Hellwig Dec. 12, 2018, 3:32 p.m. UTC | #4
On Wed, Dec 12, 2018 at 08:22:37AM -0700, Jens Axboe wrote:
> That one is a little worse, since we only need a full page if we
> use all 256 segments. I don't want to make the fast case of
> 16 bytes single segment allocs get a full page, so we have to
> track if we used kmalloc() or mempool_alloc() for this particular
> range.
> 
> I guess I could abuse ->end_io for that, set it if we end up
> punting to mempool. I'll do that.

How about a full emergency page hanging off struct nvme_ctrl,
and then in the completion path we can do:

	if (req->special_vec.bv_page == ctrl->discard_emergency_page)
		// clear some bit in ctrl->flags
	else
		kfree(page_address(req->special_vec.bv_page) +
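
Filled out with the discard_page and discard_page_busy fields that Jens adds
in the patch further down this thread, that cleanup branch would read roughly:

	if (req->special_vec.bv_page == ns->ctrl->discard_page)
		clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
	else
		kfree(page_address(req->special_vec.bv_page) +
		      req->special_vec.bv_offset);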
Jens Axboe Dec. 12, 2018, 3:33 p.m. UTC | #5
On 12/12/18 8:32 AM, Christoph Hellwig wrote:
> On Wed, Dec 12, 2018 at 08:22:37AM -0700, Jens Axboe wrote:
>> That one is a little worse, since we only need a full page if we
>> use all 256 segments. I don't want to make the fast case of
>> 16 bytes single segment allocs get a full page, so we have to
>> track if we used kmalloc() or mempool_alloc() for this particular
>> range.
>>
>> I guess I could abuse ->end_io for that, set it if we end up
>> punting to mempool. I'll do that.
> 
> How about a full emergency page hanging off struct nvme_ctrl,
> and then in the completion path we can do:
> 
> 	if (req->special_vec.bv_page == ctrl->discard_emergency_page)
> 		// clear some bit in ctrl->flags
> 	else
> 		kfree(page_address(req->special_vec.bv_page) +

That's not a bad idea, then we don't have to track it.
Jens Axboe Dec. 12, 2018, 3:45 p.m. UTC | #6
On 12/12/18 8:33 AM, Jens Axboe wrote:
> On 12/12/18 8:32 AM, Christoph Hellwig wrote:
>> On Wed, Dec 12, 2018 at 08:22:37AM -0700, Jens Axboe wrote:
>>> That one is a little worse, since we only need a full page if we
>>> use all 256 segments. I don't want to make the fast case of
>>> 16 bytes single segment allocs get a full page, so we have to
>>> track if we used kmalloc() or mempool_alloc() for this particular
>>> range.
>>>
>>> I guess I could abuse ->end_io for that, set it if we end up
>>> punting to mempool. I'll do that.
>>
>> How about a full emergency page hanging off struct nvme_ctrl,
>> and then in the completion path we can do:
>>
>> 	if (req->special_vec.bv_page == ctrl->discard_emergency_page)
>> 		// clear some bit in ctrl->flags
>> 	else
>> 		kfree(page_address(req->special_vec.bv_page) +
> 
> That's not a bad idea, then we don't have to track it.

Something like this, will test it.


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c71e879821ad..7ca988e58790 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -564,9 +564,14 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	struct nvme_dsm_range *range;
 	struct bio *bio;
 
-	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
-	if (!range)
-		return BLK_STS_RESOURCE;
+	range = kmalloc_array(segments, sizeof(*range),
+				GFP_ATOMIC | __GFP_NOWARN);
+	if (!range) {
+		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy))
+			return BLK_STS_RESOURCE;
+
+		range = page_address(ns->ctrl->discard_page);
+	}
 
 	__rq_for_each_bio(bio, req) {
 		u64 slba = nvme_block_nr(ns, bio->bi_iter.bi_sector);
@@ -581,7 +586,10 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	}
 
 	if (WARN_ON_ONCE(n != segments)) {
-		kfree(range);
+		if (virt_to_page(range) == ns->ctrl->discard_page)
+			clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
+		else
+			kfree(range);
 		return BLK_STS_IOERR;
 	}
 
@@ -664,8 +672,13 @@ void nvme_cleanup_cmd(struct request *req)
 				blk_rq_bytes(req) >> ns->lba_shift);
 	}
 	if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-		kfree(page_address(req->special_vec.bv_page) +
-		      req->special_vec.bv_offset);
+		struct nvme_ns *ns = req->rq_disk->private_data;
+		struct page *page = req->special_vec.bv_page;
+
+		if (page == ns->ctrl->discard_page)
+			clear_bit_unlock(0, &ns->ctrl->discard_page_busy);
+		else
+			kfree(page_address(page) + req->special_vec.bv_offset);
 	}
 }
 EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
@@ -3578,6 +3591,7 @@ static void nvme_free_ctrl(struct device *dev)
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 	kfree(ctrl->effects);
 	nvme_mpath_uninit(ctrl);
+	kfree(ctrl->discard_page);
 
 	if (subsys) {
 		mutex_lock(&subsys->lock);
@@ -3618,6 +3632,12 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
 	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
 
+	ctrl->discard_page = alloc_page(GFP_KERNEL);
+	if (!ctrl->discard_page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto out;
@@ -3655,6 +3675,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 out_release_instance:
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 out:
+	if (ctrl->discard_page)
+		__free_page(ctrl->discard_page);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e20e737ac10c..f1fe88598a04 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -241,6 +241,9 @@ struct nvme_ctrl {
 	u16 maxcmd;
 	int nr_reconnects;
 	struct nvmf_ctrl_options *opts;
+
+	struct page *discard_page;
+	unsigned long discard_page_busy;
 };
 
 struct nvme_subsystem {
Martin K. Petersen Dec. 13, 2018, 1:10 a.m. UTC | #7
Jens,

> When boxes are run near (or to) OOM, we have a problem with the
> discard page allocation in sd. If we fail allocating the special page,
> we return busy, and it'll get retried. But since ordering is honored
> for dispatch requests, we can keep retrying this same IO and
> failing. Behind that IO could be requests that want to free memory,
> but they never get the chance. This means you get repeated spews of
> traces like this:

Looks good. Applied to 4.20/scsi-fixes, thanks!
James Bottomley Dec. 22, 2018, 2:48 a.m. UTC | #8
On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
> When boxes are run near (or to) OOM, we have a problem with the
> discard page allocation in sd. If we fail allocating the special
> page, we return busy, and it'll get retried. But since ordering is
> honored for dispatch requests, we can keep retrying this same IO and
> failing. Behind that IO could be requests that want to free memory,
> but they never get the chance. This means you get repeated spews of
> traces like this:
> 
> [1201401.625972] Call Trace:
> [1201401.631748]  dump_stack+0x4d/0x65
> [1201401.639445]  warn_alloc+0xec/0x190
> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
> [1201401.689424]  alloc_pages_current+0x8c/0x110
> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
> [1201401.709987]  sd_init_command+0x49c/0xb70
> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
> [1201401.777886]  process_one_work+0x14b/0x400
> [1201401.787119]  worker_thread+0x4b/0x470
> [1201401.795586]  kthread+0x110/0x150
> [1201401.803089]  ? rescuer_thread+0x320/0x320
> [1201401.812322]  ? kthread_park+0x90/0x90
> [1201401.820787]  ? do_syscall_64+0x53/0x150
> [1201401.829635]  ret_from_fork+0x29/0x40
> 
> Ensure that the discard page allocation has a mempool backing, so we
> know we can make progress.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> We actually hit this in production, it's not a theoretical issue.
> 
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 4a6ed2fc8c71..a1a44f52e0e8 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
[...]
> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct
> scsi_cmnd *cmd)
>  	unsigned int data_len = 24;
>  	char *buf;
>  
> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC |
> __GFP_ZERO);
> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
> GFP_ATOMIC);
>  	if (!rq->special_vec.bv_page)
>  		return BLK_STS_RESOURCE;
> +	clear_highpage(rq->special_vec.bv_page);

Since the kernel never accesses this page and you take pains to kmap it
when clearing, shouldn't the allocation have __GFP_HIGHMEM?

James
Jens Axboe Dec. 22, 2018, 3:49 a.m. UTC | #9
On 12/21/18 7:48 PM, James Bottomley wrote:
> On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
>> When boxes are run near (or to) OOM, we have a problem with the
>> discard page allocation in sd. If we fail allocating the special
>> page, we return busy, and it'll get retried. But since ordering is
>> honored for dispatch requests, we can keep retrying this same IO and
>> failing. Behind that IO could be requests that want to free memory,
>> but they never get the chance. This means you get repeated spews of
>> traces like this:
>>
>> [1201401.625972] Call Trace:
>> [1201401.631748]  dump_stack+0x4d/0x65
>> [1201401.639445]  warn_alloc+0xec/0x190
>> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
>> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
>> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
>> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
>> [1201401.689424]  alloc_pages_current+0x8c/0x110
>> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
>> [1201401.709987]  sd_init_command+0x49c/0xb70
>> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
>> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
>> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
>> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
>> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
>> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
>> [1201401.777886]  process_one_work+0x14b/0x400
>> [1201401.787119]  worker_thread+0x4b/0x470
>> [1201401.795586]  kthread+0x110/0x150
>> [1201401.803089]  ? rescuer_thread+0x320/0x320
>> [1201401.812322]  ? kthread_park+0x90/0x90
>> [1201401.820787]  ? do_syscall_64+0x53/0x150
>> [1201401.829635]  ret_from_fork+0x29/0x40
>>
>> Ensure that the discard page allocation has a mempool backing, so we
>> know we can make progress.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> We actually hit this in production, it's not a theoretical issue.
>>
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 4a6ed2fc8c71..a1a44f52e0e8 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
> [...]
>> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct
>> scsi_cmnd *cmd)
>>  	unsigned int data_len = 24;
>>  	char *buf;
>>  
>> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC |
>> __GFP_ZERO);
>> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
>> GFP_ATOMIC);
>>  	if (!rq->special_vec.bv_page)
>>  		return BLK_STS_RESOURCE;
>> +	clear_highpage(rq->special_vec.bv_page);
> 
> Since the kernel never accesses this page and you take pains to kmap it
> when clearing, shouldn't the allocation have __GFP_HIGHMEM?

It's not going to great lengths to kmap it, clear_page() conveniently
takes an address, not a page. So it's either

clear_page(page_address(rq->special_vec.bv_page));

or the current clear_highpage(). Shrug.
James Bottomley Dec. 22, 2018, 5:42 p.m. UTC | #10
On Fri, 2018-12-21 at 20:49 -0700, Jens Axboe wrote:
> On 12/21/18 7:48 PM, James Bottomley wrote:
> > On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
> > > When boxes are run near (or to) OOM, we have a problem with the
> > > discard page allocation in sd. If we fail allocating the special
> > > page, we return busy, and it'll get retried. But since ordering
> > > is
> > > honored for dispatch requests, we can keep retrying this same IO
> > > and
> > > failing. Behind that IO could be requests that want to free
> > > memory,
> > > but they never get the chance. This means you get repeated spews
> > > of
> > > traces like this:
> > > 
> > > [1201401.625972] Call Trace:
> > > [1201401.631748]  dump_stack+0x4d/0x65
> > > [1201401.639445]  warn_alloc+0xec/0x190
> > > [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
> > > [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
> > > [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
> > > [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
> > > [1201401.689424]  alloc_pages_current+0x8c/0x110
> > > [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
> > > [1201401.709987]  sd_init_command+0x49c/0xb70
> > > [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
> > > [1201401.727877]  scsi_queue_rq+0x4d9/0x610
> > > [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
> > > [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
> > > [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
> > > [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
> > > [1201401.777886]  process_one_work+0x14b/0x400
> > > [1201401.787119]  worker_thread+0x4b/0x470
> > > [1201401.795586]  kthread+0x110/0x150
> > > [1201401.803089]  ? rescuer_thread+0x320/0x320
> > > [1201401.812322]  ? kthread_park+0x90/0x90
> > > [1201401.820787]  ? do_syscall_64+0x53/0x150
> > > [1201401.829635]  ret_from_fork+0x29/0x40
> > > 
> > > Ensure that the discard page allocation has a mempool backing, so
> > > we
> > > know we can make progress.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > 
> > > ---
> > > 
> > > We actually hit this in production, it's not a theoretical issue.
> > > 
> > > 
> > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > > index 4a6ed2fc8c71..a1a44f52e0e8 100644
> > > --- a/drivers/scsi/sd.c
> > > +++ b/drivers/scsi/sd.c
> > 
> > [...]
> > > @@ -759,9 +760,10 @@ static blk_status_t
> > > sd_setup_unmap_cmnd(struct
> > > scsi_cmnd *cmd)
> > >  	unsigned int data_len = 24;
> > >  	char *buf;
> > >  
> > > -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC |
> > > __GFP_ZERO);
> > > +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
> > > GFP_ATOMIC);
> > >  	if (!rq->special_vec.bv_page)
> > >  		return BLK_STS_RESOURCE;
> > > +	clear_highpage(rq->special_vec.bv_page);
> > 
> > Since the kernel never accesses this page and you take pains to
> > kmap it
> > when clearing, shouldn't the allocation have __GFP_HIGHMEM?
> 
> It's not going to great lengths to kmap it, clear_page() conveniently
> takes an address, not a page. So it's either
> 
> clear_page(page_address(rq->special_vec.bv_page));
> 
> or the current clear_highpage(). Shrug.

Yes, I guessed that and agree our semantics are slightly problematic.

However, the question was, since this is about starvation, why not use
__GFP_HIGHMEM in the allocation?  I realise it makes no difference on
64 bit, but it does on 32 where it allows access to the HIGHMEM zone
which is often quite large and would lead to better behaviour under
memory pressure.

James
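
In code terms, the variant James is describing would just add __GFP_HIGHMEM
to the allocation so the page can come from the highmem zone on 32-bit;
clear_highpage() already copes with that by kmapping the page. A hypothetical
version of the setup-path hunk:

	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
						GFP_ATOMIC | __GFP_HIGHMEM);
	if (!rq->special_vec.bv_page)
		return BLK_STS_RESOURCE;
	clear_highpage(rq->special_vec.bv_page);	/* kmaps on 32-bit highmem */

Note that the pool's pre-allocated reserve pages are filled with GFP_KERNEL
at mempool_create_page_pool() time, so only the non-reserve fast path would
actually land in highmem.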
Jens Axboe Dec. 23, 2018, 12:39 a.m. UTC | #11
On 12/22/18 10:42 AM, James Bottomley wrote:
> On Fri, 2018-12-21 at 20:49 -0700, Jens Axboe wrote:
>> On 12/21/18 7:48 PM, James Bottomley wrote:
>>> On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
>>>> When boxes are run near (or to) OOM, we have a problem with the
>>>> discard page allocation in sd. If we fail allocating the special
>>>> page, we return busy, and it'll get retried. But since ordering
>>>> is
>>>> honored for dispatch requests, we can keep retrying this same IO
>>>> and
>>>> failing. Behind that IO could be requests that want to free
>>>> memory,
>>>> but they never get the chance. This means you get repeated spews
>>>> of
>>>> traces like this:
>>>>
>>>> [1201401.625972] Call Trace:
>>>> [1201401.631748]  dump_stack+0x4d/0x65
>>>> [1201401.639445]  warn_alloc+0xec/0x190
>>>> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
>>>> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
>>>> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
>>>> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
>>>> [1201401.689424]  alloc_pages_current+0x8c/0x110
>>>> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
>>>> [1201401.709987]  sd_init_command+0x49c/0xb70
>>>> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
>>>> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
>>>> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
>>>> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
>>>> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
>>>> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
>>>> [1201401.777886]  process_one_work+0x14b/0x400
>>>> [1201401.787119]  worker_thread+0x4b/0x470
>>>> [1201401.795586]  kthread+0x110/0x150
>>>> [1201401.803089]  ? rescuer_thread+0x320/0x320
>>>> [1201401.812322]  ? kthread_park+0x90/0x90
>>>> [1201401.820787]  ? do_syscall_64+0x53/0x150
>>>> [1201401.829635]  ret_from_fork+0x29/0x40
>>>>
>>>> Ensure that the discard page allocation has a mempool backing, so
>>>> we
>>>> know we can make progress.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>
>>>> ---
>>>>
>>>> We actually hit this in production, it's not a theoretical issue.
>>>>
>>>>
>>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>>> index 4a6ed2fc8c71..a1a44f52e0e8 100644
>>>> --- a/drivers/scsi/sd.c
>>>> +++ b/drivers/scsi/sd.c
>>>
>>> [...]
>>>> @@ -759,9 +760,10 @@ static blk_status_t
>>>> sd_setup_unmap_cmnd(struct
>>>> scsi_cmnd *cmd)
>>>>  	unsigned int data_len = 24;
>>>>  	char *buf;
>>>>  
>>>> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC |
>>>> __GFP_ZERO);
>>>> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
>>>> GFP_ATOMIC);
>>>>  	if (!rq->special_vec.bv_page)
>>>>  		return BLK_STS_RESOURCE;
>>>> +	clear_highpage(rq->special_vec.bv_page);
>>>
>>> Since the kernel never accesses this page and you take pains to
>>> kmap it
>>> when clearing, shouldn't the allocation have __GFP_HIGHMEM?
>>
>> It's not going to great lengths to kmap it, clear_page() conveniently
>> takes an address, not a page. So it's either
>>
>> clear_page(page_address(rq->special_vec.bv_page));
>>
>> or the current clear_highpage(). Shrug.
> 
> Yes, I guessed that and agree our semantics are slightly problematic.
> 
> However, the question was, since this is about starvation, why not use
> __GFP_HIGHMEM in the allocation?  I realise it makes no difference on
> 64 bit, but it does on 32 where it allows access to the HIGHMEM zone
> which is often quite large and would lead to better behaviour under
> memory pressure.

I seriously doubt it matters, it's two pages that are allocated. The
initial issue was requiring _extra_ memory when you are OOM. If you're
OOM, you'll be OOM at +2 pages too.

I hear what you're saying, but I seriously doubt it matters. What's
important here is forward progress. What happens if you need to bounce
the page? That'll put extra pressure on the allocator, even if the bounce
mempool still guarantees forward progress. It'll still hit the allocator
first.

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4a6ed2fc8c71..a1a44f52e0e8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -133,6 +133,7 @@  static DEFINE_MUTEX(sd_ref_mutex);
 
 static struct kmem_cache *sd_cdb_cache;
 static mempool_t *sd_cdb_pool;
+static mempool_t *sd_page_pool;
 
 static const char *sd_cache_types[] = {
 	"write through", "none", "write back",
@@ -759,9 +760,10 @@  static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	unsigned int data_len = 24;
 	char *buf;
 
-	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
 	if (!rq->special_vec.bv_page)
 		return BLK_STS_RESOURCE;
+	clear_highpage(rq->special_vec.bv_page);
 	rq->special_vec.bv_offset = 0;
 	rq->special_vec.bv_len = data_len;
 	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
@@ -793,9 +795,10 @@  static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
 	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 	u32 data_len = sdp->sector_size;
 
-	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
 	if (!rq->special_vec.bv_page)
 		return BLK_STS_RESOURCE;
+	clear_highpage(rq->special_vec.bv_page);
 	rq->special_vec.bv_offset = 0;
 	rq->special_vec.bv_len = data_len;
 	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
@@ -824,9 +827,10 @@  static blk_status_t sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd,
 	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 	u32 data_len = sdp->sector_size;
 
-	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
 	if (!rq->special_vec.bv_page)
 		return BLK_STS_RESOURCE;
+	clear_highpage(rq->special_vec.bv_page);
 	rq->special_vec.bv_offset = 0;
 	rq->special_vec.bv_len = data_len;
 	rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
@@ -1277,7 +1281,7 @@  static void sd_uninit_command(struct scsi_cmnd *SCpnt)
 	u8 *cmnd;
 
 	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
-		__free_page(rq->special_vec.bv_page);
+		mempool_free(rq->special_vec.bv_page, sd_page_pool);
 
 	if (SCpnt->cmnd != scsi_req(rq)->cmd) {
 		cmnd = SCpnt->cmnd;
@@ -3614,6 +3618,13 @@  static int __init init_sd(void)
 		goto err_out_cache;
 	}
 
+	sd_page_pool = mempool_create_page_pool(SD_MEMPOOL_SIZE, 0);
+	if (!sd_page_pool) {
+		printk(KERN_ERR "sd: can't init discard page pool\n");
+		err = -ENOMEM;
+		goto err_out_ppool;
+	}
+
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
 		goto err_out_driver;
@@ -3621,6 +3632,9 @@  static int __init init_sd(void)
 	return 0;
 
 err_out_driver:
+	mempool_destroy(sd_page_pool);
+
+err_out_ppool:
 	mempool_destroy(sd_cdb_pool);
 
 err_out_cache:
@@ -3647,6 +3661,7 @@  static void __exit exit_sd(void)
 
 	scsi_unregister_driver(&sd_template.gendrv);
 	mempool_destroy(sd_cdb_pool);
+	mempool_destroy(sd_page_pool);
 	kmem_cache_destroy(sd_cdb_cache);
 
 	class_unregister(&sd_disk_class);