Message ID | 1429101284-19490-2-git-send-email-m@bjorling.me (mailing list archive) |
---|---|
State | New, archived |
On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
> Allow users to hook into prep/unprep functions just before an IO is
> dispatched to the device driver. This is necessary for request-based
> logic to take place at upper layers.

I don't think any of this logic belongs in the block layer. All of this
should be library functions called by the drivers.
On 04/17/2015 08:34 AM, Christoph Hellwig wrote:
> On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
>> Allow users to hook into prep/unprep functions just before an IO is
>> dispatched to the device driver. This is necessary for request-based
>> logic to take place at upper layers.
>
> I don't think any of this logic belongs in the block layer. All of this
> should be library functions called by the drivers.

Just the prep/unprep, or other pieces as well?

I like that struct request_queue has a ref to struct nvm_dev, and that
the fields in struct request and struct bio used to get to it live in the
block layer.

In the future, applications can have an API to get/put flash blocks
directly (using the blk_nvm_[get/put]_blk interface).
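As a rough sketch, an interface of the shape hinted at here might look
like the following. Nothing below is taken from the patchset itself; the
struct and the exact signatures are assumptions built around the
blk_nvm_[get/put]_blk names mentioned above:

/* Sketch only: signatures are assumed, not from the patchset. */
struct nvm_block;

/* Reserve a free flash block on the device for the caller. */
struct nvm_block *blk_nvm_get_blk(struct nvm_dev *dev, int lun_id);

/* Release the flash block again, making it eligible for erase/reuse. */
void blk_nvm_put_blk(struct nvm_dev *dev, struct nvm_block *blk);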
On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
> Just the prep/unprep, or other pieces as well?

All of it - it's functionality that lies logically below the block
layer, so that's where it should be handled.

In fact it should probably work similarly to the mtd subsystem - that
is, have its own API for low-level drivers, and just export a block
driver as one consumer on the top side.

> In the future, applications can have an API to get/put flash blocks
> directly (using the blk_nvm_[get/put]_blk interface).

s/application/filesystem/?
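The mtd comparison implies a subsystem with its own registration API: in
mtd, low-level drivers call mtd_device_register() and the mtdblock
driver exposes a block device on top as one consumer. Below is a
hypothetical sketch of the same shape for open-channel SSDs; every name
in it is an assumption, not an existing interface:

/* Hypothetical open-channel core API, by analogy with mtd. Low-level
 * drivers (NVMe, vendor PCIe) register their media with the core; a
 * block driver sits on top as just one consumer.
 */
struct nvm_dev_ops {
	int (*identify)(struct nvm_dev *dev);	/* read media geometry */
	int (*submit_io)(struct nvm_dev *dev, struct bio *bio);
	int (*erase_block)(struct nvm_dev *dev, sector_t blk);
};

int nvm_register(struct nvm_dev *dev, const char *name,
		 struct nvm_dev_ops *ops);
void nvm_unregister(struct nvm_dev *dev);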
On 17-04-2015 19:46, Christoph Hellwig wrote:
> On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
>> Just the prep/unprep, or other pieces as well?
>
> All of it - it's functionality that lies logically below the block
> layer, so that's where it should be handled.
>
> In fact it should probably work similarly to the mtd subsystem - that
> is, have its own API for low-level drivers, and just export a block
> driver as one consumer on the top side.

The low-level drivers will be NVMe and vendors' own PCIe drivers. These
are very generic in nature, so each driver would duplicate the same
work. Both could have normal and open-channel drives attached.

I'd like to keep blk-mq in the loop. I don't think it will be pretty to
have two data paths in the drivers. With blk-mq, bios are split/merged
on the way down, so the actual physical addresses aren't known before
the IO has been diced to the right size.

The reason it shouldn't be under a single block device is that a target
should be able to provide a global address space. That allows the
address space to grow/shrink dynamically with the disks - a continuously
growing address space where disks can be added/removed as requirements
grow or flash ages. Not on a sector level, but on a flash block level.

>> In the future, applications can have an API to get/put flash blocks
>> directly (using the blk_nvm_[get/put]_blk interface).
>
> s/application/filesystem/?

Applications. The goal is that key-value stores, e.g. RocksDB,
Aerospike, Ceph and similar, have direct access to flash storage. There
won't be a kernel file system in between.

The get/put interface can be seen as a space reservation interface that
controls where a given process is allowed to access the storage media.

It can also be seen as the kernel providing a block allocator, while
applications implement the rest of the "file system" in user space,
specially optimized for their data structures. This makes a lot of sense
for a small subset (LSM trees, fractal trees, etc.) of database
applications.
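To picture the global address space argument, think of a flat index over
flash blocks that spans several devices and grows or shrinks as disks
come and go. The structures below are purely illustrative; nothing like
them appears in the patchset:

#include <linux/types.h>

/* Illustration only: a target-owned map from a global flash block index
 * to the device and local block backing it. Growing the address space
 * appends entries; retiring a disk remaps or drops its entries.
 */
struct gblk_entry {
	struct nvm_dev *dev;	/* device backing this global block */
	u64 local_blk;		/* block index within that device */
};

struct gblk_map {
	u64 nr_blks;		/* current size of the global space */
	struct gblk_entry *map;	/* map[global_blk] -> backing block */
};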
On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
> The low-level drivers will be NVMe and vendors' own PCIe drivers. These
> are very generic in nature, so each driver would duplicate the same
> work. Both could have normal and open-channel drives attached.

I didn't say the work should move into the driver, but rather that the
driver should talk to the open-channel SSD code directly instead of
hooking into the core block code.

> I'd like to keep blk-mq in the loop. I don't think it will be pretty to
> have two data paths in the drivers. With blk-mq, bios are split/merged
> on the way down, so the actual physical addresses aren't known before
> the IO has been diced to the right size.

But you _do_ have two different data paths already. Nothing says you
can't use blk-mq for your data path, but it should be a separate entry
point - similar to, say, how a SCSI disk and an MMC device both use the
block layer but still use different entry points.

> The reason it shouldn't be under a single block device is that a target
> should be able to provide a global address space. That allows the
> address space to grow/shrink dynamically with the disks - a
> continuously growing address space where disks can be added/removed as
> requirements grow or flash ages. Not on a sector level, but on a flash
> block level.

I don't understand what you mean by a single block device here, but I
suspect we're talking past each other somehow.

>>> In the future, applications can have an API to get/put flash blocks
>>> directly (using the blk_nvm_[get/put]_blk interface).
>>
>> s/application/filesystem/?
>
> Applications. The goal is that key-value stores, e.g. RocksDB,
> Aerospike, Ceph and similar, have direct access to flash storage. There
> won't be a kernel file system in between.
>
> The get/put interface can be seen as a space reservation interface that
> controls where a given process is allowed to access the storage media.
>
> It can also be seen as the kernel providing a block allocator, while
> applications implement the rest of the "file system" in user space,
> specially optimized for their data structures. This makes a lot of
> sense for a small subset (LSM trees, fractal trees, etc.) of database
> applications.

While we'll need a proper API for that first, it's just another reason
why we shouldn't shoehorn the open-channel SSD support into the block
layer.
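For reference, "a separate entry point that still uses blk-mq" is the
pattern ordinary blk-mq drivers already follow: they bring their own
blk_mq_ops and tag set instead of adding hooks to the core dispatch
path. A minimal sketch against the blk-mq API of this era; the my_*
names are placeholders:

#include <linux/blk-mq.h>
#include <linux/err.h>

/* Driver-private prep/remap logic lives in the driver's own queue_rq,
 * below the generic block layer rather than hooked into it.
 */
static int my_queue_rq(struct blk_mq_hw_ctx *hctx,
		       const struct blk_mq_queue_data *bd)
{
	return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
	.map_queue	= blk_mq_map_queue,	/* default ctx->hctx mapping */
};

static struct blk_mq_tag_set my_tag_set = {
	.ops		= &my_mq_ops,
	.nr_hw_queues	= 1,
	.queue_depth	= 64,
	.numa_node	= NUMA_NO_NODE,
};

static struct request_queue *my_init_queue(void)
{
	struct request_queue *q;
	int err;

	err = blk_mq_alloc_tag_set(&my_tag_set);
	if (err)
		return ERR_PTR(err);

	q = blk_mq_init_queue(&my_tag_set);
	if (IS_ERR(q))
		blk_mq_free_tag_set(&my_tag_set);
	return q;
}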
> On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:

<snip>

>> The reason it shouldn't be under a single block device is that a
>> target should be able to provide a global address space. That allows
>> the address space to grow/shrink dynamically with the disks - a
>> continuously growing address space where disks can be added/removed as
>> requirements grow or flash ages. Not on a sector level, but on a flash
>> block level.
>
> I don't understand what you mean by a single block device here, but I
> suspect we're talking past each other somehow.

Sorry. I meant that several block devices should form a single address
space (exposed as a single block device), consisting of all the flash
blocks. Applications could then get/put from that.

Thanks for your feedback. I'll push the pieces around and make the
integration self-contained outside of the block layer.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33c4285..f3dd028 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -338,6 +338,11 @@ EXPORT_SYMBOL(__blk_mq_end_request);
 
 void blk_mq_end_request(struct request *rq, int error)
 {
+	struct request_queue *q = rq->q;
+
+	if (q->unprep_rq_fn)
+		q->unprep_rq_fn(q, rq);
+
 	if (blk_update_request(rq, error, blk_rq_bytes(rq)))
 		BUG();
 	__blk_mq_end_request(rq, error);
@@ -753,6 +758,17 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	}
 }
 
+static int blk_mq_prep_rq(struct request_queue *q, struct request *rq)
+{
+	if (!q->prep_rq_fn)
+		return 0;
+
+	if (rq->cmd_flags & REQ_DONTPREP)
+		return 0;
+
+	return q->prep_rq_fn(q, rq);
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -812,11 +828,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		bd.list = dptr;
 		bd.last = list_empty(&rq_list);
 
-		ret = q->mq_ops->queue_rq(hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(hctx, &bd);
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
 			continue;
+		case BLK_MQ_RQ_QUEUE_DONE:
+			continue;
 		case BLK_MQ_RQ_QUEUE_BUSY:
 			list_add(&rq->queuelist, &rq_list);
 			__blk_mq_requeue_request(rq);
@@ -1270,10 +1290,14 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * error (busy), just add it to our list as we previously
 		 * would have done
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(data.hctx, &bd);
 		if (ret == BLK_MQ_RQ_QUEUE_OK)
 			goto done;
 		else {
+			if (ret == BLK_MQ_RQ_QUEUE_DONE)
+				goto done;
 			__blk_mq_requeue_request(rq);
 
 			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7aec861..d7b39af 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@ enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
Allow users to hook into prep/unprep functions just before an IO is
dispatched to the device driver. This is necessary for request-based
logic to take place at upper layers.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  1 +
 2 files changed, 27 insertions(+), 2 deletions(-)
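A consumer of the hooks this patch adds would install its callbacks on
the queue roughly as follows. The my_* names and callback bodies are
illustrative only; prep_rq_fn, unprep_rq_fn, REQ_DONTPREP and the
BLK_MQ_RQ_QUEUE_* codes are the ones used by the patch:

/* Remap/translate the request before ->queue_rq() sees it. Returning
 * BLK_MQ_RQ_QUEUE_OK (0) dispatches as usual; BLK_MQ_RQ_QUEUE_DONE
 * tells blk-mq the IO was handled entirely here; BUSY/ERROR behave as
 * for ->queue_rq(). Requests with REQ_DONTPREP set skip this hook.
 */
static int my_prep_rq(struct request_queue *q, struct request *rq)
{
	return BLK_MQ_RQ_QUEUE_OK;
}

/* Undo any per-request state; called from blk_mq_end_request(). */
static void my_unprep_rq(struct request_queue *q, struct request *rq)
{
}

/* During setup of the consumer: */
void my_setup(struct request_queue *q)
{
	q->prep_rq_fn = my_prep_rq;
	q->unprep_rq_fn = my_unprep_rq;
}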