Message ID | 1556191202-3245-1-git-send-email-joshi.k@samsung.com (mailing list archive) |
---|---|
Series | Extend write-hint framework, and add write-hint for Ext4 journal |
Hi Jens & other maintainers,
If this patch-set is in good shape now, could it please be considered for merging in the near future?
Thanks,
-----Original Message-----
From: Kanchan Joshi [mailto:joshi.k@samsung.com]
Sent: Thursday, April 25, 2019 4:50 PM
To: linux-kernel@vger.kernel.org; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; linux-fsdevel@vger.kernel.org; linux-ext4@vger.kernel.org
Cc: prakash.v@samsung.com; anshul@samsung.com; Kanchan Joshi <joshi.k@samsung.com>
Subject: [PATCH v5 0/7] Extend write-hint framework, and add write-hint for Ext4 journal
V5 series, towards extending write-hint/streams infrastructure for kernel-components, and adding support for sending write-hint with Ext4/JBD2 journal.
Here is the history/changelog -
Changes since v4:
- Removed write-hint field from request. bi_write_hint in bio is used for
merging checks now.
- Modified write-hint-to-stream conversion logic. Now, kernel hints are mapped
to upper range of stream-ids, while user-hints continue to remain mapped to
lower range of stream-ids.
Changes since v3:
- Correction in grouping related changes into patches
- Rectification in commit text at places
Changes since v2:
- Introduce API in block layer so that drivers can register stream info. Added
new limit in request queue for this purpose.
- Block layer does the conversion from write-hint to stream-id.
- Stream feature is not disabled anymore if device reports fewer streams than
a particular number (which was set as 4 earlier).
- Any write-hint beyond reported stream-count turns to 0.
- New macro "WRITE_LIFE_KERN_MIN" can be used as base by kernel mode components.
Changes since v1:
- introduce four more hints for in-kernel use, as recommended by Dave Chinner
& Jens Axboe. This isolates kernel-mode hints from user-mode ones.
- remove mount-option to specify write-hint, as recommended by Jan Kara &
Dave Chinner. Rather, FS always sets write-hint for journal. This gets ignored
if device does not support streams.
- Removed code-redundancy for write_dirty_buffer (Jan Kara's review comment)
V4 patch:
https://lkml.org/lkml/2019/4/17/870
V3 patch:
https://marc.info/?l=linux-block&m=155384631909082&w=2
V2 patch:
https://patchwork.kernel.org/cover/10754405/
V1 patch:
https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
Kanchan Joshi (7):
fs: introduce write-hint start point for in-kernel hints
block: increase stream count for in-kernel use
block: introduce API to register stream information with block-layer
block: introduce write-hint to stream-id conversion
nvme: register stream info with block layer
fs: introduce APIs to enable passing write-hint with buffer-head
fs/ext4,jbd2: add support for sending write-hint with journal
block/blk-core.c | 29 ++++++++++++++++++++++++++++-
block/blk-merge.c | 4 ++--
block/blk-settings.c | 12 ++++++++++++
drivers/nvme/host/core.c | 23 ++++++-----------------
fs/buffer.c | 18 ++++++++++++++++--
fs/ext4/ext4_jbd2.h | 1 +
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/blkdev.h | 8 ++++++--
include/linux/buffer_head.h | 3 +++
include/linux/fs.h | 2 ++
include/linux/jbd2.h | 8 ++++++++
14 files changed, 97 insertions(+), 30 deletions(-)
--
2.7.4
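
The hint-to-stream conversion described in the changelog above (user hints in the lower range of stream ids, kernel hints in the upper range, anything beyond the reported stream count turning to 0) can be pictured with a small userspace model. This is only a sketch under assumptions, not the code from the series: `hint_to_stream()`, the `HINT_*` names, and the value chosen for `HINT_KERN_MIN` are hypothetical stand-ins. The user hint values follow the kernel's `enum rw_hint`, and the count of four kernel hints comes from the v1 changelog entry.

```c
#include <assert.h>

/* User-visible lifetime hints; values mirror the kernel's enum rw_hint. */
enum model_hint {
    HINT_NOT_SET  = 0,
    HINT_NONE     = 1,
    HINT_SHORT    = 2,
    HINT_MEDIUM   = 3,
    HINT_LONG     = 4,
    HINT_EXTREME  = 5,
    /*
     * ASSUMPTION: base of the in-kernel hint range. The real value of
     * WRITE_LIFE_KERN_MIN is defined by patch 1 and may differ.
     */
    HINT_KERN_MIN = 6,
    HINT_KERN_MAX = 9,   /* four in-kernel hints, per the v1 changelog */
};

/*
 * Model of the conversion: returns a stream id in 1..nr_streams,
 * or 0 ("no stream") when the hint cannot be honoured.
 */
unsigned int hint_to_stream(unsigned int hint, unsigned int nr_streams)
{
    unsigned int stream;

    /* NOT_SET, NONE and unknown hints never select a stream. */
    if (hint <= HINT_NONE || hint > HINT_KERN_MAX)
        return 0;

    if (hint < HINT_KERN_MIN) {
        /* User hints fill the lower range: SHORT -> 1 ... EXTREME -> 4. */
        stream = hint - 1;
    } else {
        /* Kernel hints fill the upper range, counting down from the top. */
        unsigned int offset = hint - HINT_KERN_MIN;

        if (offset >= nr_streams)
            return 0;
        stream = nr_streams - offset;
    }

    /* Any hint beyond the reported stream count turns to 0. */
    return stream <= nr_streams ? stream : 0;
}
```

In this model a device reporting 8 streams places user hints on streams 1 to 4 and kernel hints on streams 8 down to 5. With fewer streams the two ranges can meet on the same id, which is a property of this sketch and not a statement about the actual patches.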
I think this fundamentally goes in the wrong direction. We explicitly designed the block layer infrastructure around life time hints and not the not fish not flesh streams interface, which causes all kinds of problems. Including the one this model causes on at least some SSDs where you now statically allocate resources to a stream that is now not globally available. All for the little log with very short data lifetime that any half decent hot/cold partitioning algorithm in the SSD should be able to detect.
Hi Christoph,

> Including the one this model causes on at least some SSDs where you now
> statically allocate resources to a stream that is now not globally
> available.

Sorry but can you please elaborate the issue? I do not get what is being statically allocated which was globally available earlier.
If you are referring to nvme driver, available streams at subsystem level are being reflected for all namespaces. This is same as earlier.
There is no attempt to explicitly allocate (using dir-receive) or reserve streams for any namespace.
Streams will continue to get allocated/released implicitly as and when writes (with stream id) arrive.

> All for the little log with very short data lifetime that any half decent
> hot/cold partitioning algorithm in the SSD should be able to detect.

With streams, hot/cold segregation happens at the time of placement itself, without an algorithm; that is a clear win over algorithms which need time/computation to do the same.
And the infrastructure update (write-hint-to-stream-id conversion in block-layer, in-kernel hints etc.) seems to be required anyway for streams to extend their reach beyond nvme and user-space hints.

Thanks,

-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org]
Sent: Friday, May 10, 2019 10:33 PM
To: Kanchan Joshi <joshi.k@samsung.com>
Cc: linux-kernel@vger.kernel.org; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; linux-fsdevel@vger.kernel.org; linux-ext4@vger.kernel.org; prakash.v@samsung.com; anshul@samsung.com
Subject: Re: [PATCH v5 0/7] Extend write-hint framework, and add write-hint for Ext4 journal

I think this fundamentally goes in the wrong direction. We explicitly designed the block layer infrastructure around life time hints and not the not fish not flesh streams interface, which causes all kinds of problems. Including the one this model causes on at least some SSDs where you now statically allocate resources to a stream that is now not globally available.
All for the little log with very short data lifetime that any half decent hot/cold partitioning algorithm in the SSD should be able to detect.
On Fri, May 17, 2019 at 11:01:55AM +0530, kanchan wrote:
> Sorry but can you please elaborate the issue? I do not get what is being
> statically allocated which was globally available earlier.
> If you are referring to nvme driver, available streams at subsystem level
> are being reflected for all namespaces. This is same as earlier.
> There is no attempt to explicitly allocate (using dir-receive) or reserve
> streams for any namespace.
> Streams will continue to get allocated/released implicitly as and when
> writes (with stream id) arrive.

We have made a conscious decision that we do not want to expose streams as an awkward not fish not flesh interface, but instead life time hints.

I see no reason to change from that and burden the whole streams complexity on other in-kernel callers.
On Mon 20-05-19 07:27:19, 'Christoph Hellwig' wrote:
> On Fri, May 17, 2019 at 11:01:55AM +0530, kanchan wrote:
> > Sorry but can you please elaborate the issue? I do not get what is being
> > statically allocated which was globally available earlier.
> > If you are referring to nvme driver, available streams at subsystem level
> > are being reflected for all namespaces. This is same as earlier.
> > There is no attempt to explicitly allocate (using dir-receive) or reserve
> > streams for any namespace.
> > Streams will continue to get allocated/released implicitly as and when
> > writes (with stream id) arrive.
>
> We have made a conscious decision that we do not want to expose streams
> as an awkward not fish not flesh interface, but instead life time hints.
>
> I see no reason to change from that and burden the whole streams complexity
> on other in-kernel callers.

I'm not following the "streams complexity" you talk about. At least the usecase Kanchan speaks about here is pretty simple for the filesystem: tagging journal writes with a special stream id. I agree that something like dynamically allocating available stream ids to different purposes is complex and has uncertain value, but this "static stream id for particular purpose" looks simple and sensible to me, and Kanchan has shown significant performance benefits for some drives. After all you can just think about it like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...

Honza
On Tue, May 21, 2019 at 10:25:28AM +0200, Jan Kara wrote:
> performance benefits for some drives. After all you can just think about it
> like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...

Except that it actually adds a parallel infrastructure. A RWH_WRITE_LIFE_JOURNAL would be much more palatable, but someone needs to explain how that is:

 a) different from RWH_WRITE_LIFE_SHORT
 b) would not apply to a log/journal maintained in userspace that works exactly the same
On Tue 21-05-19 01:28:46, 'Christoph Hellwig' wrote:
> On Tue, May 21, 2019 at 10:25:28AM +0200, Jan Kara wrote:
> > performance benefits for some drives. After all you can just think about it
> > like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...
>
> Except that it actually adds a parallel infrastructure. A
> RWH_WRITE_LIFE_JOURNAL would be much more palatable, but someone needs
> to explain how that is:
>
> a) different from RWH_WRITE_LIFE_SHORT

The problem I have with this is: What does "short" mean? What if userspace's notion of short differs from the kernel notion? Also the journal block lifetime is somewhat hard to predict. It depends on the size of the journal and metadata load on the filesystem so there's big variance. So all we really know is that all journal blocks are the same.

> b) would not apply to a log/journal maintained in userspace that works
> exactly the same

Lifetime of userspace journal/log may be significantly different from the lifetime of the filesystem journal. So using the same hint for them does not look like a great idea?

Honza
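
For reference, the userspace side of this exchange (a log or journal kept by an application and tagged RWH_WRITE_LIFE_SHORT) uses the existing per-file hint interface, `fcntl()` with `F_SET_RW_HINT`. The sketch below is an illustration only: the helper name `set_short_lived_hint()` is hypothetical, and the fallback constants are copied from the kernel UAPI header for builds against older libc headers. Whether the hint has any effect depends on the kernel, the filesystem and the device.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values as in <linux/fcntl.h>. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036          /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

/*
 * Hypothetical helper: mark everything written through this file's inode
 * as short-lived, e.g. for an application-level log.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_short_lived_hint(int fd)
{
    uint64_t hint = RWH_WRITE_LIFE_SHORT;

    /* F_SET_RW_HINT takes a pointer to a 64-bit hint value. */
    if (fcntl(fd, F_SET_RW_HINT, &hint) == -1)
        return -errno;
    return 0;
}
```

An application would call `set_short_lived_hint(fd)` once after opening its log file. Nothing in this interface says how "short" relates to the lifetime of a filesystem journal, which is the ambiguity raised above.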
Christoph,
May I know if you have thoughts about what Jan mentioned below?

I reflected upon the whole series again, and here is my understanding of your concern (I hope to address that, once I get it right).
The current patch-set targeted adding two things:
1. Extend write-hint infra for in-kernel callers
2. Send write-hint for FS-journal

In the process of doing 1, write-hint gets more closely connected to streams (as hint-to-stream conversion moves to the block layer). And perhaps this is what you object to.
Whether a write-hint converts into a flash stream or into something else is deliberately left to the device driver, and that's why the block layer does not have a hint-to-stream conversion in the first place. Is this the correct understanding of why things are the way they are?

On 2, sending write-hint for FS journal is actually important, as there is clear data on both performance and endurance benefits.
A RWH_WRITE_LIFE_JOURNAL or REQ_JOURNAL (that Martin Petersen suggested) kind of thing will help in identifying journal I/O, which can be useful for purposes other than streams as well.
I saw this LSFMM coverage https://lwn.net/Articles/788721/ , and felt that this could be useful for turbo-write in UFS.

BR,
Kanchan

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]
Sent: Wednesday, May 22, 2019 3:56 PM
To: 'Christoph Hellwig' <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>; kanchan <joshi.k@samsung.com>; linux-kernel@vger.kernel.org; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; linux-fsdevel@vger.kernel.org; linux-ext4@vger.kernel.org; prakash.v@samsung.com; anshul@samsung.com; Martin K. Petersen <martin.petersen@oracle.com>
Subject: Re: [PATCH v5 0/7] Extend write-hint framework, and add write-hint for Ext4 journal

On Tue 21-05-19 01:28:46, 'Christoph Hellwig' wrote:
> On Tue, May 21, 2019 at 10:25:28AM +0200, Jan Kara wrote:
> > performance benefits for some drives. After all you can just think
> > about it like RWH_WRITE_LIFE_JOURNAL type of hint available for the kernel...
>
> Except that it actually adds a parallel infrastructure. A
> RWH_WRITE_LIFE_JOURNAL would be much more palatable, but someone needs
> to explain how that is:
>
> a) different from RWH_WRITE_LIFE_SHORT

The problem I have with this is: What does "short" mean? What if userspace's notion of short differs from the kernel notion? Also the journal block lifetime is somewhat hard to predict. It depends on the size of the journal and metadata load on the filesystem so there's big variance. So all we really know is that all journal blocks are the same.

> b) would not apply to a log/journal maintained in userspace that works
> exactly the same

Lifetime of userspace journal/log may be significantly different from the lifetime of the filesystem journal. So using the same hint for them does not look like a great idea?

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Jun 26, 2019 at 06:17:29PM +0530, kanchan wrote:
> Christoph,
> May I know if you have thoughts about what Jan mentioned below?

As said I fundamentally disagree with exposing the streams mess at the block layer. I have no problem with setting a hint on the journal, but I do object to exposing the streams mess even more.