Message ID | 20241029151922.459139-1-kbusch@meta.com (mailing list archive) |
---|---|
Series | write hints with nvme fdp, scsi streams |
On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
> Return invalid value if user requests an invalid write hint
>
> Added and exported a block device feature flag for indicating generic
> placement hint support

But it still talks of write hints everywhere and conflates the write
streams with the temperature hints which are completely different
beasts.

On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Despite the reviews this is still incorrect. The permanent streams have
a relative data temperature associated with them as pointed out last
round and are not arbitrary write stream contexts despite (ab)using
the SBC streams facilities.

Bart, btw: I think the current sd implementation is buggy as well, as
it assumes the permanent streams are ordered by their data temperature
in the IO Advise hints mode page, but I can't find anything in the
spec that requires a particular ordering.

On Tue, Oct 29, 2024 at 04:26:54PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > The block limits exports the number of write hints, so set this limit if
> > the device reports support for the lifetime hints. Not only does this
> > inform the user of which hints are possible, it also allows scsi devices
> > supporting the feature to utilize the full range through raw block
> > device direct-io.
> >
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
>
> Despite the reviews this is still incorrect. The permanent streams have
> a relative data temperature associated with them as pointed out last
> round and are not arbitrary write stream contexts despite (ab)using
> the SBC streams facilities.

So then don't use it that way? I still don't know what change you're
expecting to happen with this feedback. What do you want the kernel to
do differently here?

On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> So then don't use it that way? I still don't know what change you're
> expecting to happen with this feedback. What do you want the kernel to
> do differently here?

Same as before: don't expose them as write streams, because they
aren't. A big mess in this series going back to the versions before
your involvement is that they somehow want to tie up the temperature
hints with the stream separation, which just ends up very messy.

On Tue, Oct 29, 2024 at 04:37:02PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> > So then don't use it that way? I still don't know what change you're
> > expecting to happen with this feedback. What do you want the kernel to
> > do differently here?
>
> Same as before: don't expose them as write streams, because they
> aren't. A big mess in this series going back to the versions before
> your involvement is that they somehow want to tie up the temperature
> hints with the stream separation, which just ends up very messy.

They're not exposed as write streams. Patch 7/9 sets the feature if it
is a placement id or not, and only nvme sets it, so scsi's attributes
are not claiming to be a write stream.

On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> They're not exposed as write streams. Patch 7/9 sets the feature if it
> is a placement id or not, and only nvme sets it, so scsi's attributes
> are not claiming to be a write stream.

So it shows up in sysfs, but:

 - queue_max_write_hints (which really should be queue_max_write_streams)
   still picks it up, and from there the statx interface

 - per-inode fcntl hints that encode a temperature still magically
   get dumped into the write streams if they are set.

In other words it's a really leaky half-baked abstraction.

Let's brainstorm how it could be done better:

 - the max_write_streams value is only set by block devices that actually
   do support write streams, and not the fire and forget temperature
   hints. The way this is queried is by having a non-zero value
   there, no need for an extra flag.
 - but the struct file (or maybe inode) gets a supported flag, as stream
   separation needs to be supported by the file system
 - a separate fcntl is used to set per-inode streams (if you care about
   that, seems like the bdev use case focusses on per-I/O). In that case
   we'd probably also need a separate inode field for them, or a somewhat
   complicated scheme to decide what is stored in the inode field if there
   is only one.
 - for block devices bdev/fops.c maps the temperature hints into write
   streams if write streams are supported, any user that mixes and
   matches write streams and temperature hints gets what they deserve
 - this could also be a helper for file systems that want to do the
   same.

Just a quick writeup while I'm on the run, there's probably a hole or
two that could be poked into it.

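The last two bullets of the brainstorm (mapping temperature hints onto write streams above the driver, and only when the device actually reports write streams) could be sketched in userspace C roughly as follows. This is only an illustration: `hint_to_write_stream()` and its parameters are hypothetical names, not existing kernel API, the enum values mirror the `RWH_WRITE_LIFE_*` constants from the Linux UAPI, and stream 0 is assumed here to mean "no stream".

```c
#include <stdint.h>

/* Values mirror RWH_WRITE_LIFE_* from the Linux UAPI headers. */
enum temp_hint {
	HINT_NOT_SET = 0,
	HINT_NONE    = 1,
	HINT_SHORT   = 2,
	HINT_MEDIUM  = 3,
	HINT_LONG    = 4,
	HINT_EXTREME = 5,
};

/*
 * Hypothetical helper: choose a write stream for a temperature hint.
 * A device without write streams reports max_write_streams == 0, which
 * is the only thing queried (no extra feature flag).
 * Returns 0 (assumed to mean "no stream") when nothing applies.
 */
static uint16_t hint_to_write_stream(enum temp_hint hint,
				     uint16_t max_write_streams)
{
	uint16_t stream;

	if (max_write_streams == 0 || hint <= HINT_NONE)
		return 0;

	/* SHORT..EXTREME become streams 1..4, clamped to what exists. */
	stream = (uint16_t)(hint - HINT_NONE);
	if (stream > max_write_streams)
		stream = max_write_streams;
	return stream;
}
```

With fewer streams than temperature levels, the hotter levels keep their own stream and the colder ones share the last one; that policy is a choice made for this sketch, not something the thread settles.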
On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > is a placement id or not, and only nvme sets it, so scsi's attributes
> > are not claiming to be a write stream.
>
> So it shows up in sysfs, but:
>
>  - queue_max_write_hints (which really should be queue_max_write_streams)
>    still picks it up, and from there the statx interface
>
>  - per-inode fcntl hints that encode a temperature still magically
>    get dumped into the write streams if they are set.
>
> In other words it's a really leaky half-baked abstraction.

Exactly why I asked last time: "who uses it and how do you want them to
use it" :)

> Let's brainstorm how it could be done better:
>
>  - the max_write_streams value is only set by block devices that actually
>    do support write streams, and not the fire and forget temperature
>    hints. The way this is queried is by having a non-zero value
>    there, no need for an extra flag.

So we need a completely different attribute for SCSI's permanent write
streams? You'd mentioned earlier you were okay with having SCSI be able
to utilize per-io raw block write hints. Having multiple things to
check for what are all just write classifiers seems unnecessarily
complicated.

>  - but the struct file (or maybe inode) gets a supported flag, as stream
>    separation needs to be supported by the file system
>  - a separate fcntl is used to set per-inode streams (if you care about
>    that, seems like the bdev use case focusses on per-I/O). In that case
>    we'd probably also need a separate inode field for them, or a somewhat
>    complicated scheme to decide what is stored in the inode field if there
>    is only one.

No need to create a new fcntl. The people already testing this are
successfully using FDP with the existing fcntl hints. Their applications
leverage FDP as a way to separate files based on expected lifetime. It is
how they want to use it and it is working above expectations.

>  - for block devices bdev/fops.c maps the temperature hints into write
>    streams if write streams are supported, any user that mixes and
>    matches write streams and temperature hints gets what they deserve

That's fine. This patch series pretty much accomplishes that part.

>  - this could also be a helper for file systems that want to do the
>    same.
>
> Just a quick writeup while I'm on the run, there's probably a hole or
> two that could be poked into it.

On 10/29/24 8:26 AM, Christoph Hellwig wrote:
> Bart, btw: I think the current sd implementation is buggy as well, as
> it assumes the permanent streams are ordered by their data temperature
> in the IO Advise hints mode page, but I can't find anything in the
> spec that requires a particular ordering.

How about modifying sd_read_io_hints() such that permanent stream
information is ignored if the order of the RELATIVE LIFETIME information
reported by the GET STREAM STATUS command does not match the permanent
stream order?

Thanks,

Bart.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..277035febd82 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3192,7 +3192,12 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 	sdkp->DPOFUA = 0;
 }
 
-static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
+/*
+ * Returns the relative lifetime of a permanent stream. Returns -1 if the
+ * GET STREAM STATUS command fails or if the stream is not a permanent stream.
+ */
+static int sd_perm_stream_rel_lifetime(struct scsi_disk *sdkp,
+				       unsigned int stream_id)
 {
 	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
 	struct {
@@ -3212,14 +3217,16 @@ static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
 	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
 			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
 	if (res < 0)
-		return false;
+		return -1;
 	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
 		sd_print_sense_hdr(sdkp, &sshdr);
 	if (res)
-		return false;
+		return -1;
 	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
-		return false;
-	return buf.h.stream_status[0].perm;
+		return -1;
+	if (!buf.h.stream_status[0].perm)
+		return -1;
+	return buf.h.stream_status[0].rel_lifetime;
 }
 
 static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
@@ -3247,9 +3254,17 @@ static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
 	 * should assign the lowest numbered stream identifiers to permanent
 	 * streams.
 	 */
-	for (desc = start; desc < end; desc++)
-		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+	int prev_rel_lifetime = -1;
+	for (desc = start; desc < end; desc++) {
+		int rel_lifetime;
+
+		if (!desc->st_enble)
 			break;
+		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
+		if (rel_lifetime < 0 || rel_lifetime < prev_rel_lifetime)
+			break;
+		prev_rel_lifetime = rel_lifetime;
+	}
 	permanent_stream_count_old = sdkp->permanent_stream_count;
 	sdkp->permanent_stream_count = desc - start;
 	if (sdkp->rscs && sdkp->permanent_stream_count < 2)

On 10/29/24 08:19, Keith Busch wrote:
> From: Kanchan Joshi <joshi.k@samsung.com>
>
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> Userspace can send the write hint information using io_uring or fcntl.
>
> Fetch the placement-identifiers if the device supports FDP. The incoming
> write-hint is mapped to a placement-identifier, which in turn is set in
> the DSPEC field of the write command.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Hui Qi <hui81.qi@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |  5 +++
>  include/linux/nvme.h     | 19 +++++++++
>  3 files changed, 108 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3de7555a7de74..bd7b89912ddb9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -44,6 +44,20 @@ struct nvme_ns_info {
>  	bool is_removed;
>  };
>  
> +struct nvme_fdp_ruh_status_desc {
> +	u16 pid;
> +	u16 ruhid;
> +	u32 earutr;
> +	u64 ruamw;
> +	u8 rsvd16[16];
> +};
> +
> +struct nvme_fdp_ruh_status {
> +	u8 rsvd0[14];
> +	__le16 nruhsd;
> +	struct nvme_fdp_ruh_status_desc ruhsd[];
> +};
> +
>  unsigned int admin_timeout = 60;
>  module_param(admin_timeout, uint, 0644);
>  MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
> @@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
>  	ida_free(&head->subsys->ns_ida, head->instance);
>  	cleanup_srcu_struct(&head->srcu);
>  	nvme_put_subsystem(head->subsys);
> +	kfree(head->plids);
>  	kfree(head);
>  }
>  
> @@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  	if (req->cmd_flags & REQ_RAHEAD)
>  		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>  
> +	if (req->write_hint && ns->head->nr_plids) {
> +		u16 hint = max(req->write_hint, ns->head->nr_plids);
> +
> +		dsmgmt |= ns->head->plids[hint - 1] << 16;
> +		control |= NVME_RW_DTYPE_DPLCMT;
> +	}
> +
>  	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
>  		return BLK_STS_INVAL;
>  
> @@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
>  	return ret;
>  }
>  
> +static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
> +{
> +	struct nvme_fdp_ruh_status_desc *ruhsd;
> +	struct nvme_ns_head *head = ns->head;
> +	struct nvme_fdp_ruh_status *ruhs;
> +	struct nvme_command c = {};
> +	int size, ret, i;
> +
> +	if (head->plids)
> +		return 0;
> +
> +	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
> +	ruhs = kzalloc(size, GFP_KERNEL);
> +	if (!ruhs)
> +		return -ENOMEM;
> +
> +	c.imr.opcode = nvme_cmd_io_mgmt_recv;
> +	c.imr.nsid = cpu_to_le32(nsid);
> +	c.imr.mo = 0x1;

can we please add some comment where values are hardcoded ?

> +	c.imr.numd = cpu_to_le32((size >> 2) - 1);
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
> +	if (ret)
> +		goto out;
> +
> +	i = le16_to_cpu(ruhs->nruhsd);

instead of i why can't we use local variable nr_plids ?

> +	if (!i)
> +		goto out;
> +
> +	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
> +	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
> +			      GFP_KERNEL);
> +	if (!head->plids) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < ns->head->nr_plids; i++) {
> +		ruhsd = &ruhs->ruhsd[i];
> +		head->plids[i] = le16_to_cpu(ruhsd->pid);
> +	}
> +out:
> +	kfree(ruhs);
> +	return ret;
> +}
> +
>  static int nvme_update_ns_info_block(struct nvme_ns *ns,
>  		struct nvme_ns_info *info)
>  {
> @@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>  		goto out;
>  	}
>  
> +	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
> +		ret = nvme_fetch_fdp_plids(ns, info->nsid);
> +		if (ret)
> +			dev_warn(ns->ctrl->device,
> +				 "FDP failure status:0x%x\n", ret);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ns->head->nr_plids = 0;
> +		kfree(ns->head->plids);
> +		ns->head->plids = NULL;
> +	}
> +
>  	blk_mq_freeze_queue(ns->disk->queue);
>  	ns->head->lba_shift = id->lbaf[lbaf].ds;
>  	ns->head->nuse = le64_to_cpu(id->nuse);
> @@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>  	if (!nvme_init_integrity(ns->head, &lim, info))
>  		capacity = 0;
>  
> +	lim.max_write_hints = ns->head->nr_plids;
> +	if (lim.max_write_hints)
> +		lim.features |= BLK_FEAT_PLACEMENT_HINTS;
>  	ret = queue_limits_commit_update(ns->disk->queue, &lim);
>  	if (ret) {
>  		blk_mq_unfreeze_queue(ns->disk->queue);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536b..cec8e5d96377b 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -454,6 +454,8 @@ struct nvme_ns_ids {
>  	u8 csi;
>  };
>  
> +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))

this calculates how many plids can fit into the ctrl page size ?

sorry but I didn't understand sizeof(16) here, since plids are u16
nvme_ns_head -> u16 *plids

should this be sizeof(u16) ?

-ck

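The `sizeof(16)` question can be checked outside the kernel. In C, `sizeof(16)` is the size of the integer constant `16`, which has type `int`, so the macro as posted divides by `sizeof(int)` rather than by the size of a 16-bit placement ID. The sketch below is a userspace illustration only: `CTRL_PAGE_SIZE` stands in for `NVME_CTRL_PAGE_SIZE` and is assumed to be 4096, and `clamp_hint()` is a hypothetical helper showing the bounded index that the `plids[hint - 1]` lookup appears to need (the quoted hunk uses `max()`, which does not bound the index).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for NVME_CTRL_PAGE_SIZE in this sketch. */
#define CTRL_PAGE_SIZE 4096u

/* As posted: divides by the size of the int constant 16. */
static size_t max_plids_as_posted(void)
{
	return CTRL_PAGE_SIZE / sizeof(16);
}

/* As the review suggests: divides by the size of a 16-bit ID. */
static size_t max_plids_u16(void)
{
	return CTRL_PAGE_SIZE / sizeof(uint16_t);
}

/*
 * Hypothetical helper: bound a 1-based write hint so that
 * plids[hint - 1] stays inside an array of nr_plids entries.
 */
static uint16_t clamp_hint(uint16_t write_hint, uint16_t nr_plids)
{
	return write_hint < nr_plids ? write_hint : nr_plids;
}
```

On common ABIs where `int` is 4 bytes, the posted form yields half as many entries as the `uint16_t` form.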
On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> >
> > So it shows up in sysfs, but:
> >
> >  - queue_max_write_hints (which really should be queue_max_write_streams)
> >    still picks it up, and from there the statx interface
> >
> >  - per-inode fcntl hints that encode a temperature still magically
> >    get dumped into the write streams if they are set.
> >
> > In other words it's a really leaky half-baked abstraction.
>
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)

For the temperature hints the only public user I know is rocksdb, and
that only started working when Hans fixed a brown paperbag bug in the
rocksdb code a while ago. Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.

For the separate write streams the usage I want for them is poor man's
zones - e.g. write N LBAs sequentially into a separate write stream
and then eventually discard them together. This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems. For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers if they are discontiguous) and the size you can fit into the
"reclaim unit" in FDP terms. I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits land.

> > Let's brainstorm how it could be done better:
> >
> >  - the max_write_streams value is only set by block devices that actually
> >    do support write streams, and not the fire and forget temperature
> >    hints. The way this is queried is by having a non-zero value
> >    there, no need for an extra flag.
>
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilize per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.

I don't think the multiple write streams interface applies to SCSI's
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into a "reclaim unit".

OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
up and downsides of that.

> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as a way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations.

FYI, I think it's always fine and easy to map the temperature hints to
write streams if that's all the driver offers. It loses a lot of the
capabilities, but as long as it doesn't enforce a lower level interface
that never exposes more that's fine.

On Tue, Oct 29, 2024 at 10:18:31AM -0700, Bart Van Assche wrote:
> On 10/29/24 8:26 AM, Christoph Hellwig wrote:
>> Bart, btw: I think the current sd implementation is buggy as well, as
>> it assumes the permanent streams are ordered by their data temperature
>> in the IO Advise hints mode page, but I can't find anything in the
>> spec that requires a particular ordering.
>
> How about modifying sd_read_io_hints() such that permanent stream
> information is ignored if the order of the RELATIVE LIFETIME information
> reported by the GET STREAM STATUS command does not match the permanent
> stream order?

That seems odd as there is nothing implying that they should be ordered.
The logical thing to do would be to have a little array mapping the
Linux temperature levels to the stream IDs.

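The "little array" idea can be illustrated in plain C. The sketch below is hypothetical and not taken from the sd driver: `build_level_map()` and `NR_LEVELS` are made-up names, and it assumes that a larger relative lifetime value means longer-lived data. It ranks the permanent streams by their reported relative lifetime, regardless of the order in which the device lists them, and assigns the coldest-to-hottest ranks to the temperature levels.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_LEVELS   4	/* short, medium, long, extreme */
#define MAX_STREAMS 64	/* arbitrary bound for this sketch */

/*
 * Hypothetical helper: fill map[level] with the id of the permanent
 * stream to use for each temperature level (index 0 = shortest lived).
 * rel_lifetime[i] is the relative lifetime reported for stream i.
 * Returns 0 on success, -1 if there are no permanent streams.
 */
static int build_level_map(const uint8_t *rel_lifetime, size_t nr_streams,
			   uint16_t map[NR_LEVELS])
{
	uint16_t order[MAX_STREAMS];
	size_t n = nr_streams < MAX_STREAMS ? nr_streams : MAX_STREAMS;
	size_t i, j;

	if (n == 0)
		return -1;

	/* Insertion sort of stream ids by ascending relative lifetime. */
	for (i = 0; i < n; i++) {
		uint16_t id = (uint16_t)i;

		for (j = i; j > 0 &&
		     rel_lifetime[order[j - 1]] > rel_lifetime[id]; j--)
			order[j] = order[j - 1];
		order[j] = id;
	}

	/* Levels beyond the number of streams share the longest-lived one. */
	for (i = 0; i < NR_LEVELS; i++)
		map[i] = order[i < n ? i : n - 1];
	return 0;
}
```

A lookup table like this makes the device's listing order irrelevant, which is the property the reply asks for; how to spread more than four streams across the levels is left open here.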
On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
>
> > No need to create a new fcntl. The people already testing this are
> > successfully using FDP with the existing fcntl hints. Their applications
> > leverage FDP as a way to separate files based on expected lifetime. It is
> > how they want to use it and it is working above expectations.
>
> FYI, I think it's always fine and easy to map the temperature hints to
> write streams if that's all the driver offers. It loses a lot of the
> capabilities, but as long as it doesn't enforce a lower level interface
> that never exposes more that's fine.

But that's just the v2 from this sequence:

https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/

If you're okay with it now, then let's just go with that and I'm happy
to continue iterating on the rest separately.

On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> >
> > > No need to create a new fcntl. The people already testing this are
> > > successfully using FDP with the existing fcntl hints. Their applications
> > > leverage FDP as a way to separate files based on expected lifetime. It is
> > > how they want to use it and it is working above expectations.
> >
> > FYI, I think it's always fine and easy to map the temperature hints to
> > write streams if that's all the driver offers. It loses a lot of the
> > capabilities, but as long as it doesn't enforce a lower level interface
> > that never exposes more that's fine.
>
> But that's just the v2 from this sequence:
>
> https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/
>
> If you're okay with it now, then let's just go with that and I'm happy
> to continue iterating on the rest separately.

That's exactly what I do not want - it takes the temperature hints
and forces them into the write streams down in the driver with no
way to make actually useful use of the stream separation.

On Wed, Oct 30, 2024 at 04:45:56PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > >
> > > > No need to create a new fcntl. The people already testing this are
> > > > successfully using FDP with the existing fcntl hints. Their applications
> > > > leverage FDP as a way to separate files based on expected lifetime. It is
> > > > how they want to use it and it is working above expectations.
> > >
> > > FYI, I think it's always fine and easy to map the temperature hints to
> > > write streams if that's all the driver offers. It loses a lot of the
> > > capabilities, but as long as it doesn't enforce a lower level interface
> > > that never exposes more that's fine.
> >
> > But that's just the v2 from this sequence:
> >
> > https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/
> >
> > If you're okay with it now, then let's just go with that and I'm happy
> > to continue iterating on the rest separately.
>
> That's exactly what I do not want - it takes the temperature hints
> and forces them into the write streams down in the driver

What??? You said to map the temperature hints to a write stream. The
driver offers that here. But you specifically don't want that? I'm so
confused.

> with no way to make actually useful use of the stream separation.

Have you tried it? The people who actually do easily demonstrate it is
in fact very useful.

On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> What??? You said to map the temperature hints to a write stream. The
> driver offers that here. But you specifically don't want that? I'm so
> confused.

In bdev/fops.c (or file systems if they want to do that) not down in the
driver forced down everyone's throat. Which was the whole point of the
discussion that we're running in circles here.

> > with no way to make actually useful use of the stream separation.
>
> Have you tried it? The people who actually do easily demonstrate it is
> in fact very useful.

While I've read the claim multiple times, I've not actually seen any
numbers.

On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > What??? You said to map the temperature hints to a write stream. The
> > driver offers that here. But you specifically don't want that? I'm so
> > confused.
>
> In bdev/fops.c (or file systems if they want to do that) not down in the
> driver forced down everyone's throat. Which was the whole point of the
> discussion that we're running in circles here.

That makes no sense. A change completely isolated to a driver isn't
forcing anything on anyone. It's the upper layers that are forcing this
down, whether the driver uses it or not: the hints are already getting
to the driver, but the driver currently doesn't use it.

Finding a way to use them is not some force to be demonized...

> > > with no way to make actually useful use of the stream separation.
> >
> > Have you tried it? The people who actually do easily demonstrate it is
> > in fact very useful.
>
> While I've read the claim multiple times, I've not actually seen any
> numbers.

Here's something recent from rocksdb developers running ycsb workloada
benchmark. The filesystem used is XFS.

It sets temperature hints for different SST levels, which already
happens today. The last data point made some minor changes with
level-to-hint mapping.

Without FDP:

WAF:        2.72
IOPS:       1465
READ LAT:   2681us
UPDATE LAT: 3115us

With FDP (rocksdb unmodified):

WAF:        2.26
IOPS:       1473
READ LAT:   2415us
UPDATE LAT: 2807us

With FDP (with some minor rocksdb changes):

WAF:        1.67
IOPS:       1547
READ LAT:   1978us
UPDATE LAT: 2267us

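To put the three data points above on a common scale, the relative change against the "Without FDP" baseline can be computed with a one-line helper. This is only arithmetic on the figures quoted in the mail; `pct_change()` is a name made up for this sketch.

```c
/*
 * Relative change of `after` versus `before`, in percent.
 * Negative values are reductions.
 */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Applied to the quoted numbers, WAF drops by roughly 16.9% with unmodified rocksdb (2.72 to 2.26) and by roughly 38.6% with the modified mapping (2.72 to 1.67), while IOPS rises by about 0.5% and 5.6% respectively. Read latency falls by about 9.9% and 26.2%, and update latency by about 9.9% and 27.2%.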
On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > > What??? You said to map the temperature hints to a write stream. The
> > > driver offers that here. But you specifically don't want that? I'm so
> > > confused.
> >
> > In bdev/fops.c (or file systems if they want to do that) not down in the
> > driver forced down everyone's throat. Which was the whole point of the
> > discussion that we're running in circles here.
>
> That makes no sense. A change completely isolated to a driver isn't
> forcing anything on anyone. It's the upper layers that are forcing this
> down, whether the driver uses it or not: the hints are already getting
> to the driver, but the driver currently doesn't use it.

And once it uses them by default, taking it away will have someone scream
regression, because we're not taking it away from that super special
use case.

> Here's something recent from rocksdb developers running ycsb workloada
> benchmark. The filesystem used is XFS.

Thanks for finally putting something up.

> It sets temperature hints for different SST levels, which already
> happens today. The last data point made some minor changes with
> level-to-hint mapping.

Do you have a pointer to the changes?

> Without FDP:
>
> WAF:        2.72
> IOPS:       1465
> READ LAT:   2681us
> UPDATE LAT: 3115us
>
> With FDP (rocksdb unmodified):
>
> WAF:        2.26
> IOPS:       1473
> READ LAT:   2415us
> UPDATE LAT: 2807us
>
> With FDP (with some minor rocksdb changes):
>
> WAF:        1.67
> IOPS:       1547
> READ LAT:   1978us
> UPDATE LAT: 2267us

Compared to the numbers Hans presented at Plumbers for the Zoned XFS code,
which should work just fine with FDP IFF we exposed real write streams,
which roughly double read and write IOPS and reduce the WAF to almost
1 this doesn't look too spectacular to be honest, but it sure is something.

I just wish we could get the real infrastructure instead of some band
aid, which makes it really hard to expose the real thing because now
it's been taken up and directly wired to a UAPI.

On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> For the temperature hints the only public user I know is rocksdb, and
> that only started working when Hans fixed a brown paperbag bug in the
> rocksdb code a while ago. Given that f2fs interprets the hints I suspect
> something in the Android world does as well, maybe Bart knows more.

UFS devices typically have less internal memory (SRAM) than the size of a
single zone. Hence, it helps UFS devices if it can be decided at the time
a write command is received where to send the data (SRAM, SLC NAND or
TLC NAND). This is why UFS vendors asked to provide data lifetime
information to zoned logical units. More information about UFS device
internals is available in this paper: Hwang, Joo-Young, Seokhwan Kim,
Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, Sangyeun Cho,
and Youjip Won. "{ZMS}: Zone Abstraction for Mobile Flash Storage." In
2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 173-189.
2024 (https://www.usenix.org/system/files/atc24-hwang.pdf).

Bart.

On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> And once it uses them by default, taking it away will have someone scream
> regression, because we're not taking it away from that super special
> use case.

Refusing to allow something because someone might find it useful has got
to be the worst reasoning I've heard. :)

On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>
> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>> For the temperature hints the only public user I know is rocksdb, and
>> that only started working when Hans fixed a brown paperbag bug in the
>> rocksdb code a while ago. Given that f2fs interprets the hints I suspect
>> something in the Android world does as well, maybe Bart knows more.
>
> UFS devices typically have less internal memory (SRAM) than the size of a
> single zone.

That wasn't quite the question. Do you know what application in Android
sets the fcntl temperature hints?

On Wed, Oct 30, 2024 at 11:05:03AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > And once it uses them by default, taking it away will have someone scream
> > regression, because we're not taking it away from that super special
> > use case.
>
> Refusing to allow something because someone might find it useful has got
> to be the worst reasoning I've heard. :)

That's not the point. The point is by locking us in we can't actually
do the proper thing. And that's what I'm really worried about.
Especially with the not all that great numbers in the success story.

On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
>
> > It sets temperature hints for different SST levels, which already
> > happens today. The last data point made some minor changes with
> > level-to-hint mapping.
>
> Do you have a pointer to the changes?

The change moves levels 2 and 3 to "MEDIUM" (along with 0 and 1 already
there), 4 to "LONG", and >= 5 remain "EXTREME". WAL continues to be
"SHORT", as before.

> > Without FDP:
> >
> > WAF:        2.72
> > IOPS:       1465
> > READ LAT:   2681us
> > UPDATE LAT: 3115us
> >
> > With FDP (rocksdb unmodified):
> >
> > WAF:        2.26
> > IOPS:       1473
> > READ LAT:   2415us
> > UPDATE LAT: 2807us
> >
> > With FDP (with some minor rocksdb changes):
> >
> > WAF:        1.67
> > IOPS:       1547
> > READ LAT:   1978us
> > UPDATE LAT: 2267us
>
> Compared to the numbers Hans presented at Plumbers for the Zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> which roughly double read and write IOPS and reduce the WAF to almost
> 1 this doesn't look too spectacular to be honest, but it sure is something.
>
> I just wish we could get the real infrastructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.

This doesn't have to be the end of placement streams development. I
fundamentally disagree that this locks anyone in to anything.

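The modified level-to-hint mapping described in this reply is small enough to restate as code. The sketch below is not RocksDB source: `level_to_hint()` and the enum are hypothetical names that only encode the assignment given in the mail (WAL is SHORT, levels 0 through 3 are MEDIUM, level 4 is LONG, levels 5 and above are EXTREME).

```c
/* Lifetime classes named in the reply, ordered shortest to longest. */
enum write_life {
	LIFE_SHORT,
	LIFE_MEDIUM,
	LIFE_LONG,
	LIFE_EXTREME,
};

/*
 * Hypothetical restatement of the modified mapping:
 *   WAL          -> SHORT
 *   levels 0..3  -> MEDIUM
 *   level 4      -> LONG
 *   levels >= 5  -> EXTREME
 */
static enum write_life level_to_hint(int level, int is_wal)
{
	if (is_wal)
		return LIFE_SHORT;
	if (level <= 3)
		return LIFE_MEDIUM;
	if (level == 4)
		return LIFE_LONG;
	return LIFE_EXTREME;
}
```

The mail describes the change relative to the earlier mapping only for levels 2 and 3, so the sketch does not attempt to reconstruct what those levels used before.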
On 10/30/24 10:14 AM, Christoph Hellwig wrote: > On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote: >> >> On 10/29/24 9:55 PM, Christoph Hellwig wrote: >>> For the temperature hints the only public user I known is rocksdb, and >>> that only started working when Hans fixed a brown paperbag bug in the >>> rocksdb code a while ago. Given that f2fs interprets the hints I suspect >>> something in the Android world does as well, maybe Bart knows more. >> >> UFS devices typically have less internal memory (SRAM) than the size of a >> single zone. > > That wasn't quite the question. Do you know what application in android > set the fcntl temperature hints? I do not know whether there are any Android apps that use the F_SET_(FILE_|)RW_HINT fcntls. The only use case in Android platform code I know of is this one: Daejun Park, "f2fs-tools: add write hint support", f2fs-dev mailing list, September 2024 (https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/). As you probably know f2fs-tools is a software package that includes fsck.f2fs. Jaegeuk, please correct me if necessary. Bart.
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote: > On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote: > > With FDP (with some minor rocksdb changes): > > > > WAF: 1.67 > > IOPS: 1547 > > READ LAT: 1978us > > UPDATE LAT: 2267us > > Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code, > which should work just fine with FDP IFF we exposed real write streams, > which roughly double read and write IOPS and reduce the WAF to almost > 1 this doesn't look too spectacular to be honest, but it sure is something. Hold up... I absolutely appreciate the work Hans is doing and has done. But are you talking about this talk? https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf That is very much apples-to-oranges. The B+ isn't on the same device being evaluated for WAF, where this has all that mixed in. I think the results are pretty good, all things considered. > I just wish we could get the real infrastructure instead of some band > aid, which makes it really hard to expose the real thing because now > it's been taken up and directly wired to a UAPI. > one I don't know what to make of this. I think we're talking past each other.
On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote: > > On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote: > > On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote: > > > With FDP (with some minor rocksdb changes): > > > > > > WAF: 1.67 > > > IOPS: 1547 > > > READ LAT: 1978us > > > UPDATE LAT: 2267us > > > > Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code, > > which should work just fine with FDP IFF we exposed real write streams, > > which roughly double read nad wirte IOPS and reduce the WAF to almost > > 1 this doesn't look too spectacular to be honest, but it sure it something. > > Hold up... I absolutely appreciate the work Hans is and has done. But > are you talking about this talk? > > https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf > > That is very much apples-to-oranges. The B+ isn't on the same device > being evaluated for WAF, where this has all that mixed in. I think the > results are pretty good, all things considered. No. The meta data IO is just 0.1% of all writes, so that we use a separate device for that in the benchmark really does not matter. Since we can achieve a WAF of ~1 for RocksDB on flash, why should we be content with another 67% of unwanted device side writes on top of that? It's of course impossible to compare your benchmark figures and mine directly since we are using different devices, but hey, we definitely have an opportunity here to make significant gains for FDP if we just provide the right kernel interfaces. Why shouldn't we expose the hardware in a way that enables the users to make the most out of it?
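The "another 67%" figure follows directly from the definition of write amplification. A minimal sketch of that arithmetic, with hypothetical helper names and the WAF values quoted in this thread as inputs:

```c
#include <assert.h>

/* WAF = bytes written to flash / bytes written by the host. */
static double waf(double device_bytes, double host_bytes)
{
    assert(host_bytes > 0.0);
    return device_bytes / host_bytes;
}

/* Extra device-side writes, in percent of host writes, implied by a WAF. */
static double extra_write_pct(double waf_value)
{
    return (waf_value - 1.0) * 100.0;
}

/* Relative reduction in total device writes when moving between two WAFs. */
static double device_write_reduction_pct(double waf_before, double waf_after)
{
    assert(waf_before > 0.0);
    return (waf_before - waf_after) / waf_before * 100.0;
}
```

With the quoted numbers, a WAF of 1.67 means about 67% extra device writes on top of the host writes, and going from 2.72 to 1.67 cuts total device writes by roughly 39% for the same host workload.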
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote: > No. The meta data IO is just 0.1% of all writes, so that we use a > separate device for that in the benchmark really does not matter. > > Since we can achieve a WAF of ~1 for RocksDB on flash, why should we > be content with another 67% of unwanted device side writes on top of > that? > > It's of course impossible to compare your benchmark figures and mine > directly since we are using different devices, but hey, we definitely > have an opportunity here to make significant gains for FDP if we just > provide the right kernel interfaces. I'll write code to do a 1:1 single device comparison over the weekend and Hans will test it once he is back.
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote: > On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote: > > That is very much apples-to-oranges. The B+ isn't on the same device > > being evaluated for WAF, where this has all that mixed in. I think the > > results are pretty good, all things considered. > > No. The meta data IO is just 0.1% of all writes, so that we use a > separate device for that in the benchmark really does not matter. It's very little spatially, but they overwrite differently than other data, creating many small holes in large erase blocks. > Since we can achieve a WAF of ~1 for RocksDB on flash, why should we > be content with another 67% of unwanted device side writes on top of > that? > > It's of course impossible to compare your benchmark figures and mine > directly since we are using different devices, but hey, we definitely > have an opportunity here to make significant gains for FDP if we just > provide the right kernel interfaces. > > Why shouldn't we expose the hardware in a way that enables the users > to make the most out of it? Because the people using this want this interface. Stalling for the last 6 months hasn't produced anything better, appealing to non-existent vaporware to block something ready-to-go that satisfies a need right now is just wasting everyone's time. Again, I absolutely disagree that this locks anyone in to anything. That's an overly dramatic excuse.
On 10/30, Bart Van Assche wrote: > On 10/30/24 10:14 AM, Christoph Hellwig wrote: > > On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote: > > > > > > On 10/29/24 9:55 PM, Christoph Hellwig wrote: > > > > For the temperature hints the only public user I known is rocksdb, and > > > > that only started working when Hans fixed a brown paperbag bug in the > > > > rocksdb code a while ago. Given that f2fs interprets the hints I suspect > > > > something in the Android world does as well, maybe Bart knows more. > > > > > > UFS devices typically have less internal memory (SRAM) than the size of a > > > single zone. > > > > That wasn't quite the question. Do you know what application in android > > set the fcntl temperature hints? > > I do not know whether there are any Android apps that use the > F_SET_(FILE_|)RW_HINT fcntls. > > The only use case in Android platform code I know of is this one: Daejun > Park, "f2fs-tools: add write hint support", f2fs-dev mailing list, > September 2024 (https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/). > As you probably know f2fs-tools is a software package that includes > fsck.f2fs. > > Jaegeuk, please correct me if necessary. Yes, f2fs-tools in Android calls fcntl(fd, F_SET_RW_HINT, &hint); > > Bart. > >
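For readers unfamiliar with the call mentioned above, here is a minimal sketch of setting and reading back the per-inode lifetime hint with fcntl(). The wrapper names are hypothetical; the fallback defines are only there for older libc headers and use the values from the Linux UAPI.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef F_SET_RW_HINT
#define F_GET_RW_HINT 1035
#define F_SET_RW_HINT 1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

/* Set the per-inode write lifetime hint; 0 on success, -1 with errno set. */
static int set_write_hint(int fd, uint64_t hint)
{
    return fcntl(fd, F_SET_RW_HINT, &hint);
}

/* Read the per-inode hint into *hint; 0 on success, -1 with errno set. */
static int get_write_hint(int fd, uint64_t *hint)
{
    assert(hint != NULL);
    return fcntl(fd, F_GET_RW_HINT, hint);
}
```

A caller would typically open the file, call `set_write_hint(fd, RWH_WRITE_LIFE_SHORT)`, and treat a failure as non-fatal, since the hint is advisory.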
On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote: > > On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote: > > > That is very much apples-to-oranges. The B+ isn't on the same device > > > being evaluated for WAF, where this has all that mixed in. I think the > > > results are pretty good, all things considered. > > > > No. The meta data IO is just 0.1% of all writes, so that we use a > > separate device for that in the benchmark really does not matter. > > It's very little spatially, but they overwrite differently than other > data, creating many small holes in large erase blocks. I don't really get how this could influence anything significantly (if at all). > > > Since we can achieve a WAF of ~1 for RocksDB on flash, why should we > > be content with another 67% of unwanted device side writes on top of > > that? > > > > It's of course impossible to compare your benchmark figures and mine > > directly since we are using different devices, but hey, we definitely > > have an opportunity here to make significant gains for FDP if we just > > provide the right kernel interfaces. > > > > Why shouldn't we expose the hardware in a way that enables the users > > to make the most out of it? > > Because the people using this want this interface. Stalling for the last > 6 months hasn't produced anything better, appealing to non-existent > vaporware to block something ready-to-go that satisfies a need right > now is just wasting everyone's time. > > Again, I absolutely disagree that this locks anyone in to anything. > That's an overly dramatic excuse. Locking in or not, to constructively move things forward (if we are now stuck on how to wire up fs support) I believe it would be worthwhile to prototype active FDP data placement in xfs and evaluate it. Happy to help out with that.
FDP and ZNS are different beasts, so I don't expect the results in the presentation to be directly translatable, but we can see what we can do. Is RocksDB the only file system user at the moment? Is the benchmark setup/config something that could be shared?
On 01.11.2024 08:16, Hans Holmberg wrote: >Locking in or not, to constructively move things forward (if we are >now stuck on how to wire up fs support) I believe it would be >worthwhile to prototype active fdp data placement in xfs and evaluate >it. Happy to help out with that. I appreciate your willingness to move things forward. I really mean it. I have talked several times in this thread about collaborating on the API that you have in mind. I would _very_ much like to have a common abstraction for ZNS, ZUFS, FDP, and whatever people build on other protocols. But without tangible patches showing this, we simply cannot block this anymore. > >Fdp and zns are different beasts, so I don't expect the results in the >presentation to be directly translatable but we can see what we can >do. > >Is RocksDB the only file system user at the moment? >Is the benchmark setup/config something that could be shared? It is a YCSB workload. You have the scripts here: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada If you have other standard workloads you want us to run, let me know and we will post the results to the list too. We will post the changes to the L3 placement in RocksDB. I think we can make them available somewhere for you to test before that. Let me get back to you on this.
On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote: > On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote: > > > No. The meta data IO is just 0.1% of all writes, so that we use a > > > separate device for that in the benchmark really does not matter. > > > > It's very little spatially, but they overwrite differently than other > > data, creating many small holes in large erase blocks. > > I don't really get how this could influence anything significantly (if at all). Fill your filesystem to near capacity, then continue using it for a few months. While the filesystem will report some available space, there may not be many good blocks available to erase. Maybe. > > Again, I absolutely disagree that this locks anyone in to anything. > > That's an overly dramatic excuse. > > Locking in or not, to constructively move things forward (if we are > now stuck on how to wire up fs support) But we're not stuck on how to wire up to fs. That part was settled and in kernel 10 years ago. We're stuck on wiring it down to the driver, which should have been the easiest part. > I believe it would be worthwhile to prototype active FDP data > placement in xfs and evaluate it. Happy to help out with that. When are we allowed to conclude evaluation? We have benefits my customers want on well tested kernels, and wish to proceed now. I'm not discouraging anyone from continuing further prototypes, innovations, and improvements. I'd like to spend more time doing that too, and merging something incrementally better doesn't prevent anyone from doing that. > FDP and ZNS are different beasts, so I don't expect the results in the > presentation to be directly translatable, but we can see what we can > do. > > Is RocksDB the only file system user at the moment? Rocks is the only open source one I know about. There are proprietary users, too.
I've pushed my branch that tries to make this work with the XFS data separation here: http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams This is basically my current WIP xfs zoned (aka always write out of place) work optimistically destined for 6.14 + the patch set in this thread + a little fix to make it work for nvme-multipath plus the tiny patch to wire it up. The good news is that the API from Keith mostly works. I don't really know how to cope with the streams per partition bitmap, and I suspect this will need to be dealt with a bit better. One option might be to always have a bitmap, which would also support discontiguous write stream numbers as actually supported by the underlying NVMe implementation; another option would be to always map to consecutive numbers. The bad news is that for file systems or applications to make full use of the API we also really need an API to expose how much space is left in a write stream, as otherwise they can easily get out of sync on a power fail. I've left that code in as a TODO; it should not affect basic testing. We get the same kind of performance numbers as the ZNS support on comparable hardware platforms, which is expected. Testing on actual state-of-the-art non-prototype hardware will take more time, as the capacities are big enough that getting serious numbers will take a lot longer.
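The two options for the per-partition stream bitmap can be related with a small helper: keep the bitmap internally, but hand out consecutive numbers to the caller. The following is a sketch only, assuming a 64-bit bitmap where bit i set means device write stream i is usable; the function names are hypothetical and not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the device stream ID behind the n-th (0-based) usable stream in
 * `bitmap`. Returns -1 if fewer than n + 1 bits are set.
 */
static int nth_usable_stream(uint64_t bitmap, unsigned int n)
{
    assert(n < 64); /* a 64-bit map describes at most 64 streams */
    for (int id = 0; id < 64; id++) {
        if (!(bitmap & (UINT64_C(1) << id)))
            continue;
        if (n == 0)
            return id;
        n--;
    }
    return -1;
}

/* Number of consecutive stream numbers that can be exposed to the caller. */
static unsigned int usable_stream_count(uint64_t bitmap)
{
    unsigned int count = 0;

    /* Clear the lowest set bit on each iteration. */
    for (; bitmap; bitmap &= bitmap - 1)
        count++;
    return count;
}
```

With this shape a caller only ever sees numbers 0..count-1, while discontiguous device stream IDs remain representable underneath.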
On Fri, Nov 1, 2024 at 3:49 PM Keith Busch <kbusch@kernel.org> wrote: > > On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote: > > On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote: > > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote: > > > > No. The meta data IO is just 0.1% of all writes, so that we use a > > > > separate device for that in the benchmark really does not matter. > > > > > > It's very little spatially, but they overwrite differently than other > > > data, creating many small holes in large erase blocks. > > > > I don't really get how this could influence anything significantly (if at all). > > Fill your filesystem to near capacity, then continue using it for a few > months. While the filesystem will report some available space, there > may not be many good blocks available to erase. Maybe. For *this* benchmark workload, the metadata IO is such a tiny fraction that I doubt the impact on WA could be measured. I completely agree it's a good idea to separate metadata from data blocks in general. It is actually a good reason for letting the file system control write stream allocation for all blocks :) > > I believe it would be worthwhile to prototype active fdp data > > placement in xfs and evaluate it. Happy to help out with that. > > When are we allowed to conclude evaluation? We have benefits my > customers want on well tested kernels, and wish to proceed now. Christoph has now wired up prototype support for FDP on top of the xfs-rt-zoned work + this patch set, and I have had time to look over it and have started doing some testing on HW. In addition to the FDP support, metadata can also be stored on the same block device as the data. Now that all placement handles are available, we can use the full data separation capabilities of the underlying storage, so that's good.
We can map out the placement handles to different write streams much like we assign open zones for zoned storage, and this opens up for supporting data placement heuristics for a wider range of use cases (not just the RocksDB use case discussed here). The big pieces that are missing from the FDP plumbing as I see it are the ability to read the reclaim unit size and syncing up the remaining capacity of the placement units with the file system allocation groups, but I guess that can be added later. I've started benchmarking on the hardware I have at hand, iterating on a good workload configuration. It will take some time to get to some robust write amp measurements since the drives are very big and require a painfully long warmup time.
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote: > I've pushed my branch that tries to make this work with the XFS > data separation here: > > http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams > > This is basically my current WIP xfs zoned (aka always write out place) > work optimistically destined for 6.14 + the patch set in this thread + > a little fix to make it work for nvme-multipath plus the tiny patch to > wire it up. > > The good news is that the API from Keith mostly works. I don't really > know how to cope with the streams per partition bitmap, and I suspect > this will need to be dealt with a bit better. One option might be > to always have a bitmap, which would also support discontiguous > write stream numbers as actually supported by the underlying NVMe > implementation, another option would be to always map to consecutive > numbers. Thanks for sharing that. Seeing the code makes it much easier to understand where you're trying to steer this. I'll take a look and probably have some feedback after a couple days going through it.
On 10/29/24 9:19 AM, Keith Busch wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 0247452837830..6e1985d3b306c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
>  			__u16	addr_len;
>  			__u16	__pad3[1];
>  		};
> +		struct {
> +			__u16	write_hint;
> +			__u16	__pad4[1];
> +		};
Might make more sense to have this overlap further down, with the passthrough command. That'd put it solidly out of anything that isn't passthrough or needs addr3.
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 7ce1cbc048faf..b5dea58356d93 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  		rw->kiocb.ki_ioprio = get_current_ioprio();
>  	}
>  	rw->kiocb.dio_complete = NULL;
> -
> +	if (ddir == ITER_SOURCE)
> +		rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
>  	rw->addr = READ_ONCE(sqe->addr);
>  	rw->len = READ_ONCE(sqe->len);
>  	rw->flags = READ_ONCE(sqe->rw_flags);
Can't we just read it unconditionally? I know it's a write hint, hence why checking for ITER_SOURCE, but if we can just set it regardless, then we don't need to branch around that.
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote: > I've pushed my branch that tries to make this work with the XFS > data separation here: > > http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams The zone block support all looks pretty neat, but I think you're making this harder than necessary to support streams. You don't need to treat these like a sequential write device. The controller side does its own garbage collection, so no need to duplicate the effort on the host. And it looks like the host side gc potentially merges multiple streams into a single gc stream, so that's probably not desirable.
On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote: > The zone block support all looks pretty neat, but I think you're making > this harder than necessary to support streams. You don't need to treat > these like a sequential write device. The controller side does its own > garbage collection, so no need to duplicate the effort on the host. And > it looks like the host side gc potentially merges multiple streams into > a single gc stream, so that's probably not desirable. We're not really duplicating much. Writing sequential is pretty easy, and tracking reclaim units separately means you need another tracking data structure, and either that or the LBA one is always going to be badly fragmented if they aren't the same.
On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote: > On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote: > > The zone block support all looks pretty neat, but I think you're making > > this harder than necessary to support streams. You don't need to treat > > these like a sequential write device. The controller side does its own > > garbage collection, so no need to duplicate the effort on the host. And > > it looks like the host side gc potentially merges multiple streams into > > a single gc stream, so that's probably not desirable. > > We're not really duplicating much. Writing sequential is pretty easy, > and tracking reclaim units separately means you need another tracking > data structure, and either that or the LBA one is always going to be > badly fragmented if they aren't the same. You're getting fragmentation anyway, which is why you had to implement gc. You're just shifting who gets to deal with it from the controller to the host. The host is further from the media, so you're starting from a disadvantage. The host gc implementation would have to be quite a bit better to justify the link and memory usage necessary for the copies (...queue a copy-offload discussion? oom?). This xfs implementation also has logic to recover from a power fail. The device already does that if you use the LBA abstraction instead of tracking sequential write pointers and free blocks. I think you are underestimating the duplication of efforts going on here.
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote: > On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote: > > We're not really duplicating much. Writing sequential is pretty easy, > > and tracking reclaim units separately means you need another tracking > > data structure, and either that or the LBA one is always going to be > > badly fragmented if they aren't the same. > > You're getting fragmentation anyway, which is why you had to implement > gc. You're just shifting who gets to deal with it from the controller to > the host. The host is further from the media, so you're starting from a > disadvantage. The host gc implementation would have to be quite a bit > better to justify the link and memory usage necessary for the copies > (...queue a copy-offload discussion? oom?). But the filesystem knows which blocks are actually in use. Sending TRIM/DISCARD information to the drive at block-level granularity hasn't worked out so well in the past. So the drive is the one at a disadvantage because it has to copy blocks which aren't actually in use. I like the idea of using copy-offload though.
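The cost of copying blocks that are no longer in use can be illustrated with a simplified model. This is an assumption for illustration, not something stated in the thread: each reclaimed unit holds a fixed number of blocks, the blocks still considered valid must be copied before the erase, and only the freed remainder is available for new host writes. The function name and the block counts in the usage note are hypothetical.

```c
#include <assert.h>

/*
 * Device writes per host write when reclaiming units of `unit_blocks`
 * blocks, of which `valid_blocks` must be copied before the erase:
 * (freed + copied) / freed = unit_blocks / (unit_blocks - valid_blocks).
 */
static double gc_write_amp(unsigned int unit_blocks, unsigned int valid_blocks)
{
    assert(valid_blocks < unit_blocks);
    return (double)unit_blocks / (double)(unit_blocks - valid_blocks);
}
```

Under this model, if the file system knows that only 40 of 100 blocks in a unit are live, reclaiming it costs about 1.67 device writes per host write; if the drive was never told about 20 further freed blocks and treats 60 as valid, the same reclaim costs 2.5.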
> -----Original Message----- > From: Matthew Wilcox <willy@infradead.org> > Sent: Friday, November 8, 2024 5:55 PM > To: Keith Busch <kbusch@kernel.org> > Cc: Christoph Hellwig <hch@lst.de>; Keith Busch <kbusch@meta.com>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; linux-scsi@vger.kernel.org; > io-uring@vger.kernel.org; linux-fsdevel@vger.kernel.org; joshi.k@samsung.com; > Javier Gonzalez <javier.gonz@samsung.com>; bvanassche@acm.org > Subject: Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams > > On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote: > > On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote: > > > We're not really duplicating much. Writing sequential is pretty easy, > > > and tracking reclaim units separately means you need another tracking > > > data structure, and either that or the LBA one is always going to be > > > badly fragmented if they aren't the same. > > > > You're getting fragmentation anyway, which is why you had to implement > > gc. You're just shifting who gets to deal with it from the controller to > > the host. The host is further from the media, so you're starting from a > > disadvantage. The host gc implementation would have to be quite a bit > > better to justify the link and memory usage necessary for the copies > > (...queue a copy-offload discussion? oom?). > > But the filesystem knows which blocks are actually in use. Sending > TRIM/DISCARD information to the drive at block-level granularity hasn't > worked out so well in the past. So the drive is the one at a disadvantage > because it has to copy blocks which aren't actually in use. It is true that trim has not been great. I would say that at least enterprise SSDs have fixed this in general. For FDP, DSM Deallocate is respected, which provides a good "erase" interface to the host. It is true though that this is not properly described in the spec and we should fix it. > > I like the idea of using copy-offload though.
We have been iterating on the patches for years, but it is unfortunately one of those series that go in circles forever. I don't think it is due to any specific problem, but mostly due to unaligned requests from different folks reviewing. Last time I talked to Damien he asked me to send the patches again; we have not followed through due to bandwidth. If there is an interest, we can re-spin this again...
On 11/8/24 9:43 AM, Javier Gonzalez wrote:
> If there is an interest, we can re-spin this again...
I'm interested. Work is ongoing in JEDEC on support for copy offloading
for UFS devices. This work involves standardizing which SCSI copy
offloading features should be supported and which features are not
required. Implementations are expected to be available soon.
Thanks,
Bart.
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote: > You're getting fragmentation anyway, which is why you had to implement > gc. A general purpose file system always has fragmentation of some kind, even if it manages to avoid it for certain workloads with cooperative applications. If there was magic pixie dust to ensure free space never fragments, file system development would be a solved problem :) > You're just shifting who gets to deal with it from the controller to > the host. The host is further from the media, so you're starting from a > disadvantage. And the controller is further from the application and misses a lot of information like, say, the file structure, so it inherently is at a disadvantage. > The host gc implementation would have to be quite a bit > better to justify the link and memory usage necessary for the copies That assumes you still have to do device GC. If you do align to the zone/erase (super)block/reclaim unit boundaries you don't. > This xfs implementation also has logic to recover from a power fail. The > device already does that if you use the LBA abstraction instead of > tracking sequential write pointers and free blocks. Every file system has logic to recover from a power fail. I'm not sure what kind of discussion you're trying to kick off here. > I think you are underestimating the duplication of efforts going on > here. I'm still not sure what discussion you're trying to start here. There is very little work in here, and it is work required to support SMR drives. It turns out that for a fair amount of workloads it actually works really well on SSDs as well, beating everything else we've tried.
On Fri, Nov 08, 2024 at 04:54:34PM +0000, Matthew Wilcox wrote:
> I like the idea of using copy-offload though.
FYI, the XFS GC code is written so that copy offload can be easily
plugged into it. We'll have to see how beneficial it actually is,
but at least it should give us a good test platform.
On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: > We have been iterating in the patches for years, but it is unfortunately > one of these series that go in circles forever. I don't think it is due > to any specific problem, but mostly due to unaligned requests form > different folks reviewing. Last time I talked to Damien he asked me to > send the patches again; we have not followed through due to bandwidth. A big problem is that it actually lacks a killer use case. If you'd actually manage to plug it into an in-kernel user and show a real speedup people might actually be interested in it and help optimizing for it.
On 11.11.2024 07:51, Christoph Hellwig wrote: >On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >> We have been iterating in the patches for years, but it is unfortunately >> one of these series that go in circles forever. I don't think it is due >> to any specific problem, but mostly due to unaligned requests from >> different folks reviewing. Last time I talked to Damien he asked me to >> send the patches again; we have not followed through due to bandwidth. > >A big problem is that it actually lacks a killer use case. If you'd >actually manage to plug it into an in-kernel user and show a real >speedup people might actually be interested in it and help optimizing >for it. > Agree. Initially it was all about ZNS. Seems ZUFS can use it. Then we saw good results in offload to target on NVMe-OF, similar to copy_file_range, but that does not seem to be enough. You seem to indicate too that XFS can use it for GC. We can try putting a new series out to see where we are...
On 08.11.2024 10:51, Bart Van Assche wrote: >On 11/8/24 9:43 AM, Javier Gonzalez wrote: >>If there is an interest, we can re-spin this again... > >I'm interested. Work is ongoing in JEDEC on support for copy offloading >for UFS devices. This work involves standardizing which SCSI copy >offloading features should be supported and which features are not >required. Implementations are expected to be available soon. > Do you have any specific blockers on the last series? I know you have left comments in many of the patches already, but I think we are all a bit confused on where we are ATM.
On 11.11.24 10:31, Javier Gonzalez wrote: > On 11.11.2024 07:51, Christoph Hellwig wrote: >> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>> We have been iterating in the patches for years, but it is unfortunately >>> one of these series that go in circles forever. I don't think it is due >>> to any specific problem, but mostly due to unaligned requests form >>> different folks reviewing. Last time I talked to Damien he asked me to >>> send the patches again; we have not followed through due to bandwidth. >> >> A big problem is that it actually lacks a killer use case. If you'd >> actually manage to plug it into an in-kernel user and show a real >> speedup people might actually be interested in it and help optimizing >> for it. >> > > Agree. Initially it was all about ZNS. Seems ZUFS can use it. > > Then we saw good results in offload to target on NVMe-OF, similar to > copy_file_range, but that does not seem to be enough. You seem to > indicacte too that XFS can use it for GC. > > We can try putting a new series out to see where we are... I don't want to sound like a broken record, but I've said more than once, that btrfs (regardless of zoned or non-zoned) would be very interested in that as well and I'd be willing to help with the code or even do it myself once the block bits are in. But apparently my voice doesn't count here
On 11.11.2024 09:37, Johannes Thumshirn wrote: >On 11.11.24 10:31, Javier Gonzalez wrote: >> On 11.11.2024 07:51, Christoph Hellwig wrote: >>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>>> We have been iterating in the patches for years, but it is unfortunately >>>> one of these series that go in circles forever. I don't think it is due >>>> to any specific problem, but mostly due to unaligned requests form >>>> different folks reviewing. Last time I talked to Damien he asked me to >>>> send the patches again; we have not followed through due to bandwidth. >>> >>> A big problem is that it actually lacks a killer use case. If you'd >>> actually manage to plug it into an in-kernel user and show a real >>> speedup people might actually be interested in it and help optimizing >>> for it. >>> >> >> Agree. Initially it was all about ZNS. Seems ZUFS can use it. >> >> Then we saw good results in offload to target on NVMe-OF, similar to >> copy_file_range, but that does not seem to be enough. You seem to >> indicacte too that XFS can use it for GC. >> >> We can try putting a new series out to see where we are... > >I don't want to sound like a broken record, but I've said more than >once, that btrfs (regardless of zoned or non-zoned) would be very >interested in that as well and I'd be willing to help with the code or >even do it myself once the block bits are in. > >But apparently my voice doesn't count here You are right. Sorry I forgot. Would this be through copy_file_range or something different?
On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote: > You are right. Sorry I forgot. > > Would this be through copy_file_range or something different? Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user would be the file system GC code.
On 11.11.24 10:41, Javier Gonzalez wrote: > On 11.11.2024 09:37, Johannes Thumshirn wrote: >> On 11.11.24 10:31, Javier Gonzalez wrote: >>> On 11.11.2024 07:51, Christoph Hellwig wrote: >>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>>>> We have been iterating in the patches for years, but it is unfortunately >>>>> one of these series that go in circles forever. I don't think it is due >>>>> to any specific problem, but mostly due to unaligned requests form >>>>> different folks reviewing. Last time I talked to Damien he asked me to >>>>> send the patches again; we have not followed through due to bandwidth. >>>> >>>> A big problem is that it actually lacks a killer use case. If you'd >>>> actually manage to plug it into an in-kernel user and show a real >>>> speedup people might actually be interested in it and help optimizing >>>> for it. >>>> >>> >>> Agree. Initially it was all about ZNS. Seems ZUFS can use it. >>> >>> Then we saw good results in offload to target on NVMe-OF, similar to >>> copy_file_range, but that does not seem to be enough. You seem to >>> indicacte too that XFS can use it for GC. >>> >>> We can try putting a new series out to see where we are... >> >> I don't want to sound like a broken record, but I've said more than >> once, that btrfs (regardless of zoned or non-zoned) would be very >> interested in that as well and I'd be willing to help with the code or >> even do it myself once the block bits are in. >> >> But apparently my voice doesn't count here > > You are right. Sorry I forgot. > > Would this be through copy_file_range or something different? > Unfortunately not, btrfs' reclaim/balance path is a wrapper on top of buffered read and write (plus some extra things). _BUT_ this makes it possible to switch the read/write part and do copy offload (where possible).
On 11.11.2024 09:43, Johannes Thumshirn wrote: >On 11.11.24 10:41, Javier Gonzalez wrote: >> On 11.11.2024 09:37, Johannes Thumshirn wrote: >>> On 11.11.24 10:31, Javier Gonzalez wrote: >>>> On 11.11.2024 07:51, Christoph Hellwig wrote: >>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>>>>> We have been iterating in the patches for years, but it is unfortunately >>>>>> one of these series that go in circles forever. I don't think it is due >>>>>> to any specific problem, but mostly due to unaligned requests form >>>>>> different folks reviewing. Last time I talked to Damien he asked me to >>>>>> send the patches again; we have not followed through due to bandwidth. >>>>> >>>>> A big problem is that it actually lacks a killer use case. If you'd >>>>> actually manage to plug it into an in-kernel user and show a real >>>>> speedup people might actually be interested in it and help optimizing >>>>> for it. >>>>> >>>> >>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it. >>>> >>>> Then we saw good results in offload to target on NVMe-OF, similar to >>>> copy_file_range, but that does not seem to be enough. You seem to >>>> indicacte too that XFS can use it for GC. >>>> >>>> We can try putting a new series out to see where we are... >>> >>> I don't want to sound like a broken record, but I've said more than >>> once, that btrfs (regardless of zoned or non-zoned) would be very >>> interested in that as well and I'd be willing to help with the code or >>> even do it myself once the block bits are in. >>> >>> But apparently my voice doesn't count here >> >> You are right. Sorry I forgot. >> >> Would this be through copy_file_range or something different? >> > >Unfortunately not, brtfs' reclaim/balance path is a wrapper on top of >buffered read and write (plus some extra things). _BUT_ this makes it >possible to switch the read/write part and do copy offload (where possible). 
On 11.11.2024 10:42, hch wrote: >On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote: >> You are right. Sorry I forgot. >> >> Would this be through copy_file_range or something different? > >Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user >would be the file system GC code. Replying to both. Thanks. Makes sense. Now that we can take a look at your branch, we can think about what this would look like.
On 11/11/24 1:31 AM, Javier Gonzalez wrote: > On 08.11.2024 10:51, Bart Van Assche wrote: >> On 11/8/24 9:43 AM, Javier Gonzalez wrote: >>> If there is an interest, we can re-spin this again... >> >> I'm interested. Work is ongoing in JEDEC on support for copy offloading >> for UFS devices. This work involves standardizing which SCSI copy >> offloading features should be supported and which features are not >> required. Implementations are expected to be available soon. > > Do you have any specific blockers on the last series? I know you have > left comments in many of the patches already, but I think we are all a > bit confused on where we are ATM. Nobody replied to this question that was raised 4 months ago: https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/ I think we need to agree about the answer to that question before we can continue with implementing copy offloading. Thanks, Bart.
On 11/11/24 09:45AM, Bart Van Assche wrote:
>On 11/11/24 1:31 AM, Javier Gonzalez wrote:
>>On 08.11.2024 10:51, Bart Van Assche wrote:
>>>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>>>If there is an interest, we can re-spin this again...
>>>
>>>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>>>for UFS devices. This work involves standardizing which SCSI copy
>>>offloading features should be supported and which features are not
>>>required. Implementations are expected to be available soon.
>>
>>Do you have any specific blockers on the last series? I know you have
>>left comments in many of the patches already, but I think we are all a
>>bit confused on where we are ATM.
>
>Nobody replied to this question that was raised 4 months ago:
>https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/
>
>I think we need to agree about the answer to that question before we can
>continue with implementing copy offloading.
>

Yes, I feel the same.
The blocker with copy has been how we should plumb things in the block layer.
We tried a couple of approaches in the past[1].
Restating them for reference:

1. Payload based approach:
  a. Based on Mikulas' patch; a common payload is used for both the source
  and the destination bio.
  b. Initially we send the source bio; upon reaching the driver we update
  the payload and complete the bio.
  c. Send the destination bio; in the driver layer we recover the source
  info from the payload and send the copy command to the device.

  Drawback:
  The request payload contains IO information rather than data.
  Based on past experience, Christoph and Bart suggested this is not a good
  way forward.
  The alternate suggestion from Christoph was to use separate BIOs for src
  and destination and match them using a token/id.
  As Bart pointed out, I find it hard to see how to match them when an IO
  split happens.

2. Plug based approach:
  a. Take a plug, send the destination bio, form a request and wait for
  the src bio.
  b. Send the source bio, merge it with the destination bio.
  c. Upon release of the plug, send the request down to the driver.

  Drawback:
  Doesn't work for stacked devices that have async submission.
  Bart suggested this is not a good solution overall.
  The alternate suggestion was to use a list based approach.
  But we observed lifetime management problems, especially in failure
  handling.

3. Single bio approach:
  a. Use a single bio to represent both src and dst info.
  b. Use abnormal IO handling similar to discard.

  Drawback:
  Christoph pointed out that this has the issue of the payload containing
  information for both the IO stack and the wire.

I am really torn on how to proceed further.

-- Nitesh Shetty

[1] https://lore.kernel.org/linux-block/20240624103212.2donuac5apwwqaor@nj.shetty@samsung.com/
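[Editorial illustration, not part of the original message.] Steps a to c of the payload based approach above can be modelled in plain user-space C. This is only a rough sketch under stated assumptions: `struct copy_payload`, `struct copy_cmd`, `driver_copy_src()` and `driver_copy_dst()` are hypothetical names invented here, not kernel structures or functions from any posted series, and no real bio handling is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload shared by the source and destination phases. */
struct copy_payload {
	bool     src_valid;   /* set once the source phase reached the "driver" */
	uint64_t src_sector;  /* first source sector */
	uint32_t nr_sectors;  /* length of the copy in sectors */
};

/* Hypothetical single copy command the "driver" would send to the device. */
struct copy_cmd {
	uint64_t src_sector;
	uint64_t dst_sector;
	uint32_t nr_sectors;
};

/* Step b: the source request reaches the driver, which only records the
 * source range in the payload and completes the request. */
static int driver_copy_src(struct copy_payload *p, uint64_t sector, uint32_t nr)
{
	assert(p != NULL);
	if (nr == 0)
		return -1;
	p->src_sector = sector;
	p->nr_sectors = nr;
	p->src_valid = true;
	return 0;
}

/* Step c: the destination request reaches the driver, which recovers the
 * source info from the payload and builds the copy command. */
static int driver_copy_dst(struct copy_payload *p, uint64_t sector,
			   uint32_t nr, struct copy_cmd *cmd)
{
	assert(p != NULL && cmd != NULL);
	/* Refuse if the source phase never ran or the lengths disagree. */
	if (!p->src_valid || p->nr_sectors != nr)
		return -1;
	cmd->src_sector = p->src_sector;
	cmd->dst_sector = sector;
	cmd->nr_sectors = nr;
	p->src_valid = false;	/* the payload is single-use in this model */
	return 0;
}
```

The sketch also makes the stated drawback visible: the payload carries IO metadata (sector and length) rather than data, and nothing in it describes what should happen if either phase is split on the way down.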
Nitesh,

> 1.payload based approach:
> a. Based on Mikulas patch, here a common payload is used for both source
> and destination bio.
> b. Initially we send source bio, upon reaching driver we update payload
> and complete the bio.
> c. Send destination bio, in driver layer we recover the source info
> from the payload and send the copy command to device.
>
> Drawback:
> Request payload contains IO information rather than data.
> Based on past experience Christoph and Bart suggested not a good way
> forward.
> Alternate suggestion from Christoph was to used separate BIOs for src
> and destination and match them using token/id.
> As Bart pointed, I find it hard how to match when the IO split happens.

In my experience the payload-based approach was what made things work. I
tried many things before settling on that.

Also note that to support token-based SCSI devices, you inevitably need to
separate the read/copy_in operation from the write/copy_out ditto and carry
the token in the payload. For "single copy command" devices, you can just
synthesize the token in the driver. Although I don't really know what the
point of the token is in that case because as far as I'm concerned, the
only interesting information is that the read/copy_in operation made it
down the stack without being split.

Handling splits made things way too complicated for my taste. Especially
with a potential many-to-many mapping. Better to just fall back to regular
read/writes if either the copy_in or the copy_out operation needs to be
split.

If your stacked storage is configured with a prohibitively small stripe
chunk size, then your copy performance is just going to be approaching that
of a regular read/write data movement. Not a big deal as far as I'm
concerned...
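[Editorial illustration, not part of the original message.] The policy described above, falling back to regular reads and writes whenever either the copy_in or the copy_out side would need to be split, can be sketched as a small decision function. This is a simplified user-space model, assuming a split is needed exactly when a range crosses a fixed chunk boundary; `range_needs_split()`, `choose_copy_path()` and the `copy_path` enum are hypothetical names, and real block-layer split rules depend on more limits than a single chunk size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of the decision: offload the copy or move the data
 * with ordinary reads and writes. */
enum copy_path { COPY_OFFLOAD, COPY_FALLBACK_RW };

/* Returns true if [start, start + nr) crosses a chunk boundary.
 * chunk_sectors == 0 is taken to mean "no chunk limit" in this model. */
static bool range_needs_split(uint64_t start, uint32_t nr, uint32_t chunk_sectors)
{
	assert(nr > 0);
	if (chunk_sectors == 0)
		return false;
	return start / chunk_sectors != (start + nr - 1) / chunk_sectors;
}

/* Offload only if neither side would be split; otherwise fall back. */
static enum copy_path choose_copy_path(uint64_t src, uint64_t dst, uint32_t nr,
					uint32_t src_chunk, uint32_t dst_chunk)
{
	if (range_needs_split(src, nr, src_chunk) ||
	    range_needs_split(dst, nr, dst_chunk))
		return COPY_FALLBACK_RW;
	return COPY_OFFLOAD;
}
```

With a very small chunk size almost every range crosses a boundary in this model, so nearly all copies take the fallback path, which is the "approaching regular read/write" behaviour mentioned above.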
From: Keith Busch <kbusch@kernel.org>

Changes from v9:

  Document the partition hint mask

  Use bitmap_alloc API

  Fixup bitmap memory leak

  Return invalid value if user requests an invalid write hint

  Added and exported a block device feature flag for indicating generic
  placement hint support

  Added statx write hint max field

  Added BUILD_BUG_ON check for new io_uring SQE fields.

  Added reviews

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 13 +++++
 block/bdev.c                         | 18 ++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 ++
 block/fops.c                         | 31 +++++++++-
 block/partitions/core.c              | 44 ++++++++++++++-
 drivers/nvme/host/core.c             | 84 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 +++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  3 +-
 20 files changed, 253 insertions(+), 11 deletions(-)