Message ID: 20191212183816.102402-1-damien.lemoal@wdc.com (mailing list archive)
Series: New zonefs file system
On 12.12.19 19:38, Damien Le Moal wrote:

Hi,

> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
> the sequential write constraint of zoned block devices from the user.

Just curious: what's the exact definition of "zoned" here?
Something like partitions?

Can these files then also serve as block devices for other filesystems?
Just a funny idea: could we handle partitions by a file system?

Even more funny idea: give file systems block device ops, so they can
be directly used as such (w/o explicitly using loopdev) ;-)

> Files representing sequential write zones of the device must be written
> sequentially starting from the end of the file (append only writes).

So, these files can only be accessed like a tape?

Assuming you're working on top of standard block devices anyway (instead
of tape-like media ;-)) - why introduce such a limitation?

> zonefs is not a POSIX compliant file system. Its goal is to simplify
> the implementation of zoned block device support in applications by
> replacing raw block device file accesses with a richer file based API,
> avoiding relying on direct block device file ioctls which may
> be more obscure to developers.

ioctls?

Last time I checked, block devices could be easily accessed via plain
file ops (read, write, seek, ...). You can basically treat them just
like big files of fixed size.

> One example of this approach is the
> implementation of LSM (log-structured merge) tree structures (such as
> used in RocksDB and LevelDB)

The same LevelDB as used e.g. in Chrome, which destroys itself
every time a little temporary problem (e.g. disk full) occurs?
If that's the use case, I'd rather use a simple in-memory table
and enough swap, as LevelDB isn't reliable enough for persistent
data anyway :p

> on zoned block devices by allowing SSTables
> to be stored in a zone file similarly to a regular file system rather
> than as a range of sectors of a zoned device. The introduction of the
> higher level construct "one file is one zone" can help reduce the
> amount of changes needed in the application while at the same time
> allowing the use of zoned block devices with various programming
> languages other than C.

Why not simply use files on a suited filesystem (w/ low block IO
overhead) or LVM volumes?

--mtx
On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
>
> Hi,
>
> > zonefs is a very simple file system exposing each zone of a zoned block
> > device as a file. Unlike a regular file system with zoned block device
> > support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
> > the sequential write constraint of zoned block devices from the user.
>
> Just curious: what's the exact definition of "zoned" here?
> Something like partitions?

Zones inside an SMR HDD.

> Can these files then also serve as block devices for other filesystems?
> Just a funny idea: could we handle partitions by a file system?
>
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)
>
> > Files representing sequential write zones of the device must be written
> > sequentially starting from the end of the file (append only writes).
>
> So, these files can only be accessed like a tape?

On an SMR HDD, each zone can only be written sequentially, due to physics
constraints. I won't post any link with references because I think majordomo
will spam my email if I do, but do a google search for something like 'SMR
HDD zones' and you'll get a better idea.

> Assuming you're working on top of standard block devices anyway (instead
> of tape-like media ;-)) - why introduce such a limitation?

The limitation is already there on SMR drives; some of them (drive-managed
models) just hide it from the system.

> > zonefs is not a POSIX compliant file system. Its goal is to simplify
> > the implementation of zoned block device support in applications by
> > replacing raw block device file accesses with a richer file based API,
> > avoiding relying on direct block device file ioctls which may
> > be more obscure to developers.
>
> ioctls?
>
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.
>
> > One example of this approach is the
> > implementation of LSM (log-structured merge) tree structures (such as
> > used in RocksDB and LevelDB)
>
> The same LevelDB as used e.g. in Chrome, which destroys itself
> every time a little temporary problem (e.g. disk full) occurs?
> If that's the use case, I'd rather use a simple in-memory table
> and enough swap, as LevelDB isn't reliable enough for persistent
> data anyway :p
>
> > on zoned block devices by allowing SSTables
> > to be stored in a zone file similarly to a regular file system rather
> > than as a range of sectors of a zoned device. The introduction of the
> > higher level construct "one file is one zone" can help reduce the
> > amount of changes needed in the application while at the same time
> > allowing the use of zoned block devices with various programming
> > languages other than C.
>
> Why not simply use files on a suited filesystem (w/ low block IO
> overhead) or LVM volumes?
>
> --mtx
>
> --
> Urgent notice: due to the existential threat posed by "Emotet", you
> should *never* accept/open MS Office documents received via e-mail,
> even when they appear to come from supposedly trustworthy senders.
> Otherwise you risk total loss.
> ---
> Note: unencrypted e-mails can easily be intercepted and manipulated!
> For confidential communication, please send your GPG/PGP key.
> ---
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> info@metux.net -- +49-151-27565287
On Mon, Dec 16, 2019 at 10:36:00AM +0100, Carlos Maiolino wrote:
> On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
> > On 12.12.19 19:38, Damien Le Moal wrote:
> >
> > Hi,
> >
> > > zonefs is a very simple file system exposing each zone of a zoned block
> > > device as a file. Unlike a regular file system with zoned block device
> > > support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
> > > the sequential write constraint of zoned block devices from the user.
> >
> > Just curious: what's the exact definition of "zoned" here?
> > Something like partitions?
>
> Zones inside an SMR HDD.

Btw, the zoned device concept is not limited to HDDs. I'm not sure now
if the patchset itself also targets SMR devices or is more focused on
zoned SSDs, but either way, the limitation that each zone can only be
written sequentially still applies.
On 2019/12/16 17:19, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
>
> Hi,
>
>> zonefs is a very simple file system exposing each zone of a zoned block
>> device as a file. Unlike a regular file system with zoned block device
>> support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
>> the sequential write constraint of zoned block devices from the user.
>
> Just curious: what's the exact definition of "zoned" here?
> Something like partitions?

As Carlos commented already, a zoned block device is a Linux abstraction
used to handle SMR HDDs (Shingled Magnetic Recording). These disks expose
an LBA range that is divided into zones that can only be written
sequentially for host-managed models. Other models, such as host-aware or
drive-managed, allow random writes to all zones at the cost of potentially
serious performance degradation due to disk-internal garbage collection of
zones (similar to an SSD's handling of erase blocks).

While today zoned block devices exist on the market only in the form of
SMR disks, NVMe SSDs will also soon be available with the completion of
the Zoned Namespace specifications. Zoning of block devices has several
advantages: higher capacities for HDDs, and more predictable and lower IO
latencies for SSDs (almost no internal GC/wear leveling needed). But
taking full advantage of these devices requires software changes on the
host due to the sequential write constraint imposed by the device
interface.

> Can these files then also serve as block devices for other filesystems?
> Just a funny idea: could we handle partitions by a file system?
>
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)

This is outside the scope of this thread, so let's not start a discussion
about this here. Start a new thread!

>> Files representing sequential write zones of the device must be written
>> sequentially starting from the end of the file (append only writes).
>
> So, these files can only be accessed like a tape?

Writes must be sequential within a zone, but reads can be random to any
written LBA.

> Assuming you're working on top of standard block devices anyway (instead
> of tape-like media ;-)) - why introduce such a limitation?

See above: the limitation is physical, imposed by the device, so that
different improvements can be achieved depending on the storage medium
being used (increased capacity, lower latencies, lower over-provisioning,
etc.).

>> zonefs is not a POSIX compliant file system. Its goal is to simplify
>> the implementation of zoned block device support in applications by
>> replacing raw block device file accesses with a richer file based API,
>> avoiding relying on direct block device file ioctls which may
>> be more obscure to developers.
>
> ioctls?
>
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.

I was not clear, my apologies. I am referring here to the zoned block
device related ioctls defined in include/uapi/linux/blkzoned.h. These
ioctls allow an application to manage the device zones (obtain zone
information, erase zones, etc.). These ioctls trigger issuing zone-related
commands to the device. These commands are defined by the ZBC and ZAC
standards for SCSI and ATA, and NVMe Zoned Namespace in the very near
future.
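To make this concrete, getting zone information from user space with
these ioctls looks roughly like this (a minimal, untested sketch; error
handling is trimmed and the 16-zone report size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	unsigned int i, nr_zones = 16;
	struct blk_zone_report *rep;
	int fd;

	if (argc != 2)
		return 1;

	/* e.g. /dev/sdX for a host-managed SMR disk */
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* Report header followed by nr_zones zone descriptors */
	rep = calloc(1, sizeof(*rep) + nr_zones * sizeof(struct blk_zone));
	rep->sector = 0;		/* start reporting from the first zone */
	rep->nr_zones = nr_zones;	/* in: max zones, out: zones reported */

	if (ioctl(fd, BLKREPORTZONE, rep) < 0)
		return 1;

	for (i = 0; i < rep->nr_zones; i++)
		printf("zone %u: start %llu, len %llu, wp %llu\n", i,
		       (unsigned long long)rep->zones[i].start,
		       (unsigned long long)rep->zones[i].len,
		       (unsigned long long)rep->zones[i].wp);

	free(rep);
	close(fd);
	return 0;
}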
>> One example of this approach is the
>> implementation of LSM (log-structured merge) tree structures (such as
>> used in RocksDB and LevelDB)
>
> The same LevelDB as used e.g. in Chrome, which destroys itself
> every time a little temporary problem (e.g. disk full) occurs?
> If that's the use case, I'd rather use a simple in-memory table
> and enough swap, as LevelDB isn't reliable enough for persistent
> data anyway :p

The intent of my comment was not to advocate for or discuss the merits of
any particular KV implementation. I was only pointing out that zonefs does
not come in a void: we do have use cases for it and did the work on some
user space software to validate it. LevelDB and RocksDB are the two
LSM-tree based KV stores we worked on, as they are very popular and widely
used.

>> on zoned block devices by allowing SSTables
>> to be stored in a zone file similarly to a regular file system rather
>> than as a range of sectors of a zoned device. The introduction of the
>> higher level construct "one file is one zone" can help reduce the
>> amount of changes needed in the application while at the same time
>> allowing the use of zoned block devices with various programming
>> languages other than C.
>
> Why not simply use files on a suited filesystem (w/ low block IO
> overhead) or LVM volumes?

Using a file system compliant with zoned block device constraints, such as
f2fs or btrfs (on-going work), is certainly a valid approach. However,
this may not be the most optimal one if the application being used has a
mostly sequential write behavior. LSM-tree based KV stores fall into this
category: SSTables are large (several MB) and always written sequentially.
There are no random writes, which facilitates supporting zoned block
devices directly, without the need for a file system which would add a GC
background process and degrade performance. As mentioned in the cover
letter, zonefs's goal is to facilitate the implementation of this support
compared to pure raw block device use.
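To give a feel for the difference, appending an SSTable chunk to a zone
file then becomes plain file IO. A minimal, untested sketch, assuming
direct IO writes at the end of the file per the append-only rule above
(the helper name and the 4096-byte alignment are illustrative
assumptions, not from the patch set):

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Append a chunk to a sequential zone file. The write must land exactly
 * at the current end of the file (the zone write pointer) and, with
 * O_DIRECT, the buffer and length must be block aligned.
 */
int append_chunk(const char *path, const void *data, size_t len)
{
	struct stat st;
	void *buf;
	int fd, ret = -1;

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return -1;

	/* st_size reflects the current zone write pointer position */
	if (fstat(fd, &st))
		goto out;

	if (posix_memalign(&buf, 4096, len))
		goto out;
	memcpy(buf, data, len);

	if (pwrite(fd, buf, len, st.st_size) == (ssize_t)len)
		ret = 0;

	free(buf);
out:
	close(fd);
	return ret;
}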
On 2019/12/16 18:42, Carlos Maiolino wrote:
> On Mon, Dec 16, 2019 at 10:36:00AM +0100, Carlos Maiolino wrote:
>> On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
>>> On 12.12.19 19:38, Damien Le Moal wrote:
>>>
>>> Hi,
>>>
>>>> zonefs is a very simple file system exposing each zone of a zoned block
>>>> device as a file. Unlike a regular file system with zoned block device
>>>> support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
>>>> the sequential write constraint of zoned block devices from the user.
>>>
>>> Just curious: what's the exact definition of "zoned" here?
>>> Something like partitions?
>>
>> Zones inside an SMR HDD.
>
> Btw, the zoned device concept is not limited to HDDs. I'm not sure now
> if the patchset itself also targets SMR devices or is more focused on
> zoned SSDs, but either way, the limitation that each zone can only be
> written sequentially still applies.

zonefs supports any block device that advertises itself as "zoned"
(blk_queue_is_zoned(q) is true) through the zoned block device abstraction
(block/blk-zoned.c). This includes all SMR HDDs (both SCSI and ATA),
null_blk devices with zoned mode enabled, and dm-linear devices built on
top of zoned devices. On the SSD front, the NVMe Zoned Namespace standard
is still a draft being worked on by the NVMe committee, and no devices are
available on the market yet.
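From user space, the easiest way to see whether a disk uses this
abstraction is the queue/zoned sysfs attribute; a small sketch (the
device name sda is only an example):

#include <stdio.h>

/* Prints the disk zone model: "none", "host-aware" or "host-managed". */
int main(void)
{
	char model[32] = "";
	FILE *f = fopen("/sys/block/sda/queue/zoned", "r");

	if (!f)
		return 1;
	if (fgets(model, sizeof(model), f))
		printf("zone model: %s", model);
	fclose(f);
	return 0;
}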
On 16.12.19 10:35, Carlos Maiolino wrote:

Hi,

>> Just curious: what's the exact definition of "zoned" here?
>> Something like partitions?
>
> Zones inside an SMR HDD.

Oh, I wasn't aware that those things are exposed to the host at all.
Are you dealing with host-managed SMR HDDs?

> On an SMR HDD, each zone can only be written sequentially, due to physics
> constraints. I won't post any link with references because I think majordomo
> will spam my email if I do, but do a google search for something like 'SMR
> HDD zones' and you'll get a better idea.

Reminds me of classic CD-Rs or tapes. Why not deal with them similarly?

--mtx

---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
On 17.12.19 01:26, Damien Le Moal wrote:

Hi,

> On the SSD front, the NVMe Zoned Namespace standard is still a draft
> being worked on by the NVMe committee, and no devices are available on
> the market yet.

Anybody here who can tell why this could be useful? Can erase blocks be
made so enormously huge, and is there really a big gain in doing so,
which makes any practical difference?

Oh, BTW, since the write semantics seem so similar, why not treat them
similarly to raw flash?

--mtx
On 2019/12/17 21:34, Enrico Weigelt, metux IT consult wrote:
> On 16.12.19 10:35, Carlos Maiolino wrote:
>
> Hi,
>
>>> Just curious: what's the exact definition of "zoned" here?
>>> Something like partitions?
>>
>> Zones inside an SMR HDD.
>
> Oh, I wasn't aware that those things are exposed to the host at all.
> Are you dealing with host-managed SMR HDDs?

Yes. The host-managed models of SMR drives have become the de-facto
standard for enterprise applications because of their more predictable
performance compared to host-aware models. Many USB external disks these
days also use SMR, but drive-managed models. These are regular block
devices from the interface point of view: the host does not and cannot
see the "zones" of the disk. The SMR constraints are hidden by the device
firmware.

>> On an SMR HDD, each zone can only be written sequentially, due to physics
>> constraints. I won't post any link with references because I think majordomo
>> will spam my email if I do, but do a google search for something like 'SMR
>> HDD zones' and you'll get a better idea.
>
> Reminds me of classic CD-Rs or tapes. Why not deal with them similarly?

Because of the performance difference. Excluding any software/use
difference (i.e. GC overhead if needed), from a purely IO perspective,
SMR host-managed disks are as fast as regular disks and can handle
multiple streams simultaneously at high queue depth for better throughput
(think video surveillance applications or video streaming). That is not
the case for CDs or tapes. The performance difference with CDs and tapes,
leading to different possible workloads and usage patterns, is even more
pronounced with SSDs. In the end, only the write pattern looks similar to
CDs and tapes. Everything else is the same as a regular block device.
On 2019/12/17 22:05, Enrico Weigelt, metux IT consult wrote:
> On 17.12.19 01:26, Damien Le Moal wrote:
>
> Hi,
>
>> On the SSD front, the NVMe Zoned Namespace standard is still a draft
>> being worked on by the NVMe committee, and no devices are available on
>> the market yet.
>
> Anybody here who can tell why this could be useful?

To reduce device costs thanks to less flash over-provisioning needed
(leading to higher usable capacities), a simpler device firmware FTL
(leading to lower DRAM needs, so lower power and less heat), and higher
predictability of IO latencies. Yes, there is the sequential write
constraint (that's the "no free lunch" part of the picture), but many
workloads can accommodate this constraint (any video streaming
application, sensor logging, etc.).

> Can erase blocks be made so enormously huge, and is there really a big
> gain in doing so, which makes any practical difference?

Making the erase blocks enormous would likely lead to enormous zone
sizes, which is generally not desired as that becomes very costly if the
application/user needs to do GC on the zones. A balance is generally
reached here between HW media needs and usability.

> Oh, BTW, since the write semantics seem so similar, why not treat them
> similarly to raw flash?

This is the OpenChannel SSD model. It exists and is supported by Linux
(lightnvm). This model is however more complex due to the plethora of
parameters that the host can/needs to control. The zone model is much
simpler, and its application to NVMe with Zoned Namespace fits very well
into the block IO stack work that was done for SMR since kernel 4.10.
Another reason for choosing ZNS over OCSSD is that device vendors can
actually give guarantees for devices sold, as the device firmware retains
control over the flash cells' health management, which is much less the
case for OCSSD (the device health depends much more on what the user is
doing).

Best regards.
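For reference, a zone "erase" from user space boils down to a single call
to the BLKRESETZONE ioctl from the zone management ioctls mentioned
earlier in the thread; a minimal sketch (the helper name is mine):

#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/* Rewind the write pointer of one zone, discarding its data. */
int reset_zone(int fd, __u64 zone_start, __u64 zone_len)
{
	struct blk_zone_range range = {
		.sector = zone_start,	/* zone start, in 512B sectors */
		.nr_sectors = zone_len,	/* zone length, in 512B sectors */
	};

	return ioctl(fd, BLKRESETZONE, &range);
}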