Message ID | e5e0ab0429b1fc8a4e3f9614d2d1cc43dea78093.1668644705.git.daniel@makrotopia.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: uImage.FIT filesystem image mapper | expand |
On Thu, Nov 17, 2022 at 12:42:43AM +0000, Daniel Golle wrote: > In init/do_mounts.c there are helper functions devt_from_partuuid, > devt_from_partlabel and devt_from_devname. In order to make these > functions available to other users, move them to block/bdev.c. Yes, they should be moved and I have an old patch to do this a bit smarter. But that doesn't mean anyone else should call them.
Hi Christoph, On Wed, Nov 16, 2022 at 09:55:55PM -0800, Christoph Hellwig wrote: > On Thu, Nov 17, 2022 at 12:42:43AM +0000, Daniel Golle wrote: > > In init/do_mounts.c there are helper functions devt_from_partuuid, > > devt_from_partlabel and devt_from_devname. In order to make these > > functions available to other users, move them to block/bdev.c. > > Yes, they should be moved and I have an old patch to do this a bit > smarter. But that doesn't mean anyone else should call them. I have followed your advise and implemented a "real" block driver stacking on top of an existing block device to expose uImage.FIT filesystem subimages as block devices. It mangles any bio submitted to it and re-submits the bio with changed bi_bdev and sector offset added to bi_iter.bi_sector for all list elements. That works, but has slightly less utility value than the partition parser approach as in this way I cannot easily populate the PARTNAME uevent which can later help userspace to identify a device by the FIT subimage name -- I'd have to either invent a new bus_type or device_type, both seem inappropriate and have unwanted side effects. Or am I missing something and there is a way to use add_uevent_var() for a disk_type device? In exchange, instead of using the PARTNAME= uevent var, I can now freely name the device node by setting the disk_name string in struct gendisk. While this doesn't offer the same amount of freedom, it is sufficient for our use-case. However, I don't see a way to avoid using (or duplicating) devt_from_devname() to select the lower device somehow without having to probe and parse *all* block devices present (which is, from what I understood, what you want to avoid, and I agree that it is more safe to not do that...) Can you or anyone give some advise on how this should be done? As block partitions are not present in device tree, using a cmdline argument (with similar semantics to 'root=') specifying the lower block device seems unavoidable. Also note that due to libfdt not exporting symbols, the driver currently cannot be built as a module. Hence, I would either need to export most of libfdt API or only allow the driver to be built-into the kernel. I don't see a problem with having it always built-in. Yet another (imho not terrible) problem is removal of the lower device. Many of the supported SBC use a micro SD card to boot, which can be removed by the user while the system is running (which is generally not a good idea, but anyway). For partitions this is handled automatically by blk_drop_partitions() called directly from genhd.c. I'm currently playing with doing something similar using the bus device removal notification, but it doesn't seem to work for all cases, e.g. mmcblk device do not seem to have the ->bus pointer populated at all (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices). Any ideas regarding that? Thank you for your advise! Daniel
On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote: > That works, but has slightly less utility value than the partition > parser approach as in this way I cannot easily populate the PARTNAME > uevent which can later help userspace to identify a device by the FIT > subimage name -- I'd have to either invent a new bus_type or > device_type, both seem inappropriate and have unwanted side effects. > Or am I missing something and there is a way to use add_uevent_var() > for a disk_type device? You're not exposing a partition here - this is an image format that sits in a partition and we should not pretend that is a partition. > However, I don't see a way to avoid using (or duplicating) > devt_from_devname() to select the lower device somehow without having > to probe and parse *all* block devices present (which is, from what I > understood, what you want to avoid, and I agree that it is more safe to > not do that...) > > Can you or anyone give some advise on how this should be done? Just set the block driver up from an initramfs, like we do for all modern stackable drivers. > Yet another (imho not terrible) problem is removal of the lower device. > Many of the supported SBC use a micro SD card to boot, which can be > removed by the user while the system is running (which is generally not > a good idea, but anyway). For partitions this is handled automatically > by blk_drop_partitions() called directly from genhd.c. > I'm currently playing with doing something similar using the bus device > removal notification, but it doesn't seem to work for all cases, e.g. > mmcblk device do not seem to have the ->bus pointer populated at all > (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices). I have WIP patches that allow the claimer of a block device get resize and removal notification. It's not going to land for 6.2, but I hope I have it ready in time for the next merge window.
Hi Christoph, On Tue, Nov 22, 2022 at 04:37:08AM -0800, Christoph Hellwig wrote: > On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote: > > That works, but has slightly less utility value than the partition > > parser approach as in this way I cannot easily populate the PARTNAME > > uevent which can later help userspace to identify a device by the FIT > > subimage name -- I'd have to either invent a new bus_type or > > device_type, both seem inappropriate and have unwanted side effects. > > Or am I missing something and there is a way to use add_uevent_var() > > for a disk_type device? > > You're not exposing a partition here - this is an image format that > sits in a partition and we should not pretend that is a partition. It doesn't need to be literally the PARTNAME uevent, just any way to communicate the names of mapped subimages to userspace. My understanding by now is that there is no way around introducing a new device_type and then mitigate the unwanted side effects by follow-up changes, ie. make it possible to use that new device_type when specifying the rootfs= cmdline variable (currently only disks and partitions are considered there). Or give up on the idea that uImage.FIT subimages mapped by the new driver can be identified by userspace by poking uevent from sysfs and just rely on convention and ordering. > > > However, I don't see a way to avoid using (or duplicating) > > devt_from_devname() to select the lower device somehow without having > > to probe and parse *all* block devices present (which is, from what I > > understood, what you want to avoid, and I agree that it is more safe to > > not do that...) > > > > Can you or anyone give some advise on how this should be done? > > Just set the block driver up from an initramfs, like we do for all > modern stackable drivers. Instead of using a kernel cmdline parameter we could also have the bootloader embed that information as string in the 'chosen' section in the device tree blob, right next to the cmdline. However, as there is no representation of block partitions in device tree, also in that case the lower device will have to be referenced by a string somehow, ie. devt_from_devname() or the like will be needed. Needing an initramfs, even if it boils down to just one statically compile executable, is a massive bloat and complication when building embedded device firmware and none of the over 1580 devices currently supported by OpenWrt need an intermediate initramfs to mount their on-flash squashfs rootfs (some, however, already use this uImage.FIT partition parser, and not needing a downstream patch for that would be nice). uImage.FIT typically contains the complete firmware used on an embedded device, ie. at least a Linux kernel, device tree blob and a filesystem. The main use of this whole uImage.FIT-parsing-in-Linux approach I'm trying to get across here is to expose one or more 'filesystem'-type subimages of such an image as block devices, also so that one of them can directly be mounted as rootfs by the kernel. This *replaces* the use of 'ramdisk' type sub-images which need to remain allocated at runtime, while using a squashfs 'filesystem' type sub-image allows freeing the filesystem cache if ram is becomes scarce. As both, storage and memory, are often very limited on small embedded devices, OpenWrt has always been using a squashfs as rootfs with a storage-type specific filesytem used as r/w overlay on top. Up to now, the rootfs is often stored in platform-specific ways, ie. an additional partition on block devices, MTD partition on NOR flash or UBI volume on NAND flash. Carrying the read-only squashfs filesystem inside the uImage.FIT structure has the advantage of being agnostic regarding the storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows the bootloader to validate the filesystem hash before starting the kernel, ie. ensuring integrity of the firmware as-a-whole which includes the root filesystem. > > > Yet another (imho not terrible) problem is removal of the lower device. > > Many of the supported SBC use a micro SD card to boot, which can be > > removed by the user while the system is running (which is generally not > > a good idea, but anyway). For partitions this is handled automatically > > by blk_drop_partitions() called directly from genhd.c. > > I'm currently playing with doing something similar using the bus device > > removal notification, but it doesn't seem to work for all cases, e.g. > > mmcblk device do not seem to have the ->bus pointer populated at all > > (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices). > > I have WIP patches that allow the claimer of a block device get > resize and removal notification. It's not going to land for 6.2, > but I hope I have it ready in time for the next merge window. I'm looking forward to integrate that in the uImage.FIT block driver I've been working on. In the meantime, should I already post my current draft so we can start discussing if that solution could be acceptable? Best regards Daniel
On Fri, Dec 09, 2022 at 05:00:34PM +0000, Daniel Golle wrote: > It doesn't need to be literally the PARTNAME uevent, just any way to > communicate the names of mapped subimages to userspace. > > My understanding by now is that there is no way around introducing a > new device_type and then mitigate the unwanted side effects by > follow-up changes, ie. make it possible to use that new device_type > when specifying the rootfs= cmdline variable (currently only disks and > partitions are considered there). > > Or give up on the idea that uImage.FIT subimages mapped by the new > driver can be identified by userspace by poking uevent from sysfs and > just rely on convention and ordering. To be honest I'm still really confused by the patch set. What is the problem with simply using an initramfs, which is the way to deal with any non-trivial init issue for about 20 years now, predated by the somewhat more awkware initrd... > Needing an initramfs, even if it boils down to just one statically > compile executable, is a massive bloat and complication when building > embedded device firmware and none of the over 1580 devices currently > supported by OpenWrt need an intermediate initramfs to mount their > on-flash squashfs rootfs (some, however, already use this uImage.FIT > partition parser, and not needing a downstream patch for that would be > nice). You can always build the initrams into the kernel image, I don't see how that is bloat. > Carrying the read-only squashfs filesystem inside the uImage.FIT > structure has the advantage of being agnostic regarding the > storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows > the bootloader to validate the filesystem hash before starting the > kernel, ie. ensuring integrity of the firmware as-a-whole which > includes the root filesystem. What is the point of the uImage.FIT? Why can't you use a sane format? Or at least not use it on top of partitions, which is just braindead. Who actually came up with this amazingly convoluted and annoying scheme and why?
On Mon, Dec 12, 2022 at 01:03:09AM -0800, Christoph Hellwig wrote: > On Fri, Dec 09, 2022 at 05:00:34PM +0000, Daniel Golle wrote: > > It doesn't need to be literally the PARTNAME uevent, just any way to > > communicate the names of mapped subimages to userspace. > > > > My understanding by now is that there is no way around introducing a > > new device_type and then mitigate the unwanted side effects by > > follow-up changes, ie. make it possible to use that new device_type > > when specifying the rootfs= cmdline variable (currently only disks and > > partitions are considered there). > > > > Or give up on the idea that uImage.FIT subimages mapped by the new > > driver can be identified by userspace by poking uevent from sysfs and > > just rely on convention and ordering. > > To be honest I'm still really confused by the patch set. What is > the problem with simply using an initramfs, which is the way to > deal with any non-trivial init issue for about 20 years now, predated > by the somewhat more awkware initrd... The thing is that there isn't anything extraordinarily complex here, just dynamically partitioning a block device based on a well-known format. > > > Needing an initramfs, even if it boils down to just one statically > > compile executable, is a massive bloat and complication when building > > embedded device firmware and none of the over 1580 devices currently > > supported by OpenWrt need an intermediate initramfs to mount their > > on-flash squashfs rootfs (some, however, already use this uImage.FIT > > partition parser, and not needing a downstream patch for that would be > > nice). > > You can always build the initrams into the kernel image, I don't > see how that is bloat. Using initramfs implies that we would need a 2nd copy of the standard C library and libfdt, both alone will already occupy more than just a single 64kB block of flash. I understand that from the point of view of classic x86 servers or even embedded devices with eMMC this seems negligible. However, keep in mind that a huge number of existing devices and also new designs of embedded devices often boot from just a few megabytes of NOR flash, and there every byte counts. > > > Carrying the read-only squashfs filesystem inside the uImage.FIT > > structure has the advantage of being agnostic regarding the > > storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows > > the bootloader to validate the filesystem hash before starting the > > kernel, ie. ensuring integrity of the firmware as-a-whole which > > includes the root filesystem. > > What is the point of the uImage.FIT? It is the format used by Das U-Boot, which is by far the most common bootloader found on small embedded devices running Linux. Is is already used by Das U-Boot to validate and load kernel, devicetree, initramfs, ... to RAM before launching Linux. I've included a link to the documentation[1] which gives insights regarding the motivation to create such a format. > Why can't you use a sane format? Define "sane format". :) And tell that to the people over at Das U-Boot please. I'm just trying to make best use of the status-quo which is that uImage.FIT is the de-facto standard to boot Linux on small embedded devices. > Or at least not use it on top of partitions, which is just > braindead. In fact, that's only one out of three possible uses in which parsing the contained sub-image boundaries can be useful: * On devices with NOR flash uImage.FIT is stored in an MTD partition, hence the uImage.FIT partition parser (or small stackable block driver) would then operate on top of /dev/mtdblockX. * On devices with NAND flash uImage.FIT is stored in a UBI volume, hence in this case /dev/ubiblockX needs to be processed. * On devices with block storage (like eMMC or SD card) the image is stored on a (GPT) partition (/dev/mmcblkXpY). Generally speaking, when it comes to storing the rootfs (typically squashfs), this can be solved in two ways: a) keep the rootfs inside the uImage.FIT structure (to be then dealt with by this patch series) b) store the rootfs on an additional partition or volume: - additional GPT partition on block devices - additional MTD partition on devices with NOR flash - additional UBI volume on devices with NAND flash Now lets look at the consequences of each approach: I. It is often a requirement that the bootloader shall decide about the integrity of the firmware before starting it (ie. verify hashes of kernel, dtb, **and rootfs**). In case of approch b) this will have to be implemented in a custom, platform-specific way; many vendors do things like this, resulting in huge variety of different, customs ways which let the bootloader validate the rootfs. As a huge majority of devices actually use Das U-Boot which already supports validating uImage.FIT content, this becomes a no-brainer when using approach a). II. Updating the firmware in case of approach a) being used is very simple: Just write the whole image to a storage volume. In case of approach b) an archive or tarball has to be used, allowing kernel.uImage and rootfs to each be written to different storage volumes. And for the special case of NOR flash, developers came up with a complete out-of-tree subsystem handling only the splitting of MTD partitions in the various ways used by vendor-modified bootloaders [2]. III. Approach a) allows the very same uImage.FIT binary to be read by the bootloader, used as firmware-upgrade format within Linux userland and also serve as a recovery image format which can be loaded e.g. via TFTP and stored by the bootloader. Approach b) will require different formats for firmware upgrades (eg. a tarball) and storage on-disk, depending on the platform and type of storage media. And when it comes to bootloader-level recovery via TFTP, one would have to implement extraction of that tarball **by the bootloader** or use yet another format. IV. When it comes to devices allowing to boot from different media, approach a) allows using the exact same firmware image independently of the storage media used for booting. When using approach b), different firmware images have to be provided **for the same device**, depending on whether a block device, NAND/UBI or NOR/MTD is used for booting. The BananaPi BPi-R64 and more recent BananaPi BPi-R3 boards are good examples for boards which allow booting of different media, and are already using uImage.FIT as a unified format. I hope this explains my motivation. Please ask should there by any doubts or if any of my explainations above are not clear. > Who actually came up with this amazingly convoluted and > annoying scheme and why? This may answer your question: [1]: https://source.denx.de/u-boot/u-boot/-/blob/master/doc/uImage.FIT/howto.txt [2]: https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=target/linux/generic/files/drivers/mtd/mtdsplit
On Mon, Dec 12, 2022 at 11:02:33AM +0000, Daniel Golle wrote: > The thing is that there isn't anything extraordinarily complex here, > just dynamically partitioning a block device based on a well-known > format. Yes, but a completely non-standard format that nests inside an partition. > Using initramfs implies that we would need a 2nd copy of the standard C > library and libfdt, both alone will already occupy more than just a > single 64kB block of flash. Why do you need libfdt? And with a simple statically linked kpartx you won't pull in much of libc either. > I understand that from the point of view of > classic x86 servers or even embedded devices with eMMC this seems > negligible. However, keep in mind that a huge number of existing > devices and also new designs of embedded devices often boot from just a > few megabytes of NOR flash, and there every byte counts. So I've worked quite a bit on really small deeply embedded systems, and for those I wouldn't even think of using strange image formats or the rather wasteful GPT partition format. There we wouldn't dare to use paritions or weird image formats, but > > What is the point of the uImage.FIT? > > It is the format used by Das U-Boot, which is by far the most common > bootloader found on small embedded devices running Linux. > Is is already used by Das U-Boot to validate and load kernel, > devicetree, initramfs, ... to RAM before launching Linux. > I've included a link to the documentation[1] which gives insights > regarding the motivation to create such a format. That doesn't explain why you'd want to use it. Nor how people came up with it. > In fact, that's only one out of three possible uses in which parsing > the contained sub-image boundaries can be useful: > * On devices with NOR flash uImage.FIT is stored in an MTD partition, > hence the uImage.FIT partition parser (or small stackable block > driver) would then operate on top of /dev/mtdblockX. > > * On devices with NAND flash uImage.FIT is stored in a UBI volume, > hence in this case /dev/ubiblockX needs to be processed. And all the mtdblock / ubiblock is due to the lack of a native mtd/ubi backend for squashfs? Why can't we take the block layer out of the loop entirely? > I hope this explains my motivation. Please ask should there by any > doubts or if any of my explainations above are not clear. None of this explains the silly nesting inside the GPT partition. It is not needed for the any use cases and the root probem here. Without that you could simply implement a parition format, with that you get into crazy nesting behavior. Note that it would have any benefit over just not doing this silly image. Maybe someone just needs to go back and come up wit ha scheme that actually works and implement that in uboot as well.
On Mon, Dec 12, 2022 at 10:48:12PM -0800, Christoph Hellwig wrote: > On Mon, Dec 12, 2022 at 11:02:33AM +0000, Daniel Golle wrote: > > The thing is that there isn't anything extraordinarily complex here, > > just dynamically partitioning a block device based on a well-known > > format. > > Yes, but a completely non-standard format that nests inside an > partition. The reason for this current discussion (see subject line) is exactly that you didn't like the newly introduced partition type GUID which then calls the newly introduced partition parser taking care of the uImage.FIT content of a partition. Instead you suggested to implement it as a small stackable block driver, which I did (and will post as RFC in a moment from now). This block driver (if built-into the kernel and relied upon to expose the block device used as root filesystem) will need to identify the lower device it should work on. And for that the helper functions such as devt_from_devname() need to be available for that driver. > > > Using initramfs implies that we would need a 2nd copy of the standard C > > library and libfdt, both alone will already occupy more than just a > > single 64kB block of flash. > > Why do you need libfdt? uImage.FIT is based on FDT, so the easiest way to parse this format without reinventing the wheel is using libfdt. > And with a simple statically linked kpartx you won't pull in much of > libc either. Well, enough to have kpartx and libfdt covered. I still assume that's more than 64KiB of compressed bytecode in total. > > > I understand that from the point of view of > > classic x86 servers or even embedded devices with eMMC this seems > > negligible. However, keep in mind that a huge number of existing > > devices and also new designs of embedded devices often boot from just a > > few megabytes of NOR flash, and there every byte counts. > > So I've worked quite a bit on really small deeply embedded systems, > and for those I wouldn't even think of using strange image formats > or the rather wasteful GPT partition format. > There we wouldn't dare to use paritions or weird image formats, but Well, the reality as of today is that even the smallest of all embedded devices will need updated firmware at some point. Using an image format which allows to dynamically partition a firmware image into kernel and rootfs parts is needed **especially** on devices with very little storage space -- it's there that we cannot afford to use hard-coded MTD partition sizes either limiting or overestimating the future kernel size. And having a unified format which serves a whole range of devices does make that easier -- surely, on NOR flash, there would not be a GPT partition; mtdblock on top of an MTD partition is used in that case. > > > > What is the point of the uImage.FIT? > > > > It is the format used by Das U-Boot, which is by far the most common > > bootloader found on small embedded devices running Linux. > > Is is already used by Das U-Boot to validate and load kernel, > > devicetree, initramfs, ... to RAM before launching Linux. > > I've included a link to the documentation[1] which gives insights > > regarding the motivation to create such a format. > > That doesn't explain why you'd want to use it. Nor how people > came up with it. I didn't want to quote all the history here, but it is explained rather well in their documentation I had linked there. > > > In fact, that's only one out of three possible uses in which parsing > > the contained sub-image boundaries can be useful: > > * On devices with NOR flash uImage.FIT is stored in an MTD partition, > > hence the uImage.FIT partition parser (or small stackable block > > driver) would then operate on top of /dev/mtdblockX. > > > > * On devices with NAND flash uImage.FIT is stored in a UBI volume, > > hence in this case /dev/ubiblockX needs to be processed. > > And all the mtdblock / ubiblock is due to the lack of a native > mtd/ubi backend for squashfs? Why can't we take the block layer > out of the loop entirely? A block representation is the common denominator of all the above. Sure, I could implement splitting MTD devices according to uImage.FIT and then add MTD support to squashfs. Then implement splitting of UBI volumes and add UBI support to squashfs. However, simply using mtdblock on NOR and ubiblock on NAND seem well-suited and hence there wasn't ever a need to implement MTD or UBI support for any **read-only** filesystem. Afaik that is what mtdblock and ubiblock are intended to be used for. > > I hope this explains my motivation. Please ask should there by any > > doubts or if any of my explainations above are not clear. > > None of this explains the silly nesting inside the GPT partition. > It is not needed for the any use cases and the root probem here. So where would you store the uImage (which will have to exist even to just load kernel and DTB in U-Boot, even without containing the root filesystem) on devices with eMMC then? I did consider block2mtd on eMMC and gluebi (which is deprecated) on UBI, but in order to mount the filesystem I'd then use mtdblock further down the road, and that'd actually be silly. > Without that you could simply implement a parition format, with that > you get into crazy nesting behavior. Note that it would have > any benefit over just not doing this silly image. Are you suggesting to come up with an entirely new type of partition table only for that purpose? Which will require its own tools and implementation in both, U-Boot and Linux? What would be the benefit over just using GPT partitioning? I'm not saying that this cannot be done (and I will do it if you convince me that it'd be needed). > > Maybe someone just needs to go back and come up wit ha scheme that > actually works and implement that in uboot as well. In terms of "works" both are working very well, the uImage.FIT partition parser as well as the tiny stackable block driver you had asked me to write instead. I've tested both on multiple devices by now, the former is even used in production on one of the currently most popular devices running OpenWrt (according to update backend stats). Thank you for your patience! Daniel
On Tue, Dec 13, 2022 at 12:45:27PM +0000, Daniel Golle wrote: > > Yes, but a completely non-standard format that nests inside an > > partition. > > The reason for this current discussion (see subject line) is exactly > that you didn't like the newly introduced partition type GUID which > then calls the newly introduced partition parser taking care of the > uImage.FIT content of a partition. Which is the exact nesting I'm complaining about. Why do you need to use your format inside a GPT partition table? What you're doing is bascially nesting a partition table format inside another one, which doesn't make any sense at all. > This block driver (if built-into the kernel and relied upon to expose > the block device used as root filesystem) will need to identify the > lower device it should work on. And for that the helper functions such > as devt_from_devname() need to be available for that driver. And devt_from_devname must not be used by more non-init code. It is bad it got exposed at all, but new users are not acceptable. > A block representation is the common denominator of all the > above. Sure, I could implement splitting MTD devices according to > uImage.FIT and then add MTD support to squashfs. Then implement > splitting of UBI volumes and add UBI support to squashfs. Implementing MTD and/or UBI support would allow you to build a kernel without CONFIG_BLOCK, which will save you a lot more than the 64k you were whining about above. > > None of this explains the silly nesting inside the GPT partition. > > It is not needed for the any use cases and the root probem here. > > So where would you store the uImage (which will have to exist > even to just load kernel and DTB in U-Boot, even without containing > the root filesystem) on devices with eMMC then? Straight on the block device, where else? > Are you suggesting to come up with an entirely new type of partition > table only for that purpose? Which will require its own tools and > implementation in both, U-Boot and Linux? What would be the benefit > over just using GPT partitioning? Why do you need another layer of partitioning instead of storing all your information either in the uImage, or in some other partition format of your choice? See, if you have GPT, DOS or whatever partitions, you just use partitions and store all the bits your care about in them. If you want a fancy not invented here syndrome image format you use that. But don't use both.
On Wed, Dec 14, 2022 at 08:43:29AM -0800, Christoph Hellwig wrote: > On Tue, Dec 13, 2022 at 12:45:27PM +0000, Daniel Golle wrote: > > > Yes, but a completely non-standard format that nests inside an > > > partition. > > > > The reason for this current discussion (see subject line) is exactly > > that you didn't like the newly introduced partition type GUID which > > then calls the newly introduced partition parser taking care of the > > uImage.FIT content of a partition. > > Which is the exact nesting I'm complaining about. Why do you need > to use your format inside a GPT partition table? The GPT partition table is typically written only once to an eMMC- based device in factory. Firmware images (typically uImage.FIT) are stored in partitions because there are sometimes two of them (for A/B dual-boot, or recovery/production dual-boot). As a working device firmware consists of kernel, dtb and rootfs, all these three things have to be written and used together, typically they also come together in one file for firmware upgrade (ie. rootfs appended to kernel, tarballs, or uImage.FIT containing all three of them). As the size of kernel and rootfs cannot be determined accurately at the time the device is made, having individual GPT partitions for kernel and rootfs ends up to either being a limit to future groth of the kernel image or wastes space by overestimating the kernel size. Changing the GPT partitioning when updating the device to match the exact sizes is also not an option as a damage to the GPT would then present a single point of failure (backup GPT also wouldn't help much here), so for dual-boot to actually be meaningful, we shouldn't ever write to any parts of the disk/flash which affect more than one of the dual-boot options. > What you're doing is bascially nesting a partition table format > inside another one, which doesn't make any sense at all. See the last paragraph of this message for good reasons why one would want to do exactly that. > > > This block driver (if built-into the kernel and relied upon to expose > > the block device used as root filesystem) will need to identify the > > lower device it should work on. And for that the helper functions such > > as devt_from_devname() need to be available for that driver. > > And devt_from_devname must not be used by more non-init code. It is > bad it got exposed at all, but new users are not acceptable. I assume that implementing anything similar using blk_lookup_devt in the driver itself is then also not acceptable, right? Yet another option would be to implement a way to acquire this information from device tree. Ie. have a reference to the disk device as well as an unsigned integer in the 'chosen' node which the bootloader can use to communicate this to the kernel. Example: chosen { bootdev = <&mmc0 3>; }; It's a bit more tricky for ubiblock or mtdblock devices because they don't have *any* representation in device tree at all at this point. In case of an MTD partition (for mtdblockX) we would just reference the mtd partition in the same way. To do this cleanly with NAND/UBI, I'd start with adding device-tree-based attaching of an MTD partition to UBI using a device-tree attribute 'compatible = "linux,ubi"' on the MTD partition. We could then have sub-nodes referencing specific UBI volumes, to select them for use with ubiblock but also for those nodes then being a valid reference for use with the to-be-introduced 'bootdev' attribute in 'chosen'. Does that sound acceptable from your perspective? > > > A block representation is the common denominator of all the > > above. Sure, I could implement splitting MTD devices according to > > uImage.FIT and then add MTD support to squashfs. Then implement > > splitting of UBI volumes and add UBI support to squashfs. > > Implementing MTD and/or UBI support would allow you to build a > kernel without CONFIG_BLOCK, which will save you a lot more than > the 64k you were whining about above. Even devices with NOR flash may still want support for removable block devices like USB pendrives or SD cards... Many home-routers got only 8MiB of NOR flash and yet come with USB 2.0 ports intended for a pendrive which is then shared via Samba. Also, heavily customzied per-device kernel builds would never scale up to support thousands of devices -- hence OpenWrt uses the exact same kernel build for many devices, which makes both, the build process and also debugging kernel bugs, much much easier (or even doable at all). > > > > None of this explains the silly nesting inside the GPT partition. > > > It is not needed for the any use cases and the root probem here. > > > > So where would you store the uImage (which will have to exist > > even to just load kernel and DTB in U-Boot, even without containing > > the root filesystem) on devices with eMMC then? > > Straight on the block device, where else? As the first few blocks are typically used for bootloader code and bootloader environment, we would then need to hard-code the offset(s) of the uImage.FIT on the block device. Imho this becomes messy and just using partitions seemed like a straight forward solution. And what about dual-boot systems where you have more than one firmware image? Hard-code more offsets? For each device? In a way, I was considering this by using blkdevparts= cmdline option instead of GPT, but > > > Are you suggesting to come up with an entirely new type of partition > > table only for that purpose? Which will require its own tools and > > implementation in both, U-Boot and Linux? What would be the benefit > > over just using GPT partitioning? > > Why do you need another layer of partitioning instead of storing > all your information either in the uImage, or in some other > partition format of your choice? The reason is the different life-cycle of the device main partition table, bootloader, bootloader environment, ... on one hand and each firmware image on a dual boot system on the other hand. Hence there is more than just one uImage: typically bootloader, bootloader environment, firmware A (uImage.FIT) and firmware B. Relace "A" and "B" with "recovery" and "production", depending on the dual-boot style implemented. Therefore re-writing the whole disk during firmware upgrades is not an option because it is risky, eg. in case of a sudden power failure we could end up with a hard-bricked system. So to me it makes sense that for a firmware upgrade, we write only to one partition and don't touch GPT or anything else on the device. So in case something goes wrong, the device will still boot, the bootloader will realize that the uImage.FIT in one partition is broken (uImage.FIT also comes with hashes to ensure image integrity) and it will load something else (from another partition) instead.
On Wed, Dec 14, 2022 at 08:01:48PM +0000, Daniel Golle wrote: > I assume that implementing anything similar using blk_lookup_devt in > the driver itself is then also not acceptable, right? No. blk_lookup_devt is a workaround for the early init code where the file system is not available yet. The only reasonable use case is the early init code, but even that is better handled by an initramfs. > Yet another option would be to implement a way to acquire this > information from device tree. Ie. have a reference to the disk device > as well as an unsigned integer in the 'chosen' node which the > bootloader can use to communicate this to the kernel. Example: > chosen { > bootdev = <&mmc0 3>; > }; > > It's a bit more tricky for ubiblock or mtdblock devices because they > don't have *any* representation in device tree at all at this point. > > In case of an MTD partition (for mtdblockX) we would just reference > the mtd partition in the same way. > > To do this cleanly with NAND/UBI, I'd start with adding > device-tree-based attaching of an MTD partition to UBI using a > device-tree attribute 'compatible = "linux,ubi"' on the MTD partition. > We could then have sub-nodes referencing specific UBI volumes, to > select them for use with ubiblock but also for those nodes then > being a valid reference for use with the to-be-introduced 'bootdev' > attribute in 'chosen'. > > Does that sound acceptable from your perspective? Device tree questions are really out of my knowledge and domain, the right people to talk to would be the device-tree and relevant driver maintainers. > Even devices with NOR flash may still want support for removable > block devices like USB pendrives or SD cards... Many home-routers > got only 8MiB of NOR flash and yet come with USB 2.0 ports intended > for a pendrive which is then shared via Samba. Ok, so we're not _that_ resource contrained at all, because for the really small devices with just a few MB of ram, not having the block layer makes a huge difference. > As the first few blocks are typically used for bootloader code and > bootloader environment, we would then need to hard-code the offset(s) > of the uImage.FIT on the block device. Imho this becomes messy and just > using partitions seemed like a straight forward solution. Ok, so the whole DOS-World boot load nightmare has spread to these devices as well. > And what about dual-boot systems where you have more than one firmware > image? Hard-code more offsets? For each device? Is this for a fail save use case where on image is update while the other on is in use? I've usually seen people use two different NOR chips for that to have full error isolation. > The reason is the different life-cycle of the device main partition > table, bootloader, bootloader environment, ... on one hand and each > firmware image on a dual boot system on the other hand. Oh, so the firmware image doesn't even include the bootloader, but the bootload is on the same device? That just seems like a very odd setup, but I'll take your word for it being used. > Therefore re-writing the whole disk during firmware upgrades is not an > option because it is risky, eg. in case of a sudden power failure we > could end up with a hard-bricked system. So to me it makes sense that > for a firmware upgrade, we write only to one partition and don't touch > GPT or anything else on the device. So in case something goes wrong, > the device will still boot, the bootloader will realize that the > uImage.FIT in one partition is broken (uImage.FIT also comes with > hashes to ensure image integrity) and it will load something else (from > another partition) instead. I'm just really confused about this whole setup. Maybe it's just because what I've worked with, which is either really deeply embededd devices where the everything is one image, that you might in some case have twice, or a full blown UEFI-like setup where the boot loader is separate, but can simply load a kernel from the actual root file system. I have to say even after all these round trip I'm still not sure what problem where really solving. In basically all other environments that have a powerful enough bootloader to read file systems your rootfs would simply store the kernel image and dtb if they are interdependent. Is the root cause that uboot doesn't support reading files from squashfs despite supporting half a dozen other file systems?
Hi Christoph, On Tue, Nov 22, 2022 at 04:37:08AM -0800, Christoph Hellwig wrote: > On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote: > > [...] > > Yet another (imho not terrible) problem is removal of the lower device. > > Many of the supported SBC use a micro SD card to boot, which can be > > removed by the user while the system is running (which is generally not > > a good idea, but anyway). For partitions this is handled automatically > > by blk_drop_partitions() called directly from genhd.c. > > I'm currently playing with doing something similar using the bus device > > removal notification, but it doesn't seem to work for all cases, e.g. > > mmcblk device do not seem to have the ->bus pointer populated at all > > (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices). > > I have WIP patches that allow the claimer of a block device get > resize and removal notification. It's not going to land for 6.2, > but I hope I have it ready in time for the next merge window. Any news about that patchset? I'd happily review, test and use it ;)
diff --git a/block/bdev.c b/block/bdev.c index d699ecdb3260..92e3e5c81337 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1092,3 +1092,169 @@ void bdev_statx_dioalign(struct inode *inode, struct kstat *stat) blkdev_put_no_open(bdev); } + +struct uuidcmp { + const char *uuid; + int len; +}; + +/** + * match_dev_by_uuid - callback for finding a partition using its uuid + * @dev: device passed in by the caller + * @data: opaque pointer to the desired struct uuidcmp to match + * + * Returns 1 if the device matches, and 0 otherwise. + */ +static int match_dev_by_uuid(struct device *dev, const void *data) +{ + struct block_device *bdev = dev_to_bdev(dev); + const struct uuidcmp *cmp = data; + + if (!bdev->bd_meta_info || + strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len)) + return 0; + return 1; +} + +/** + * devt_from_partuuid - looks up the dev_t of a partition by its UUID + * @uuid_str: char array containing ascii UUID + * + * The function will return the first partition which contains a matching + * UUID value in its partition_meta_info struct. This does not search + * by filesystem UUIDs. + * + * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be + * extracted and used as an offset from the partition identified by the UUID. + * + * Returns the matching dev_t on success or 0 on failure. + */ +dev_t devt_from_partuuid(const char *uuid_str, int *root_wait) +{ + struct uuidcmp cmp; + struct device *dev = NULL; + dev_t devt = 0; + int offset = 0; + char *slash; + + cmp.uuid = uuid_str; + + slash = strchr(uuid_str, '/'); + /* Check for optional partition number offset attributes. */ + if (slash) { + char c = 0; + + /* Explicitly fail on poor PARTUUID syntax. */ + if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1) + goto clear_root_wait; + cmp.len = slash - uuid_str; + } else { + cmp.len = strlen(uuid_str); + } + + if (!cmp.len) + goto clear_root_wait; + + dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid); + if (!dev) + return 0; + + if (offset) { + /* + * Attempt to find the requested partition by adding an offset + * to the partition number found by UUID. + */ + devt = part_devt(dev_to_disk(dev), + dev_to_bdev(dev)->bd_partno + offset); + } else { + devt = dev->devt; + } + + put_device(dev); + return devt; + +clear_root_wait: + pr_err("VFS: PARTUUID= is invalid.\n" + "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n"); + if (root_wait && *root_wait) { + pr_err("Disabling rootwait; root= is invalid.\n"); + *root_wait = 0; + } + return 0; +} +EXPORT_SYMBOL_GPL(devt_from_partuuid); + +/** + * match_dev_by_label - callback for finding a partition using its label + * @dev: device passed in by the caller + * @data: opaque pointer to the label to match + * + * Returns 1 if the device matches, and 0 otherwise. + */ +static int match_dev_by_label(struct device *dev, const void *data) +{ + struct block_device *bdev = dev_to_bdev(dev); + const char *label = data; + + if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname)) + return 0; + return 1; +} + +dev_t devt_from_partlabel(const char *label) +{ + struct device *dev; + dev_t devt = 0; + + dev = class_find_device(&block_class, NULL, label, &match_dev_by_label); + if (dev) { + devt = dev->devt; + put_device(dev); + } + + return devt; +} +EXPORT_SYMBOL_GPL(devt_from_partlabel); + +dev_t devt_from_devname(const char *name) +{ + dev_t devt = 0; + int part; + char s[32]; + char *p; + + if (strlen(name) > 31) + return 0; + strcpy(s, name); + for (p = s; *p; p++) { + if (*p == '/') + *p = '!'; + } + + devt = blk_lookup_devt(s, 0); + if (devt) + return devt; + + /* + * Try non-existent, but valid partition, which may only exist after + * opening the device, like partitioned md devices. + */ + while (p > s && isdigit(p[-1])) + p--; + if (p == s || !*p || *p == '0') + return 0; + + /* try disk name without <part number> */ + part = simple_strtoul(p, NULL, 10); + *p = '\0'; + devt = blk_lookup_devt(s, part); + if (devt) + return devt; + + /* try disk name without p<part number> */ + if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p') + return 0; + p[-1] = '\0'; + return blk_lookup_devt(s, part); +} +EXPORT_SYMBOL_GPL(devt_from_devname); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index fb27dfaaaf85..b45cdcdccc6d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1494,6 +1494,9 @@ int truncate_bdev_range(struct block_device *bdev, fmode_t mode, loff_t lstart, loff_t lend); #ifdef CONFIG_BLOCK +dev_t devt_from_partuuid(const char *uuid_str, int *root_wait); +dev_t devt_from_partlabel(const char *label); +dev_t devt_from_devname(const char *name); void invalidate_bdev(struct block_device *bdev); int sync_blockdev(struct block_device *bdev); int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend); @@ -1502,6 +1505,18 @@ void sync_bdevs(bool wait); void bdev_statx_dioalign(struct inode *inode, struct kstat *stat); void printk_all_partitions(void); #else +static inline dev_t devt_from_partuuid(const char *uuid_str, int *root_wait) +{ + return MKDEV(0, 0); +} +static inline dev_t devt_from_partlabel(const char *label); +{ + return MKDEV(0, 0); +} +static inline dev_t devt_from_devname(const char *name); +{ + return MKDEV(0, 0); +} static inline void invalidate_bdev(struct block_device *bdev) { } diff --git a/init/do_mounts.c b/init/do_mounts.c index 811e94daf0a8..d55e34994f18 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include <linux/module.h> #include <linux/sched.h> +#include <linux/blkdev.h> #include <linux/ctype.h> #include <linux/fd.h> #include <linux/tty.h> @@ -60,169 +61,6 @@ static int __init readwrite(char *str) __setup("ro", readonly); __setup("rw", readwrite); -#ifdef CONFIG_BLOCK -struct uuidcmp { - const char *uuid; - int len; -}; - -/** - * match_dev_by_uuid - callback for finding a partition using its uuid - * @dev: device passed in by the caller - * @data: opaque pointer to the desired struct uuidcmp to match - * - * Returns 1 if the device matches, and 0 otherwise. - */ -static int match_dev_by_uuid(struct device *dev, const void *data) -{ - struct block_device *bdev = dev_to_bdev(dev); - const struct uuidcmp *cmp = data; - - if (!bdev->bd_meta_info || - strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len)) - return 0; - return 1; -} - -/** - * devt_from_partuuid - looks up the dev_t of a partition by its UUID - * @uuid_str: char array containing ascii UUID - * - * The function will return the first partition which contains a matching - * UUID value in its partition_meta_info struct. This does not search - * by filesystem UUIDs. - * - * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be - * extracted and used as an offset from the partition identified by the UUID. - * - * Returns the matching dev_t on success or 0 on failure. - */ -static dev_t devt_from_partuuid(const char *uuid_str) -{ - struct uuidcmp cmp; - struct device *dev = NULL; - dev_t devt = 0; - int offset = 0; - char *slash; - - cmp.uuid = uuid_str; - - slash = strchr(uuid_str, '/'); - /* Check for optional partition number offset attributes. */ - if (slash) { - char c = 0; - - /* Explicitly fail on poor PARTUUID syntax. */ - if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1) - goto clear_root_wait; - cmp.len = slash - uuid_str; - } else { - cmp.len = strlen(uuid_str); - } - - if (!cmp.len) - goto clear_root_wait; - - dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid); - if (!dev) - return 0; - - if (offset) { - /* - * Attempt to find the requested partition by adding an offset - * to the partition number found by UUID. - */ - devt = part_devt(dev_to_disk(dev), - dev_to_bdev(dev)->bd_partno + offset); - } else { - devt = dev->devt; - } - - put_device(dev); - return devt; - -clear_root_wait: - pr_err("VFS: PARTUUID= is invalid.\n" - "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n"); - if (root_wait) - pr_err("Disabling rootwait; root= is invalid.\n"); - root_wait = 0; - return 0; -} - -/** - * match_dev_by_label - callback for finding a partition using its label - * @dev: device passed in by the caller - * @data: opaque pointer to the label to match - * - * Returns 1 if the device matches, and 0 otherwise. - */ -static int match_dev_by_label(struct device *dev, const void *data) -{ - struct block_device *bdev = dev_to_bdev(dev); - const char *label = data; - - if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname)) - return 0; - return 1; -} - -static dev_t devt_from_partlabel(const char *label) -{ - struct device *dev; - dev_t devt = 0; - - dev = class_find_device(&block_class, NULL, label, &match_dev_by_label); - if (dev) { - devt = dev->devt; - put_device(dev); - } - - return devt; -} - -static dev_t devt_from_devname(const char *name) -{ - dev_t devt = 0; - int part; - char s[32]; - char *p; - - if (strlen(name) > 31) - return 0; - strcpy(s, name); - for (p = s; *p; p++) { - if (*p == '/') - *p = '!'; - } - - devt = blk_lookup_devt(s, 0); - if (devt) - return devt; - - /* - * Try non-existent, but valid partition, which may only exist after - * opening the device, like partitioned md devices. - */ - while (p > s && isdigit(p[-1])) - p--; - if (p == s || !*p || *p == '0') - return 0; - - /* try disk name without <part number> */ - part = simple_strtoul(p, NULL, 10); - *p = '\0'; - devt = blk_lookup_devt(s, part); - if (devt) - return devt; - - /* try disk name without p<part number> */ - if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p') - return 0; - p[-1] = '\0'; - return blk_lookup_devt(s, part); -} -#endif /* CONFIG_BLOCK */ static dev_t devt_from_devnum(const char *name) { @@ -284,7 +122,7 @@ dev_t name_to_dev_t(const char *name) return Root_RAM0; #ifdef CONFIG_BLOCK if (strncmp(name, "PARTUUID=", 9) == 0) - return devt_from_partuuid(name + 9); + return devt_from_partuuid(name + 9, &root_wait); if (strncmp(name, "PARTLABEL=", 10) == 0) return devt_from_partlabel(name + 10); if (strncmp(name, "/dev/", 5) == 0)
In init/do_mounts.c there are helper functions devt_from_partuuid, devt_from_partlabel and devt_from_devname. In order to make these functions available to other users, move them to block/bdev.c. Signed-off-by: Daniel Golle <daniel@makrotopia.org> --- block/bdev.c | 166 +++++++++++++++++++++++++++++++++++++++++ include/linux/blkdev.h | 15 ++++ init/do_mounts.c | 166 +---------------------------------------- 3 files changed, 183 insertions(+), 164 deletions(-)