Message ID | 20250328060853.4124527-1-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
Headers | show |
Series | md: introduce a new lockless bitmap | expand |
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote: > A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in > llbitmap_add_disk(). And a file is created as well to manage bitmap IO for > this disk, see details in llbitmap_open_disk(). Read/write bitmap is > converted to buffer IO to this file. > > IO fast path will set bits to dirty, and those dirty bits will be cleared > by daemon after IO is done. llbitmap_barrier is used to syncronize between > IO path and daemon; Why do you need a separate gendisk? I'll try to find some time to read the code to understand what it does, but it would also be really useful to explain the need for such an unusual concept here.
Hi, 在 2025/03/28 19:06, Christoph Hellwig 写道: > On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote: >> A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in >> llbitmap_add_disk(). And a file is created as well to manage bitmap IO for >> this disk, see details in llbitmap_open_disk(). Read/write bitmap is >> converted to buffer IO to this file. >> >> IO fast path will set bits to dirty, and those dirty bits will be cleared >> by daemon after IO is done. llbitmap_barrier is used to syncronize between >> IO path and daemon; > > Why do you need a separate gendisk? I'll try to find some time to read > the code to understand what it does, but it would also be really useful > to explain the need for such an unusual concept here. The purpose here is to hide the low level bitmap IO implementation to the API disk->submit_bio(), and the bitmap IO can be converted to buffer IO to the bdev_file. This is the easiest way that I can think of to resue the pagecache, with natural ability for dirty page writeback. I do think about creating a new anon file and implement a new file_operations, this will be much more complicated. Meanwhile, bitmap file for the old bitmap will be removed sooner or later, and this bdev_file implementation will compatible with bitmap file as well. Thanks, Kuai > > . >
From: Yu Kuai <yukuai3@huawei.com> #### Background Redundant data is used to enhance data fault tolerance, and the storage method for redundant data vary depending on the RAID levels. And it's important to maintain the consistency of redundant data. Bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after power failure or readding a disk. If there is no bitmap, a full disk synchronization is required. #### Key Concept ##### State Machine Each bit is one byte, contain 6 difference state, see llbitmap_state. And there are total 8 differenct actions, see llbitmap_action, can change state: llbitmap state machine: transitions between states | | Startwrite | Startsync | Endsync | Abortsync| Reload | Daemon | Discard | Stale | | --------- | ---------- | --------- | ------- | ------- | -------- | ------ | --------- | --------- | | Unwritten | Dirty | x | x | x | x | x | x | x | | Clean | Dirty | x | x | x | x | x | Unwritten | NeedSync | | Dirty | x | x | x | x | NeedSync | Clean | Unwritten | NeedSync | | NeedSync | x | Syncing | x | x | x | x | Unwritten | x | | Syncing | x | Syncing | Dirty | NeedSync | NeedSync | x | Unwritten | NeedSync | special illustration: - Unwritten is special state, which means user never write data, hence there is no need to resync/recover data. This is safe if user create filesystems for the array, filesystem will make sure user will get zero data for unwritten blocks. - After resync is done, change state from Syncing to Dirty first, in case Startwrite happen before the state is Clean. ##### Bitmap IO A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in llbitmap_add_disk(). And a file is created as well to manage bitmap IO for this disk, see details in llbitmap_open_disk(). Read/write bitmap is converted to buffer IO to this file. IO fast path will set bits to dirty, and those dirty bits will be cleared by daemon after IO is done. llbitmap_barrier is used to syncronize between IO path and daemon; Test result: to be added Noted: 1) user must apply the following mdadm patch, and then llbitmap can be enabled by --bitmap=lockless https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/ 2) this set is cooked on the top of my other set: https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/ Yu Kuai (14): block: factor out a helper bdev_file_alloc() md/md-bitmap: pass discard information to bitmap_{start, end}write md/md-bitmap: remove parameter slot from bitmap_create() md: add a new sysfs api bitmap_version md: delay registeration of bitmap_ops until creating bitmap md/md-llbitmap: implement bit state machine md/md-llbitmap: implement hidden disk to manage bitmap IO md/md-llbitmap: implement APIs for page level dirty bits synchronization md/md-llbitmap: implement APIs to mange bitmap lifetime md/md-llbitmap: implement APIs to dirty bits and clear bits md/md-llbitmap: implement APIs for sync_thread md/md-llbitmap: implement all bitmap operations md/md-llbitmap: implement sysfs APIs md/md-llbitmap: add Kconfig block/bdev.c | 21 +- drivers/md/Kconfig | 12 + drivers/md/Makefile | 1 + drivers/md/md-bitmap.c | 10 +- drivers/md/md-bitmap.h | 21 +- drivers/md/md-llbitmap.c | 1410 ++++++++++++++++++++++++++++++++++++++ drivers/md/md.c | 180 ++++- drivers/md/md.h | 3 + include/linux/blkdev.h | 1 + 9 files changed, 1614 insertions(+), 45 deletions(-) create mode 100644 drivers/md/md-llbitmap.c