Message ID | SJ2PR10MB7082DDA3ECBDD9A7A649F558D87E2@SJ2PR10MB7082.namprd10.prod.outlook.com (mailing list archive) |
---|---|
State | RFC |
Series | [RFC,V8] md/bitmap: Optimize lock contention. |
Context | Check | Description |
---|---|---|
mdraidci/vmtest-md-6_12-PR | fail | merge-conflict |
On 10/8/24 12:24 AM, Shushu Yi wrote:
> Optimize scattered address space. Achieves significant improvements in
> both throughput and latency.
>
> Maximize thread-level parallelism and reduce CPU suspension time caused
> by lock contention. Achieve increased overall storage throughput by
> 89.4% on a system with four PCIe 4.0 SSDs. (Set the iodepth to 32 and
> employ libaio. Configure the I/O size as 4 KB with sequential write and
> 16 threads. In a RAID5 consisting of 2+1 1TB Samsung 980Pro SSDs,
> throughput went from 5218MB/s to 9884MB/s.)
>

Thanks for the rebase! I'll get a system starting performance tests on
this later today. As this is a rebase I will simply spot check with one
vendor to make sure it tracks well with previous results.

-Paul

> Specifically:
> Customize different types of lock structures (mlock and bmclocks) to
> manage data and metadata by their own characteristics. Scatter each
> segment across the entire address range in the storage such that CPU
> threads can be interleaved to access different segments.
> The counter lock is also used to protect the metadata update of the
> counter table. Modifying both the counter values and their metadata
> simultaneously can result in memory faults. mdraid rarely updates the
> metadata of the counter table. Thus, employ a readers-writer lock
> mechanism to protect the metadata, which can reap the most benefits
> from this condition.
> Before updating the metadata, the CPU thread acquires the writer lock.
> Otherwise, if the CPU threads need to revise the counter values, they
> apply for the reader locks and the counter locks successively.
> Sequential stripe heads are spread across different locations in the
> SSDs via a configurable hash function rather than mapping to a
> continuous SSD space. Thus, sequential stripe heads are dispersed
> uniformly across the whole space. Sequential write requests are
> shuffled to access scattered space, which effectively reduces counter
> preemption.
>
> Note: This work was published as a paper; see
> https://www.hotstorage.org/2022/camera-ready/hotstorage22-5/pdf/hotstorage22-5.pdf.
>
> Co-developed-by: Yiming Xu <teddyxym@outlook.com>
> Signed-off-by: Yiming Xu <teddyxym@outlook.com>
> Signed-off-by: Shushu Yi <firnyee@gmail.com>
> Tested-by: Paul Luse <paul.e.luse@intel.com>
> ---
> V1 -> V2: Cleaned up coding style and divided the work into 2 patches
> (HemiRAID and ScalaRAID, corresponding to the paper mentioned above).
> ScalaRAID equipped every counter with a counter lock and employed our
> D-Block. HemiRAID increased the number of stripe locks to 128.
> V2 -> V3: Adjusted the language used in the subject and changelog.
> Since patch 1/2 in V2 cannot be used independently and does not
> encompass all of our work, it has been merged into a single patch.
> V3 -> V4: Fixed incorrect sending address and changelog format.
> V4 -> V5: Resolved an address conflict on main (commit
> f0e729af2eb6bee9eb58c4df1087f14ebaefe26b (HEAD -> md-6.10, tag:
> md-6.10-20240502, origin/md-6.10)).
> V5 -> V6: Restored changes to the number of NR_STRIPE_HASH_LOCKS. Set
> a maximum number of bmclocks (65536) to prevent excessive memory
> usage. Optimized the number of bmclocks in RAID5 when the capacity of
> each SSD is less than or equal to 4TB; it still performs well when
> the capacity of each SSD is greater than 4TB. 
> V6 -> V7: Changed according to Song Liu's comments: redundant > "locktypes" were subtracted, unnecessary fixes were removed, NULL > Pointers were changed from "0" to "NULL", code comments were added, and > commit logs were refined. > V7 -> V8: Rebase onto newest Linux (6.12-rc2). > > --- > drivers/md/md-bitmap.c | 211 +++++++++++++++++++++++++++++++---------- > drivers/md/md-bitmap.h | 9 +- > 2 files changed, 170 insertions(+), 50 deletions(-) > > diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c > index 29da10e6f703e..793959a533ae4 100644 > --- a/drivers/md/md-bitmap.c > +++ b/drivers/md/md-bitmap.c > @@ -149,7 +149,18 @@ struct bitmap_page { > struct bitmap { > > struct bitmap_counts { > - spinlock_t lock; > + /* > + * Customize different types of lock structures to manage > + * data and metadata. > + * Split the counter table into multiple segments and assigns a > + * dedicated lock to each segment. The counters in the counter > + * table, which map to neighboring stripe blocks, are interleaved > + * across different segments. > + * CPU threads that target different segments can acquire the > locks > + * simultaneously, resulting in better thread-level parallelism. > + */ > + rwlock_t mlock; /* lock for metadata */ > + spinlock_t *bmclocks; /* locks for bmc */ > struct bitmap_page *bp; > /* total number of pages in the bitmap */ > unsigned long pages; > @@ -246,10 +257,12 @@ static bool bitmap_enabled(struct mddev *mddev) > * if we find our page, we increment the page's refcount so that it stays > * allocated while we're using it > */ > -static int md_bitmap_checkpage(struct bitmap_counts *bitmap, > - unsigned long page, int create, int no_hijack) > -__releases(bitmap->lock) > -__acquires(bitmap->lock) > +static int md_bitmap_checkpage(struct bitmap_counts *bitmap, unsigned > long page, > + int create, int no_hijack, spinlock_t *bmclock) > +__releases(bmclock) > +__acquires(bmclock) > +__releases(bitmap->mlock) > +__acquires(bitmap->mlock) > { > unsigned char *mappage; > > @@ -264,8 +277,10 @@ __acquires(bitmap->lock) > return -ENOENT; > > /* this page has not been allocated yet */ > - > - spin_unlock_irq(&bitmap->lock); > + if (bmclock) > + spin_unlock_irq(bmclock); /* lock for bmc */ > + else > + write_unlock_irq(&bitmap->mlock); /* lock for metadata */ > /* It is possible that this is being called inside a > * prepare_to_wait/finish_wait loop from raid5c:make_request(). > * In general it is not permitted to sleep in that context as it > @@ -280,7 +295,11 @@ __acquires(bitmap->lock) > */ > sched_annotate_sleep(); > mappage = kzalloc(PAGE_SIZE, GFP_NOIO); > - spin_lock_irq(&bitmap->lock); > + > + if (bmclock) > + spin_lock_irq(bmclock); /* lock for bmc */ > + else > + write_lock_irq(&bitmap->mlock); /* lock for metadata */ > > if (mappage == NULL) { > pr_debug("md/bitmap: map page allocation failed, hijacking\n"); > @@ -1456,16 +1475,35 @@ static void bitmap_write_all(struct mddev *mddev) > static void md_bitmap_count_page(struct bitmap_counts *bitmap, > sector_t offset, int inc) > { > - sector_t chunk = offset >> bitmap->chunkshift; > - unsigned long page = chunk >> PAGE_COUNTER_SHIFT; > + /* > + * The stripe heads are spread across different locations in the > + * SSDs via a configurable hash function rather than mapping to a > + * continuous SSD space. > + * Sequential write requests are shuffled to different counter to > + * reduce the counter preemption. 
> + */ > + sector_t blockno = offset >> (PAGE_SHIFT - SECTOR_SHIFT); > + sector_t totblocks = bitmap->chunks << (bitmap->chunkshift - > (PAGE_SHIFT - SECTOR_SHIFT)); > + unsigned long bits = totblocks ? fls((totblocks - 1)) : 0; > + unsigned long mask = ULONG_MAX << bits | ~(ULONG_MAX << > + (bits - (bitmap->chunkshift + SECTOR_SHIFT - > PAGE_SHIFT))); > + unsigned long cntid = blockno & mask; > + unsigned long page = cntid >> PAGE_COUNTER_SHIFT; > + > bitmap->bp[page].count += inc; > md_bitmap_checkfree(bitmap, page); > } > > static void md_bitmap_set_pending(struct bitmap_counts *bitmap, > sector_t offset) > { > - sector_t chunk = offset >> bitmap->chunkshift; > - unsigned long page = chunk >> PAGE_COUNTER_SHIFT; > + sector_t blockno = offset >> (PAGE_SHIFT - SECTOR_SHIFT); > + sector_t totblocks = bitmap->chunks << (bitmap->chunkshift - > (PAGE_SHIFT - SECTOR_SHIFT)); > + unsigned long bits = totblocks ? fls((totblocks - 1)) : 0; > + unsigned long mask = ULONG_MAX << bits | ~(ULONG_MAX << > + (bits - (bitmap->chunkshift + SECTOR_SHIFT - > PAGE_SHIFT))); > + unsigned long cntid = blockno & mask; > + unsigned long page = cntid >> PAGE_COUNTER_SHIFT; > + > struct bitmap_page *bp = &bitmap->bp[page]; > > if (!bp->pending) > @@ -1474,7 +1512,7 @@ static void md_bitmap_set_pending(struct > bitmap_counts *bitmap, sector_t offset) > > static bitmap_counter_t *md_bitmap_get_counter(struct bitmap_counts > *bitmap, > sector_t offset, sector_t *blocks, > - int create); > + int create, spinlock_t *bmclock); > > static void mddev_set_timeout(struct mddev *mddev, unsigned long timeout, > bool force) > @@ -1557,7 +1595,7 @@ static void bitmap_daemon_work(struct mddev *mddev) > * decrement and handle accordingly. > */ > counts = &bitmap->counts; > - spin_lock_irq(&counts->lock); > + write_lock_irq(&counts->mlock); > nextpage = 0; > for (j = 0; j < counts->chunks; j++) { > bitmap_counter_t *bmc; > @@ -1572,7 +1610,7 @@ static void bitmap_daemon_work(struct mddev *mddev) > counts->bp[j >> PAGE_COUNTER_SHIFT].pending = 0; > } > > - bmc = md_bitmap_get_counter(counts, block, &blocks, 0); > + bmc = md_bitmap_get_counter(counts, block, &blocks, 0, NULL); > if (!bmc) { > j |= PAGE_COUNTER_MASK; > continue; > @@ -1588,7 +1626,7 @@ static void bitmap_daemon_work(struct mddev *mddev) > bitmap->allclean = 0; > } > } > - spin_unlock_irq(&counts->lock); > + write_unlock_irq(&counts->mlock); > > md_bitmap_wait_writes(bitmap); > /* Now start writeout on any page in NEEDWRITE that isn't DIRTY. > @@ -1621,17 +1659,25 @@ static void bitmap_daemon_work(struct mddev *mddev) > > static bitmap_counter_t *md_bitmap_get_counter(struct bitmap_counts > *bitmap, > sector_t offset, sector_t *blocks, > - int create) > -__releases(bitmap->lock) > -__acquires(bitmap->lock) > + int create, spinlock_t *bmclock) > +__releases(bmclock) > +__acquires(bmclock) > +__releases(bitmap->mlock) > +__acquires(bitmap->mlock) > { > /* If 'create', we might release the lock and reclaim it. > * The lock must have been taken with interrupts enabled. > * If !create, we don't release the lock. > */ > - sector_t chunk = offset >> bitmap->chunkshift; > - unsigned long page = chunk >> PAGE_COUNTER_SHIFT; > - unsigned long pageoff = (chunk & PAGE_COUNTER_MASK) << > COUNTER_BYTE_SHIFT; > + sector_t blockno = offset >> (PAGE_SHIFT - SECTOR_SHIFT); > + sector_t totblocks = bitmap->chunks << (bitmap->chunkshift - > (PAGE_SHIFT - SECTOR_SHIFT)); > + unsigned long bits = totblocks ? 
fls((totblocks - 1)) : 0; > + unsigned long mask = ULONG_MAX << bits | ~(ULONG_MAX << > + (bits - (bitmap->chunkshift + SECTOR_SHIFT - > PAGE_SHIFT))); > + unsigned long cntid = blockno & mask; > + unsigned long page = cntid >> PAGE_COUNTER_SHIFT; > + unsigned long pageoff = (cntid & PAGE_COUNTER_MASK) << > COUNTER_BYTE_SHIFT; > + > sector_t csize = ((sector_t)1) << bitmap->chunkshift; > int err; > > @@ -1644,7 +1690,7 @@ __acquires(bitmap->lock) > *blocks = csize - (offset & (csize - 1)); > return NULL; > } > - err = md_bitmap_checkpage(bitmap, page, create, 0); > + err = md_bitmap_checkpage(bitmap, page, create, 0, bmclock); > > if (bitmap->bp[page].hijacked || > bitmap->bp[page].map == NULL) > @@ -1669,6 +1715,28 @@ __acquires(bitmap->lock) > &(bitmap->bp[page].map[pageoff]); > } > > +/* set-association */ > +static spinlock_t *md_bitmap_get_bmclock(struct bitmap_counts *bitmap, > sector_t offset); > + > +static spinlock_t *md_bitmap_get_bmclock(struct bitmap_counts *bitmap, > sector_t offset) > +{ > + sector_t blockno = offset >> (PAGE_SHIFT - SECTOR_SHIFT); > + sector_t totblocks = bitmap->chunks << (bitmap->chunkshift - > (PAGE_SHIFT - SECTOR_SHIFT)); > + unsigned long bitscnt = totblocks ? fls((totblocks - 1)) : 0; > + unsigned long maskcnt = ULONG_MAX << bitscnt | ~(ULONG_MAX << > (bitscnt - > + (bitmap->chunkshift + SECTOR_SHIFT - PAGE_SHIFT))); > + unsigned long cntid = blockno & maskcnt; > + > + unsigned long totcnts = bitmap->chunks; > + unsigned long bitslock = totcnts ? fls((totcnts - 1)) : 0; > + unsigned long masklock = ULONG_MAX << bitslock | ~(ULONG_MAX << > + (bitslock - BITMAP_COUNTER_LOCK_RATIO_SHIFT)); > + unsigned long lockid = cntid & masklock; > + > + spinlock_t *bmclock = &(bitmap->bmclocks[lockid]); > + return bmclock; > +} > + > static int bitmap_startwrite(struct mddev *mddev, sector_t offset, > unsigned long sectors, bool behind) > { > @@ -1691,11 +1759,15 @@ static int bitmap_startwrite(struct mddev > *mddev, sector_t offset, > while (sectors) { > sector_t blocks; > bitmap_counter_t *bmc; > + spinlock_t *bmclock; > > - spin_lock_irq(&bitmap->counts.lock); > - bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1); > + bmclock = md_bitmap_get_bmclock(&bitmap->counts, offset); > + read_lock(&bitmap->counts.mlock); > + spin_lock_irq(bmclock); > + bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, > 1, bmclock); > if (!bmc) { > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > return 0; > } > > @@ -1707,7 +1779,8 @@ static int bitmap_startwrite(struct mddev *mddev, > sector_t offset, > */ > prepare_to_wait(&bitmap->overflow_wait, &__wait, > TASK_UNINTERRUPTIBLE); > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > schedule(); > finish_wait(&bitmap->overflow_wait, &__wait); > continue; > @@ -1724,7 +1797,8 @@ static int bitmap_startwrite(struct mddev *mddev, > sector_t offset, > > (*bmc)++; > > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > > offset += blocks; > if (sectors > blocks) > @@ -1755,11 +1829,15 @@ static void bitmap_endwrite(struct mddev *mddev, > sector_t offset, > sector_t blocks; > unsigned long flags; > bitmap_counter_t *bmc; > + spinlock_t *bmclock; > > - spin_lock_irqsave(&bitmap->counts.lock, flags); > - bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 0); > + bmclock = md_bitmap_get_bmclock(&bitmap->counts, offset); > + 
read_lock(&bitmap->counts.mlock); > + spin_lock_irqsave(bmclock, flags); > + bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, > 0, bmclock); > if (!bmc) { > - spin_unlock_irqrestore(&bitmap->counts.lock, flags); > + spin_unlock_irqrestore(bmclock, flags); > + read_unlock(&bitmap->counts.mlock); > return; > } > > @@ -1781,7 +1859,8 @@ static void bitmap_endwrite(struct mddev *mddev, > sector_t offset, > md_bitmap_set_pending(&bitmap->counts, offset); > bitmap->allclean = 0; > } > - spin_unlock_irqrestore(&bitmap->counts.lock, flags); > + spin_unlock_irqrestore(bmclock, flags); > + read_unlock(&bitmap->counts.mlock); > offset += blocks; > if (sectors > blocks) > sectors -= blocks; > @@ -1794,16 +1873,19 @@ static bool __bitmap_start_sync(struct bitmap > *bitmap, sector_t offset, > sector_t *blocks, bool degraded) > { > bitmap_counter_t *bmc; > + spinlock_t *bmclock; > bool rv; > > if (bitmap == NULL) {/* FIXME or bitmap set as 'failed' */ > *blocks = 1024; > return true; /* always resync if no bitmap */ > } > - spin_lock_irq(&bitmap->counts.lock); > + bmclock = md_bitmap_get_bmclock(&bitmap->counts, offset); > + read_lock(&bitmap->counts.mlock); > + spin_lock_irq(bmclock); > > rv = false; > - bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); > + bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0, > bmclock); > if (bmc) { > /* locked */ > if (RESYNC(*bmc)) { > @@ -1816,7 +1898,8 @@ static bool __bitmap_start_sync(struct bitmap > *bitmap, sector_t offset, > } > } > } > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > > return rv; > } > @@ -1850,13 +1933,16 @@ static void __bitmap_end_sync(struct bitmap > *bitmap, sector_t offset, > { > bitmap_counter_t *bmc; > unsigned long flags; > + spinlock_t *bmclock; > > if (bitmap == NULL) { > *blocks = 1024; > return; > } > - spin_lock_irqsave(&bitmap->counts.lock, flags); > - bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); > + bmclock = md_bitmap_get_bmclock(&bitmap->counts, offset); > + read_lock(&bitmap->counts.mlock); > + spin_lock_irqsave(bmclock, flags); > + bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0, > bmclock); > if (bmc == NULL) > goto unlock; > /* locked */ > @@ -1873,7 +1959,8 @@ static void __bitmap_end_sync(struct bitmap > *bitmap, sector_t offset, > } > } > unlock: > - spin_unlock_irqrestore(&bitmap->counts.lock, flags); > + spin_unlock_irqrestore(bmclock, flags); > + read_unlock(&bitmap->counts.mlock); > } > > static void bitmap_end_sync(struct mddev *mddev, sector_t offset, > @@ -1961,10 +2048,15 @@ static void md_bitmap_set_memory_bits(struct > bitmap *bitmap, sector_t offset, in > > sector_t secs; > bitmap_counter_t *bmc; > - spin_lock_irq(&bitmap->counts.lock); > - bmc = md_bitmap_get_counter(&bitmap->counts, offset, &secs, 1); > + spinlock_t *bmclock; > + > + bmclock = md_bitmap_get_bmclock(&bitmap->counts, offset); > + read_lock(&bitmap->counts.mlock); > + spin_lock_irq(bmclock); > + bmc = md_bitmap_get_counter(&bitmap->counts, offset, &secs, 1, > bmclock); > if (!bmc) { > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > return; > } > if (!*bmc) { > @@ -1975,7 +2067,8 @@ static void md_bitmap_set_memory_bits(struct > bitmap *bitmap, sector_t offset, in > } > if (needed) > *bmc |= NEEDED_MASK; > - spin_unlock_irq(&bitmap->counts.lock); > + spin_unlock_irq(bmclock); > + read_unlock(&bitmap->counts.mlock); > } > > /* dirty the memory 
and file bits for bitmap chunks "s" to "e" */ > @@ -2030,6 +2123,7 @@ static void md_bitmap_free(void *data) > unsigned long k, pages; > struct bitmap_page *bp; > struct bitmap *bitmap = data; > + spinlock_t *bmclocks; > > if (!bitmap) /* there was no bitmap */ > return; > @@ -2050,6 +2144,7 @@ static void md_bitmap_free(void *data) > > bp = bitmap->counts.bp; > pages = bitmap->counts.pages; > + bmclocks = bitmap->counts.bmclocks; > > /* free all allocated memory */ > > @@ -2058,6 +2153,7 @@ static void md_bitmap_free(void *data) > if (bp[k].map && !bp[k].hijacked) > kfree(bp[k].map); > kfree(bp); > + kfree(bmclocks); > kfree(bitmap); > } > > @@ -2123,7 +2219,9 @@ static struct bitmap *__bitmap_create(struct mddev > *mddev, int slot) > if (!bitmap) > return ERR_PTR(-ENOMEM); > > - spin_lock_init(&bitmap->counts.lock); > + /* initialize metadata lock */ > + rwlock_init(&bitmap->counts.mlock); > + > atomic_set(&bitmap->pending_writes, 0); > init_waitqueue_head(&bitmap->write_wait); > init_waitqueue_head(&bitmap->overflow_wait); > @@ -2382,6 +2480,8 @@ static int __bitmap_resize(struct bitmap *bitmap, > sector_t blocks, > int ret = 0; > long pages; > struct bitmap_page *new_bp; > + spinlock_t *new_bmclocks; > + int num_bmclocks, i; > > if (bitmap->storage.file && !init) { > pr_info("md: cannot resize file-based bitmap\n"); > @@ -2450,7 +2550,7 @@ static int __bitmap_resize(struct bitmap *bitmap, > sector_t blocks, > memcpy(page_address(store.sb_page), > page_address(bitmap->storage.sb_page), > sizeof(bitmap_super_t)); > - spin_lock_irq(&bitmap->counts.lock); > + write_lock_irq(&bitmap->counts.mlock); > md_bitmap_file_unmap(&bitmap->storage); > bitmap->storage = store; > > @@ -2466,11 +2566,23 @@ static int __bitmap_resize(struct bitmap > *bitmap, sector_t blocks, > blocks = min(old_counts.chunks << old_counts.chunkshift, > chunks << chunkshift); > > + /* initialize bmc locks */ > + num_bmclocks = DIV_ROUND_UP(chunks, BITMAP_COUNTER_LOCK_RATIO); > + num_bmclocks = min(num_bmclocks, BITMAP_COUNTER_LOCK_MAX); > + > + new_bmclocks = kvcalloc(num_bmclocks, sizeof(*new_bmclocks), > GFP_KERNEL); > + bitmap->counts.bmclocks = new_bmclocks; > + for (i = 0; i < num_bmclocks; ++i) { > + spinlock_t *bmclock = &(bitmap->counts.bmclocks)[i]; > + > + spin_lock_init(bmclock); > + } > + > /* For cluster raid, need to pre-allocate bitmap */ > if (mddev_is_clustered(bitmap->mddev)) { > unsigned long page; > for (page = 0; page < pages; page++) { > - ret = md_bitmap_checkpage(&bitmap->counts, page, 1, 1); > + ret = md_bitmap_checkpage(&bitmap->counts, page, 1, 1, NULL); > if (ret) { > unsigned long k; > > @@ -2500,11 +2612,12 @@ static int __bitmap_resize(struct bitmap > *bitmap, sector_t blocks, > bitmap_counter_t *bmc_old, *bmc_new; > int set; > > - bmc_old = md_bitmap_get_counter(&old_counts, block, > &old_blocks, 0); > + bmc_old = md_bitmap_get_counter(&old_counts, block, > &old_blocks, 0, NULL); > set = bmc_old && NEEDED(*bmc_old); > > if (set) { > - bmc_new = md_bitmap_get_counter(&bitmap->counts, block, > &new_blocks, 1); > + bmc_new = md_bitmap_get_counter(&bitmap->counts, block, > &new_blocks, > + 1, NULL); > if (bmc_new) { > if (*bmc_new == 0) { > /* need to set on-disk bits too. 
*/ > @@ -2540,7 +2653,7 @@ static int __bitmap_resize(struct bitmap *bitmap, > sector_t blocks, > int i; > while (block < (chunks << chunkshift)) { > bitmap_counter_t *bmc; > - bmc = md_bitmap_get_counter(&bitmap->counts, block, > &new_blocks, 1); > + bmc = md_bitmap_get_counter(&bitmap->counts, block, > &new_blocks, 1, NULL); > if (bmc) { > /* new space. It needs to be resynced, so > * we set NEEDED_MASK. > @@ -2556,7 +2669,7 @@ static int __bitmap_resize(struct bitmap *bitmap, > sector_t blocks, > for (i = 0; i < bitmap->storage.file_pages; i++) > set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); > } > - spin_unlock_irq(&bitmap->counts.lock); > + write_unlock_irq(&bitmap->counts.mlock); > > if (!init) { > __bitmap_unplug(bitmap); > diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h > index 662e6fc141a77..93da6d836b579 100644 > --- a/drivers/md/md-bitmap.h > +++ b/drivers/md/md-bitmap.h > @@ -2,7 +2,9 @@ > /* > * bitmap.h: Copyright (C) Peter T. Breuer (ptb@ot.uc3m.es) 2003 > * > - * additions: Copyright (C) 2003-2004, Paul Clements, SteelEye > Technology, Inc. > + * additions: > + * Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, > Inc. > + * Copyright (C) 2022-2023, Shushu Yi (firnyee@gmail.com) > */ > #ifndef BITMAP_H > #define BITMAP_H 1 > @@ -18,6 +20,11 @@ typedef __u16 bitmap_counter_t; > #define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) > #define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1) > > +/* how many counters share the same bmclock? */ > +#define BITMAP_COUNTER_LOCK_RATIO_SHIFT 0 > +#define BITMAP_COUNTER_LOCK_RATIO (1 << BITMAP_COUNTER_LOCK_RATIO_SHIFT) > +#define BITMAP_COUNTER_LOCK_MAX 65536 > + > /* use these for bitmap->flags and bitmap->sb->state bit-fields */ > enum bitmap_state { > BITMAP_STALE = 1, /* the bitmap file is out of date or had > -EIO */
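For readers working through the mask arithmetic above (blockno, totblocks, cntid and the bmclock lookup), here is a small userspace sketch of the same mapping. It assumes PAGE_SHIFT = 12 and SECTOR_SHIFT = 9 as on x86-64, open-codes fls() since the kernel helper is not available, and uses a made-up geometry in main(); it is an illustration of the scattering idea, not the patch's own code.

/*
 * Userspace sketch of the counter scattering used by the patch.
 * Assumptions (not from the patch itself): PAGE_SHIFT = 12 and
 * SECTOR_SHIFT = 9 as on x86-64; fls() is open-coded; the geometry
 * in main() is made up for illustration.
 */
#include <stdio.h>
#include <limits.h>

#define PAGE_SHIFT	12
#define SECTOR_SHIFT	9
#define BITMAP_COUNTER_LOCK_RATIO_SHIFT	0	/* counters per lock = 1 << this */

/* 1-based index of the highest set bit, 0 for x == 0 (like kernel fls()) */
static unsigned long fls_ul(unsigned long x)
{
	unsigned long r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Map a sector offset to a scattered counter id, mirroring the patch. */
static unsigned long counter_id(unsigned long offset, unsigned long chunks,
				unsigned long chunkshift)
{
	unsigned long blockno = offset >> (PAGE_SHIFT - SECTOR_SHIFT);
	unsigned long totblocks = chunks << (chunkshift - (PAGE_SHIFT - SECTOR_SHIFT));
	unsigned long bits = totblocks ? fls_ul(totblocks - 1) : 0;
	unsigned long mask = ULONG_MAX << bits |
		~(ULONG_MAX << (bits - (chunkshift + SECTOR_SHIFT - PAGE_SHIFT)));

	/*
	 * The counter index is taken from the low-order bits of the 4 KB
	 * block number, so consecutive blocks land on different counters
	 * instead of a whole chunk of blocks sharing a single counter.
	 */
	return blockno & mask;
}

/* Map a counter id to the index of the spinlock that guards it. */
static unsigned long lock_id(unsigned long cntid, unsigned long chunks)
{
	unsigned long bitslock = chunks ? fls_ul(chunks - 1) : 0;
	unsigned long masklock = ULONG_MAX << bitslock |
		~(ULONG_MAX << (bitslock - BITMAP_COUNTER_LOCK_RATIO_SHIFT));

	return cntid & masklock;
}

int main(void)
{
	/* Toy geometry: 64 chunks, chunkshift 17 (64 MiB chunks). */
	unsigned long chunks = 64, chunkshift = 17;
	unsigned long off;

	/* Consecutive 4 KB blocks spread across different counters/locks. */
	for (off = 0; off < 64; off += 1UL << (PAGE_SHIFT - SECTOR_SHIFT)) {
		unsigned long cnt = counter_id(off, chunks, chunkshift);

		printf("sector offset %3lu -> counter %2lu, lock %2lu\n",
		       off, cnt, lock_id(cnt, chunks));
	}
	return 0;
}

With this toy geometry the loop prints counter ids 0 through 7 for eight consecutive 4 KB blocks, i.e. a sequential write stream no longer piles onto a single counter and its lock.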
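The lock ordering that the changelog describes, readers taking mlock shared and then the per-segment counter lock while metadata updates take mlock exclusively, can be sketched with pthreads standing in for the kernel's rwlock_t and spinlock_t (irq disabling is omitted; NSEGS, the struct layout and function names below are illustrative, not from the patch):

/*
 * Userspace sketch of the two-level locking protocol described above.
 * pthreads primitives stand in for the kernel's rwlock_t/spinlock_t;
 * build with -pthread.
 */
#include <pthread.h>
#include <stdint.h>

#define NSEGS 64

struct counts {
	pthread_rwlock_t mlock;            /* protects counter-table metadata */
	pthread_mutex_t  seglock[NSEGS];   /* one lock per counter segment    */
	uint16_t         counter[NSEGS];   /* toy: one counter per segment    */
};

/* Counter update: shared hold on mlock, exclusive hold on one segment. */
static void counter_inc(struct counts *c, unsigned int seg)
{
	pthread_rwlock_rdlock(&c->mlock);      /* metadata may not change ...  */
	pthread_mutex_lock(&c->seglock[seg]);  /* ... while we touch a counter */
	c->counter[seg]++;
	pthread_mutex_unlock(&c->seglock[seg]);
	pthread_rwlock_unlock(&c->mlock);
}

/*
 * Metadata update (e.g. a resize): the writer side alone excludes every
 * counter update, because all counter updaters hold the reader side.
 */
static void metadata_update(struct counts *c)
{
	pthread_rwlock_wrlock(&c->mlock);
	/* ... reallocate / rewire the counter table here ... */
	pthread_rwlock_unlock(&c->mlock);
}

int main(void)
{
	struct counts c;
	unsigned int i;

	pthread_rwlock_init(&c.mlock, NULL);
	for (i = 0; i < NSEGS; i++) {
		pthread_mutex_init(&c.seglock[i], NULL);
		c.counter[i] = 0;
	}
	counter_inc(&c, 3);
	metadata_update(&c);
	return 0;
}

Threads updating different segments proceed in parallel under the shared mlock, which is the source of the thread-level parallelism the changelog claims; only the rare metadata path serializes everything.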