
[PATCHv4,01/17] zram: switch to non-atomic entry locking

Message ID 20250131090658.3386285-2-senozhatsky@chromium.org (mailing list archive)
State New
Series zsmalloc/zram: there be preemption

Commit Message

Sergey Senozhatsky Jan. 31, 2025, 9:06 a.m. UTC
Concurrent modifications of meta table entries are now handled
by a per-entry spin-lock.  This has a number of shortcomings.

First, this imposes atomic requirements on compression backends.
zram can call both zcomp_compress() and zcomp_decompress() under
entry spin-lock, which implies that we can use only compression
algorithms that don't schedule/sleep/wait during compression and
decompression.  This, for instance, makes it impossible to use
some of the ASYNC compression algorithms (H/W compression, etc.)
implementations.

Second, this can potentially trigger watchdogs.  For example,
entry re-compression with secondary algorithms is performed
under entry spin-lock.  Given that we chain secondary
compression algorithms and that some of them can be configured
for best compression ratio (and worst compression speed) zram
can stay under spin-lock for quite some time.

Do not use per-entry spin-locks; instead convert the lock to an
atomic_t variable which open-codes a reader-writer type of lock.
This permits preemption from the slot-lock section and also reduces
the sizeof() of a zram entry when lockdep is enabled.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/block/zram/zram_drv.c | 126 ++++++++++++++++++++--------------
 drivers/block/zram/zram_drv.h |   6 +-
 2 files changed, 79 insertions(+), 53 deletions(-)

Comments

Hillf Danton Jan. 31, 2025, 11:41 a.m. UTC | #1
On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> Concurrent modifications of meta table entries is now handled
> by per-entry spin-lock.  This has a number of shortcomings.
> 
> First, this imposes atomic requirements on compression backends.
> zram can call both zcomp_compress() and zcomp_decompress() under
> entry spin-lock, which implies that we can use only compression
> algorithms that don't schedule/sleep/wait during compression and
> decompression.  This, for instance, makes it impossible to use
> some of the ASYNC compression algorithms (H/W compression, etc.)
> implementations.
> 
> Second, this can potentially trigger watchdogs.  For example,
> entry re-compression with secondary algorithms is performed
> under entry spin-lock.  Given that we chain secondary
> compression algorithms and that some of them can be configured
> for best compression ratio (and worst compression speed) zram
> can stay under spin-lock for quite some time.
> 
> Do not use per-entry spin-locks and instead convert it to an
> atomic_t variable which open codes reader-writer type of lock.
> This permits preemption from slot_lock section, also reduces
> the sizeof() zram entry when lockdep is enabled.
> 
Nope, the price of cut in size will be paid by extra hours in debugging,
given nothing is free.

> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  drivers/block/zram/zram_drv.c | 126 ++++++++++++++++++++--------------
>  drivers/block/zram/zram_drv.h |   6 +-
>  2 files changed, 79 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 9f5020b077c5..1c2df2341704 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -58,19 +58,50 @@ static void zram_free_page(struct zram *zram, size_t index);
>  static int zram_read_from_zspool(struct zram *zram, struct page *page,
>  				 u32 index);
>  
> -static int zram_slot_trylock(struct zram *zram, u32 index)
> +static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
>  {
> -	return spin_trylock(&zram->table[index].lock);
> +	atomic_t *lock = &zram->table[index].lock;
> +	int old = ZRAM_ENTRY_UNLOCKED;
> +
> +	return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
> +}
> +
> +static void zram_slot_write_lock(struct zram *zram, u32 index)
> +{
> +	atomic_t *lock = &zram->table[index].lock;
> +	int old = atomic_read(lock);
> +
> +	do {
> +		if (old != ZRAM_ENTRY_UNLOCKED) {
> +			cond_resched();
> +			old = atomic_read(lock);
> +			continue;
> +		}
> +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> +}
> +
> +static void zram_slot_write_unlock(struct zram *zram, u32 index)
> +{
> +	atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
>  }
>  
> -static void zram_slot_lock(struct zram *zram, u32 index)
> +static void zram_slot_read_lock(struct zram *zram, u32 index)
>  {
> -	spin_lock(&zram->table[index].lock);
> +	atomic_t *lock = &zram->table[index].lock;
> +	int old = atomic_read(lock);
> +
> +	do {
> +		if (old == ZRAM_ENTRY_WRLOCKED) {
> +			cond_resched();
> +			old = atomic_read(lock);
> +			continue;
> +		}
> +	} while (!atomic_try_cmpxchg(lock, &old, old + 1));
>  }
>  
> -static void zram_slot_unlock(struct zram *zram, u32 index)
> +static void zram_slot_read_unlock(struct zram *zram, u32 index)
>  {
> -	spin_unlock(&zram->table[index].lock);
> +	atomic_dec(&zram->table[index].lock);
>  }
>  
Given no boundaries of locking section marked in addition to lockdep, 
this is another usual case of inventing lock in 2025.

What sense could be made by exercising molar tooth extraction in the
kitchen because of pain after downing a pint of vodka instead of
directly driving to see your dentist?
Andrew Morton Jan. 31, 2025, 10:55 p.m. UTC | #2
On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> +static void zram_slot_write_lock(struct zram *zram, u32 index)
> +{
> +	atomic_t *lock = &zram->table[index].lock;
> +	int old = atomic_read(lock);
> +
> +	do {
> +		if (old != ZRAM_ENTRY_UNLOCKED) {
> +			cond_resched();
> +			old = atomic_read(lock);
> +			continue;
> +		}
> +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> +}

I expect that if the calling userspace process has realtime policy (eg
SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
and this becomes a busy loop.  And if the machine is single-CPU, the
loop is infinite.

I do agree that for inventing new locking schemes, the bar is set
really high.
Sergey Senozhatsky Feb. 3, 2025, 3:21 a.m. UTC | #3
On (25/01/31 19:41), Hillf Danton wrote:
> On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> > Concurrent modifications of meta table entries is now handled
> > by per-entry spin-lock.  This has a number of shortcomings.
> > 
> > First, this imposes atomic requirements on compression backends.
> > zram can call both zcomp_compress() and zcomp_decompress() under
> > entry spin-lock, which implies that we can use only compression
> > algorithms that don't schedule/sleep/wait during compression and
> > decompression.  This, for instance, makes it impossible to use
> > some of the ASYNC compression algorithms (H/W compression, etc.)
> > implementations.
> > 
> > Second, this can potentially trigger watchdogs.  For example,
> > entry re-compression with secondary algorithms is performed
> > under entry spin-lock.  Given that we chain secondary
> > compression algorithms and that some of them can be configured
> > for best compression ratio (and worst compression speed) zram
> > can stay under spin-lock for quite some time.
> > 
> > Do not use per-entry spin-locks and instead convert it to an
> > atomic_t variable which open codes reader-writer type of lock.
> > This permits preemption from slot_lock section, also reduces
> > the sizeof() zram entry when lockdep is enabled.
> > 
> Nope, the price of cut in size will be paid by extra hours in debugging,
> given nothing is free.

This has been a bit-spin-lock basically forever, until late last
year when it was switched to a spinlock, for reasons unrelated
to debugging (as far as I understand it).  See 9518e5bfaae19 (zram:
Replace bit spinlocks with a spinlock_t).

> > -static void zram_slot_unlock(struct zram *zram, u32 index)
> > +static void zram_slot_read_unlock(struct zram *zram, u32 index)
> >  {
> > -	spin_unlock(&zram->table[index].lock);
> > +	atomic_dec(&zram->table[index].lock);
> >  }
> >  
> Given no boundaries of locking section marked in addition to lockdep, 
> this is another usual case of inventing lock in 2025.

So zram entry has been memory-saving driver, pretty much always, and not
debug-ability driver, I'm afraid.

Would lockdep additional keep the dentist away? (so to speak)

---
 drivers/block/zram/zram_drv.c         |  38 ++++++++++++++++++++++++--
 drivers/block/zram/zram_drv.h         |   3 ++
 scripts/selinux/genheaders/genheaders | Bin 90112 -> 0 bytes
 3 files changed, 39 insertions(+), 2 deletions(-)
 delete mode 100755 scripts/selinux/genheaders/genheaders

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f85502ae7dce..165b50927d13 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -59,12 +59,30 @@ static void zram_free_page(struct zram *zram, size_t index);
 static int zram_read_from_zspool(struct zram *zram, struct page *page,
 				 u32 index);
 
+static void zram_slot_lock_init(struct zram *zram, u32 index)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key key;
+
+	lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry", &key,
+			 0);
+#endif
+
+	atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+}
+
 static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
 {
 	atomic_t *lock = &zram->table[index].lock;
 	int old = ZRAM_ENTRY_UNLOCKED;
 
-	return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
+	if (atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+		rwsem_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
+		return true;
+	}
+	return false;
 }
 
 static void zram_slot_write_lock(struct zram *zram, u32 index)
@@ -79,11 +97,19 @@ static void zram_slot_write_lock(struct zram *zram, u32 index)
 			continue;
 		}
 	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	rwsem_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
 }
 
 static void zram_slot_write_unlock(struct zram *zram, u32 index)
 {
 	atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	rwsem_release(&zram->table[index].lockdep_map, _RET_IP_);
+#endif
 }
 
 static void zram_slot_read_lock(struct zram *zram, u32 index)
@@ -98,11 +124,19 @@ static void zram_slot_read_lock(struct zram *zram, u32 index)
 			continue;
 		}
 	} while (!atomic_try_cmpxchg(lock, &old, old + 1));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	rwsem_acquire_read(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
 }
 
 static void zram_slot_read_unlock(struct zram *zram, u32 index)
 {
 	atomic_dec(&zram->table[index].lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	rwsem_release(&zram->table[index].lockdep_map, _RET_IP_);
+#endif
 }
 
 static inline bool init_done(struct zram *zram)
@@ -1482,7 +1516,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		huge_class_size = zs_huge_class_size(zram->mem_pool);
 
 	for (index = 0; index < num_pages; index++)
-		atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+		zram_slot_lock_init(zram, index);
 
 	return true;
 }
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 219d405fc26e..86d1e412f9c7 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -64,6 +64,9 @@ struct zram_table_entry {
 	unsigned long handle;
 	unsigned int flags;
 	atomic_t lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map lockdep_map;
+#endif
 #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	ktime_t ac_time;
 #endif
diff --git a/scripts/selinux/genheaders/genheaders b/scripts/selinux/genheaders/genheaders
deleted file mode 100755
index 3fc32a664a7930b12a38d02449aec78d49690dfe..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

Sergey Senozhatsky Feb. 3, 2025, 3:26 a.m. UTC | #4
On (25/01/31 14:55), Andrew Morton wrote:
> > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > +{
> > +	atomic_t *lock = &zram->table[index].lock;
> > +	int old = atomic_read(lock);
> > +
> > +	do {
> > +		if (old != ZRAM_ENTRY_UNLOCKED) {
> > +			cond_resched();
> > +			old = atomic_read(lock);
> > +			continue;
> > +		}
> > +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > +}
> 
> I expect that if the calling userspace process has realtime policy (eg
> SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> and this becomes a busy loop.  And if the machine is single-CPU, the
> loop is infinite.

So for that scenario to happen zram needs to see two writes() to the same
index (page) simultaneously?  Or read() and write() on the same index (page)
concurrently?
Sergey Senozhatsky Feb. 3, 2025, 3:52 a.m. UTC | #5
On (25/02/03 12:21), Sergey Senozhatsky wrote:
> So zram entry has been memory-saving driver, pretty much always, and not
> debug-ability driver, I'm afraid.

s/driver/driven/

> 
> Would lockdep additional keep the dentist away? (so to speak)

                addition

[..]
> diff --git a/scripts/selinux/genheaders/genheaders b/scripts/selinux/genheaders/genheaders
> deleted file mode 100755
> index 3fc32a664a7930b12a38d02449aec78d49690dfe..0000000000000000000000000000000000000000
> GIT binary patch
> literal 0
> HcmV?d00001

Sorry, that was not meant to be there; no idea what it is, but I get
that file after every `make` now.
Sergey Senozhatsky Feb. 3, 2025, 7:11 a.m. UTC | #6
On (25/02/03 12:26), Sergey Senozhatsky wrote:
> On (25/01/31 14:55), Andrew Morton wrote:
> > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > +{
> > > +	atomic_t *lock = &zram->table[index].lock;
> > > +	int old = atomic_read(lock);
> > > +
> > > +	do {
> > > +		if (old != ZRAM_ENTRY_UNLOCKED) {
> > > +			cond_resched();
> > > +			old = atomic_read(lock);
> > > +			continue;
> > > +		}
> > > +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > +}
> > 
> > I expect that if the calling userspace process has realtime policy (eg
> > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > and this becomes a busy loop.  And if the machine is single-CPU, the
> > loop is infinite.
> 
> So for that scenario to happen zram needs to see two writes() to the same
> index (page) simultaneously?  Or read() and write() on the same index (page)
> concurrently?

Just to put more details:

1) zram always works with only one particular zram entry index, which
   is provided by an upper layer (e.g. bi_sector)

2) for read

 read()
  zram_read(page, index)
   rlock entry[index]
   decompress entry zshandle page
   runlock entry[index]

   for write

 write()
  zram_write(page, index)
   len = compress page obj
   handle = zsmalloc len
   wlock entry[index]
   entry.handle = handle
   entry.len = len
   wunlock entry[index]

3) at no point zram locks more than entry index
  a) there is no entry cross-locking (entries are not hierarchical)
  b) there is no entry lock nesting (including recursion)


I guess where we actually need zram entry lock is writeback and
recompression.  Writeback moves object from zsmalloc pool to actual
physical storage, freeing zsmalloc memory after that and setting
zram entry[index] handle to the backing device's block idx, which
needs synchronization.  Recompression does a similar thing, it frees
the old zsmalloc handle and stores recompressed objects under new
zsmalloc handle, it thus updates zram entry[index] handle to point
to the new location, which needs to be synchronized.
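
To make the writeback case above concrete, here is a minimal sketch of how
such a handle switch could look under the entry write-lock from this series.
zram_slot_write_lock()/zram_slot_write_unlock() and zram_set_handle() are the
helpers introduced by the patch, while zram_get_handle(), zram_set_flag() and
the ZRAM_WB/blk_idx details are assumptions standing in for the real
writeback plumbing, not taken verbatim from it:

static void zram_writeback_slot_sketch(struct zram *zram, u32 index,
                                       unsigned long blk_idx)
{
        /*
         * Illustration only: the real writeback path also submits the bio,
         * manages the backing-device bitmap and handles errors.  The point
         * is that switching entry[index] from a zsmalloc handle to a
         * backing-device block index happens under the entry write-lock.
         */
        zram_slot_write_lock(zram, index);
        /* drop the compressed object from the zsmalloc pool */
        zs_free(zram->mem_pool, zram_get_handle(zram, index));
        /* repoint the entry at the backing-device block */
        zram_set_flag(zram, index, ZRAM_WB);
        zram_set_handle(zram, index, blk_idx);
        zram_slot_write_unlock(zram, index);
}
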
Sergey Senozhatsky Feb. 3, 2025, 7:33 a.m. UTC | #7
On (25/02/03 16:11), Sergey Senozhatsky wrote:
> I guess where we actually need zram entry lock is writeback and
> recompression.  Writeback moves object from zsmalloc pool to actual
> physical storage, freeing zsmalloc memory after that and setting
> zram entry[index] handle to the backing device's block idx, which
> needs synchronization.  Recompression does a similar thing, it frees
> the old zsmalloc handle and stores recompressed objects under new
> zsmalloc handle, it thus updates zram entry[index] handle to point
> to the new location, which needs to be synchronized.

... Luckily there is a trivial solution

---

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 402b7b175863..dd7c5ae91cc0 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 config ZRAM
        tristate "Compressed RAM block device support"
-       depends on BLOCK && SYSFS && MMU
+       depends on BLOCK && SYSFS && MMU && !PREEMPT
        select ZSMALLOC
        help
          Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
Sergey Senozhatsky Feb. 3, 2025, 12:39 p.m. UTC | #8
On (25/02/03 12:21), Sergey Senozhatsky wrote:
> On (25/01/31 19:41), Hillf Danton wrote:
> > On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> > > Concurrent modifications of meta table entries is now handled
> > > by per-entry spin-lock.  This has a number of shortcomings.
> > > 
> > > First, this imposes atomic requirements on compression backends.
> > > zram can call both zcomp_compress() and zcomp_decompress() under
> > > entry spin-lock, which implies that we can use only compression
> > > algorithms that don't schedule/sleep/wait during compression and
> > > decompression.  This, for instance, makes it impossible to use
> > > some of the ASYNC compression algorithms (H/W compression, etc.)
> > > implementations.
> > > 
> > > Second, this can potentially trigger watchdogs.  For example,
> > > entry re-compression with secondary algorithms is performed
> > > under entry spin-lock.  Given that we chain secondary
> > > compression algorithms and that some of them can be configured
> > > for best compression ratio (and worst compression speed) zram
> > > can stay under spin-lock for quite some time.
> > > 
> > > Do not use per-entry spin-locks and instead convert it to an
> > > atomic_t variable which open codes reader-writer type of lock.
> > > This permits preemption from slot_lock section, also reduces
> > > the sizeof() zram entry when lockdep is enabled.
> > > 
> > Nope, the price of cut in size will be paid by extra hours in debugging,
> > given nothing is free.
> 
> This has been a bit-spin-lock basically forever, until late last
> year when it was switched to a spinlock, for reasons unrelated
> to debugging (as far as I understand it).  See 9518e5bfaae19 (zram:
> Replace bit spinlocks with a spinlock_t).

Just want to clarify a little:

That "also reduces sizeof()" thing was added last minute (I think before
sending v4 out) and it was not an intention of this patch.  I just recalled
that sizeof() zram entry under lockdep was brought by linux-rt folks when
they discussed the patch that converter zram entry bit-spinlock into a
spinlock, and then I just put that line.

> > > -static void zram_slot_unlock(struct zram *zram, u32 index)
> > > +static void zram_slot_read_unlock(struct zram *zram, u32 index)
> > >  {
> > > -	spin_unlock(&zram->table[index].lock);
> > > +	atomic_dec(&zram->table[index].lock);
> > >  }
> > >  
> > Given no boundaries of locking section marked in addition to lockdep, 
> > this is another usual case of inventing lock in 2025.
> 
> So zram entry has been memory-saving driver, pretty much always, and not
> debug-ability driver, I'm afraid.

Before the zram per-entry bit-spinlock there was a zram table rwlock that
protected all zram meta table entries.  And before that there was a
per-zram device rwsem that synchronized all operations and protected
the entire zram meta table, so zram was fully preemptible back then.
Kind of interesting: that's what I want it to be now again.
Andrew Morton Feb. 4, 2025, 12:19 a.m. UTC | #9
On Mon, 3 Feb 2025 12:26:12 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> On (25/01/31 14:55), Andrew Morton wrote:
> > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > +{
> > > +	atomic_t *lock = &zram->table[index].lock;
> > > +	int old = atomic_read(lock);
> > > +
> > > +	do {
> > > +		if (old != ZRAM_ENTRY_UNLOCKED) {
> > > +			cond_resched();
> > > +			old = atomic_read(lock);
> > > +			continue;
> > > +		}
> > > +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > +}
> > 
> > I expect that if the calling userspace process has realtime policy (eg
> > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > and this becomes a busy loop.  And if the machine is single-CPU, the
> > loop is infinite.
> 
> So for that scenario to happen zram needs to see two writes() to the same
> index (page) simultaneously?  Or read() and write() on the same index (page)
> concurrently?

Well, my point is that in the contended case, this "lock" operation can
get stuck forever.  If there are no contended cases, we don't need a
lock!

And I don't see how disabling the feature if PREEMPT=y will avoid this
situation.  cond_resched() won't schedule away from a realtime task to
a non-realtime one - a policy which isn't related to preemption.
Sergey Senozhatsky Feb. 4, 2025, 4:22 a.m. UTC | #10
On (25/02/03 16:19), Andrew Morton wrote:
> > On (25/01/31 14:55), Andrew Morton wrote:
> > > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > > +{
> > > > +	atomic_t *lock = &zram->table[index].lock;
> > > > +	int old = atomic_read(lock);
> > > > +
> > > > +	do {
> > > > +		if (old != ZRAM_ENTRY_UNLOCKED) {
> > > > +			cond_resched();
> > > > +			old = atomic_read(lock);
> > > > +			continue;
> > > > +		}
> > > > +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > > +}
> > > 
> > > I expect that if the calling userspace process has realtime policy (eg
> > > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > > and this becomes a busy loop.  And if the machine is single-CPU, the
> > > loop is infinite.
> > 
> > So for that scenario to happen zram needs to see two writes() to the same
> > index (page) simultaneously?  Or read() and write() on the same index (page)
> > concurrently?
> 
> Well, my point is that in the contended case, this "lock" operation can
> get stuck forever.  If there are no contended cases, we don't need a
> lock!

Let me see if I can come up with something, I don't have an awful
lot of ideas right now.

> And I don't see how disabling the feature if PREEMPT=y will avoid this

Oh, that was a silly joke: the series that enables preemption in zram
and zsmalloc ends up disabling PREEMPT.
Sergey Senozhatsky Feb. 6, 2025, 7:01 a.m. UTC | #11
Cc-ing Sebastian and Mike

On (25/01/31 14:55), Andrew Morton wrote:
> > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > +{
> > +	atomic_t *lock = &zram->table[index].lock;
> > +	int old = atomic_read(lock);
> > +
> > +	do {
> > +		if (old != ZRAM_ENTRY_UNLOCKED) {
> > +			cond_resched();
> > +			old = atomic_read(lock);
> > +			continue;
> > +		}
> > +	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > +}
> 
> I expect that if the calling userspace process has realtime policy (eg
> SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> and this becomes a busy loop.  And if the machine is single-CPU, the
> loop is infinite.
> 
> I do agree that for inventing new locking schemes, the bar is set
> really high.

So I completely reworked this bit and we don't have that problem
anymore, nor the problem of "inventing locking schemes in 2025".

In short - we are returning back to bit-locks, what zram has been using
until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t"),
not bit-spinlocks this time around (those won't work with linux-rt), but
wait_on_bit and friends.  Entry lock is exclusive, just like before,
but lock owner can sleep now, any task wishing to lock that same entry
will wait to be woken up by the current lock owner once it unlocks the
entry.  For cases when lock is taken from atomic context (e.g. slot-free
notification from softirq) we continue using TRY lock, which has been
introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
handling"), so there's nothing new here.  In addition I added some lockdep
annotations, just to be safe.

There shouldn't be too many tasks competing for the same entry.  I can
only think of cases when read/write (or slot-free notification if zram
is used as a swap device) vs. writeback or recompression (we cannot have
writeback and recompression simultaneously).

It currently looks like this:

---

struct zram_table_entry {
        unsigned long handle;
        unsigned long flags;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map lockdep_map;
#endif
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
        ktime_t ac_time;
#endif
};

/*
 * entry locking rules:
 *
 * 1) Lock is exclusive
 *
 * 2) lock() function can sleep waiting for the lock
 *
 * 3) Lock owner can sleep
 *
 * 4) Use TRY lock variant when in atomic context
 *    - must check return value and handle locking failures
 */
static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
{
        unsigned long *lock = &zram->table[index].flags;

        if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
                mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
#endif
                return true;
        }
        return false;
}

static void zram_slot_lock(struct zram *zram, u32 index)
{
        unsigned long *lock = &zram->table[index].flags;

        WARN_ON_ONCE(!preemptible());

        wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
#endif
}

static void zram_slot_unlock(struct zram *zram, u32 index)
{
        unsigned long *lock = &zram->table[index].flags;

        clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
#endif
}
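
As a usage sketch (not part of the proposal itself): the regular I/O path
takes the sleepable lock, while atomic contexts such as the slot-free
notification stick to the try-lock.  zram_read_from_zspool() and
zram_free_page() are the existing helpers declared earlier in the driver;
everything else is simplified.

/* regular (preemptible) path: may sleep while waiting for the entry */
static int zram_read_page_sketch(struct zram *zram, struct page *page,
                                 u32 index)
{
        int err;

        zram_slot_lock(zram, index);
        err = zram_read_from_zspool(zram, page, index);
        zram_slot_unlock(zram, index);

        return err;
}

/* atomic context (slot-free notification from softirq): must not sleep */
static void zram_slot_free_notify_sketch(struct zram *zram, u32 index)
{
        if (!zram_slot_try_lock(zram, index)) {
                /* locking failure must be handled, e.g. defer the free */
                return;
        }
        zram_free_page(zram, index);
        zram_slot_unlock(zram, index);
}
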
Sebastian Andrzej Siewior Feb. 6, 2025, 7:38 a.m. UTC | #12
On 2025-02-06 16:01:12 [+0900], Sergey Senozhatsky wrote:
> So I completely reworked this bit and we don't have that problem
> anymore, nor the problem of "inventing locking schemes in 2025".
> 
> In short - we are returning back to bit-locks, what zram has been using
> until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t),
> not bit-spinlock these time around, that won't work with linux-rt, but
> wait_on_bit and friends.  Entry lock is exclusive, just like before,
> but lock owner can sleep now, any task wishing to lock that same entry
> will wait to be woken up by the current lock owner once it unlocks the
> entry.  For cases when lock is taken from atomic context (e.g. slot-free
> notification from softirq) we continue using TRY lock, which has been
> introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
> handling"), so there's nothing new here.  In addition I added some lockdep
> annotations, just to be safe.
> 
> There shouldn't be too many tasks competing for the same entry.  I can
> only think of cases when read/write (or slot-free notification if zram
> is used as a swap device) vs. writeback or recompression (we cannot have
> writeback and recompression simultaneously).

So if I understand, you want to get back to bit spinlocks but sleeping
instead of polling. But why? Do you intend to have more locks per entry
so that you use the additional bits with the "lock"?

> It currently looks like this:
> > static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
> {
>         unsigned long *lock = &zram->table[index].flags;
> 
>         if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>                 mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> #endif
>                 return true;
>         }
>         return false;
> }

I hope the caller does not poll.

> static void zram_slot_lock(struct zram *zram, u32 index)
> {
>         unsigned long *lock = &zram->table[index].flags;
> 
>         WARN_ON_ONCE(!preemptible());

you want might_sleep() here instead. preemptible() works only on
preemptible kernels. And might_sleep() is already provided by
wait_on_bit_lock(). So this can go.

>         wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> #endif

I would argue that you want this before the wait_on_bit_lock() simply
because you want to see a possible deadlock before it happens.

> }
> 
> static void zram_slot_unlock(struct zram *zram, u32 index)
> {
>         unsigned long *lock = &zram->table[index].flags;
> 
>         clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
> #endif
Also before. So it complains about releasing a not-locked lock before it
happens.
> }

Sebastian
Sergey Senozhatsky Feb. 6, 2025, 7:47 a.m. UTC | #13
On (25/02/06 08:38), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 16:01:12 [+0900], Sergey Senozhatsky wrote:
> > So I completely reworked this bit and we don't have that problem
> > anymore, nor the problem of "inventing locking schemes in 2025".
> > 
> > In short - we are returning back to bit-locks, what zram has been using
> > until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t),
> > not bit-spinlock these time around, that won't work with linux-rt, but
> > wait_on_bit and friends.  Entry lock is exclusive, just like before,
> > but lock owner can sleep now, any task wishing to lock that same entry
> > will wait to be woken up by the current lock owner once it unlocks the
> > entry.  For cases when lock is taken from atomic context (e.g. slot-free
> > notification from softirq) we continue using TRY lock, which has been
> > introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
> > handling"), so there's nothing new here.  In addition I added some lockdep
> > annotations, just to be safe.
> > 
> > There shouldn't be too many tasks competing for the same entry.  I can
> > only think of cases when read/write (or slot-free notification if zram
> > is used as a swap device) vs. writeback or recompression (we cannot have
> > writeback and recompression simultaneously).
> 
> So if I understand, you want to get back to bit spinlocks but sleeping
> instead of polling. But why? Do you intend to have more locks per entry
> so that you use the additional bits with the "lock"?

zram is atomic right now, e.g.

zram_read()
	lock entry by index   # disables preemption
	map zsmalloc entry    # possibly memcpy
	decompress
	unmap zsmalloc
	unlock entry          # enables preemption

That's a pretty long time to keep preemption disabled (e.g. when using a slow
algorithm like zstd or deflate configured with high compression levels).
Apart from that, it is difficult to use async algorithms, which can
e.g. wait for the H/W to become available, or algorithms that might want
to allocate memory internally during compression/decompression (e.g.
zstd).

Entry lock is not the only lock in zram currently that makes it
atomic, just one of.

> > It currently looks like this:
> > 
> …
> > static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
> > {
> >         unsigned long *lock = &zram->table[index].flags;
> > 
> >         if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >                 mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> > #endif
> >                 return true;
> >         }
> >         return false;
> > }
> 
> I hope the caller does not poll.

Yeah, we don't do that in the code.

> > static void zram_slot_lock(struct zram *zram, u32 index)
> > {
> >         unsigned long *lock = &zram->table[index].flags;
> > 
> >         WARN_ON_ONCE(!preemptible());
> 
> you want might_sleep() here instead. preemptible() works only on
> preemptible kernels. And might_sleep() is already provided by
> wait_on_bit_lock(). So this can go.

wait_on_bit_lock() has might_sleep().

> >         wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >         mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> > #endif
> 
> I would argue that you want this before the wait_on_bit_lock() simply
> because you want to see a possible deadlock before it happens.

Ack.

> > static void zram_slot_unlock(struct zram *zram, u32 index)
> > {
> >         unsigned long *lock = &zram->table[index].flags;
> > 
> >         clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >         mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
> > #endif
> Also before. So it complains about release a not locked lock before it
> happens.

OK.
Sebastian Andrzej Siewior Feb. 6, 2025, 8:13 a.m. UTC | #14
On 2025-02-06 16:47:02 [+0900], Sergey Senozhatsky wrote:
> zram is atomic right now, e.g.
> 
> zram_read()
> 	lock entry by index   # disables preemption
> 	map zsmalloc entry    # possibly memcpy
> 	decompress
> 	unmap zsmalloc
> 	unlock entry          # enables preemption
> 
> That's a pretty long time to keep preemption disabled (e.g. using slow
> algorithm like zstd or deflate configured with high compression levels).
> Apart from that that, difficult to use async algorithms, which can
> e.g. wait for a H/W to become available, or algorithms that might want
> to allocate memory internally during compression/decompression, e.g.
> zstd).
> 
> Entry lock is not the only lock in zram currently that makes it
> atomic, just one of.

Okay. So there are requirements for the sleeping lock. A mutex doesn't
fit the requirement because it is too large, I guess.

> > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > {
> > >         unsigned long *lock = &zram->table[index].flags;
> > > 
> > >         WARN_ON_ONCE(!preemptible());
> > 
> > you want might_sleep() here instead. preemptible() works only on
> > preemptible kernels. And might_sleep() is already provided by
> > wait_on_bit_lock(). So this can go.
> 
> wait_on_bit_lock() has might_sleep().

My point exactly. This makes the WARN_ON_ONCE() obsolete.

Sebastian
Sergey Senozhatsky Feb. 6, 2025, 8:17 a.m. UTC | #15
On (25/02/06 09:13), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 16:47:02 [+0900], Sergey Senozhatsky wrote:
> > zram is atomic right now, e.g.
> > 
> > zram_read()
> > 	lock entry by index   # disables preemption
> > 	map zsmalloc entry    # possibly memcpy
> > 	decompress
> > 	unmap zsmalloc
> > 	unlock entry          # enables preemption
> > 
> > That's a pretty long time to keep preemption disabled (e.g. using slow
> > algorithm like zstd or deflate configured with high compression levels).
> > Apart from that that, difficult to use async algorithms, which can
> > e.g. wait for a H/W to become available, or algorithms that might want
> > to allocate memory internally during compression/decompression, e.g.
> > zstd).
> > 
> > Entry lock is not the only lock in zram currently that makes it
> > atomic, just one of.
> 
> Okay. So there are requirements for the sleeping lock. A mutex isn't
> fitting the requirement because it is too large I guess.

Correct.

> > > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > > {
> > > >         unsigned long *lock = &zram->table[index].flags;
> > > > 
> > > >         WARN_ON_ONCE(!preemptible());
> > > 
> > > you want might_sleep() here instead. preemptible() works only on
> > > preemptible kernels. And might_sleep() is already provided by
> > > wait_on_bit_lock(). So this can go.
> > 
> > wait_on_bit_lock() has might_sleep().
> 
> My point exactly. This makes the WARN_ON_ONCE() obsolete.

Right, might_sleep() can be disabled, as far as I understand,
via CONFIG_DEBUG_ATOMIC_SLEEP, unlike WARN_ON_ONCE().  But I
can drop it and then just rely on might_sleep(), should be
enough.
Sebastian Andrzej Siewior Feb. 6, 2025, 8:26 a.m. UTC | #16
On 2025-02-06 17:17:41 [+0900], Sergey Senozhatsky wrote:
> > Okay. So there are requirements for the sleeping lock. A mutex isn't
> > fitting the requirement because it is too large I guess.
> 
> Correct.

It would be nice to state why a generic locking implementation cannot
be used. From what I have seen it should play along with RT nicely.

> > > > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > > > {
> > > > >         unsigned long *lock = &zram->table[index].flags;
> > > > > 
> > > > >         WARN_ON_ONCE(!preemptible());
> > > > 
> > > > you want might_sleep() here instead. preemptible() works only on
> > > > preemptible kernels. And might_sleep() is already provided by
> > > > wait_on_bit_lock(). So this can go.
> > > 
> > > wait_on_bit_lock() has might_sleep().
> > 
> > My point exactly. This makes the WARN_ON_ONCE() obsolete.
> 
> Right, might_sleep() can be disabled, as far as I understand,
> via CONFIG_DEBUG_ATOMIC_SLEEP, unlike WARN_ON_ONCE().  But I
> can drop it and then just rely on might_sleep(), should be
> enough.

It should be enough. mutex_lock(), down() and so on rely solely on it.
As I said, preemptible() only works on preemptible kernels, where the
preemption counter is maintained, and on !preemptible kernels with
debugging enabled.

Sebastian
Sergey Senozhatsky Feb. 6, 2025, 8:29 a.m. UTC | #17
On (25/02/06 09:26), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 17:17:41 [+0900], Sergey Senozhatsky wrote:
> > > Okay. So there are requirements for the sleeping lock. A mutex isn't
> > > fitting the requirement because it is too large I guess.
> > 
> > Correct.
> 
> It would be nice to state why a generic locking implementation cannot
> be used. From what I have seen, it should play along with RT nicely.

Will do.

> > > > wait_on_bit_lock() has might_sleep().
> > > 
> > > My point exactly. This makes the WARN_ON_ONCE() obsolete.
> > 
> > Right, as far as I understand, might_sleep() can be compiled out
> > (it only warns when CONFIG_DEBUG_ATOMIC_SLEEP is enabled), unlike
> > WARN_ON_ONCE().  But I can drop it and then just rely on
> > might_sleep(); that should be enough.
> 
> It should be enough. mutex_lock(), down() and so on rely solely on it.
> As I said, preemptible() only works on preemptible kernels, where the
> preemption counter is available, and on !preemptible kernels with
> debugging enabled.

Ack.

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9f5020b077c5..1c2df2341704 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -58,19 +58,50 @@  static void zram_free_page(struct zram *zram, size_t index);
 static int zram_read_from_zspool(struct zram *zram, struct page *page,
 				 u32 index);
 
-static int zram_slot_trylock(struct zram *zram, u32 index)
+static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
 {
-	return spin_trylock(&zram->table[index].lock);
+	atomic_t *lock = &zram->table[index].lock;
+	int old = ZRAM_ENTRY_UNLOCKED;
+
+	return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
+}
+
+static void zram_slot_write_lock(struct zram *zram, u32 index)
+{
+	atomic_t *lock = &zram->table[index].lock;
+	int old = atomic_read(lock);
+
+	do {
+		if (old != ZRAM_ENTRY_UNLOCKED) {
+			cond_resched();
+			old = atomic_read(lock);
+			continue;
+		}
+	} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
+}
+
+static void zram_slot_write_unlock(struct zram *zram, u32 index)
+{
+	atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
 }
 
-static void zram_slot_lock(struct zram *zram, u32 index)
+static void zram_slot_read_lock(struct zram *zram, u32 index)
 {
-	spin_lock(&zram->table[index].lock);
+	atomic_t *lock = &zram->table[index].lock;
+	int old = atomic_read(lock);
+
+	do {
+		if (old == ZRAM_ENTRY_WRLOCKED) {
+			cond_resched();
+			old = atomic_read(lock);
+			continue;
+		}
+	} while (!atomic_try_cmpxchg(lock, &old, old + 1));
 }
 
-static void zram_slot_unlock(struct zram *zram, u32 index)
+static void zram_slot_read_unlock(struct zram *zram, u32 index)
 {
-	spin_unlock(&zram->table[index].lock);
+	atomic_dec(&zram->table[index].lock);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -93,7 +124,6 @@  static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
 	zram->table[index].handle = handle;
 }
 
-/* flag operations require table entry bit_spin_lock() being held */
 static bool zram_test_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
@@ -229,9 +259,9 @@  static void release_pp_slot(struct zram *zram, struct zram_pp_slot *pps)
 {
 	list_del_init(&pps->entry);
 
-	zram_slot_lock(zram, pps->index);
+	zram_slot_write_lock(zram, pps->index);
 	zram_clear_flag(zram, pps->index, ZRAM_PP_SLOT);
-	zram_slot_unlock(zram, pps->index);
+	zram_slot_write_unlock(zram, pps->index);
 
 	kfree(pps);
 }
@@ -394,11 +424,11 @@  static void mark_idle(struct zram *zram, ktime_t cutoff)
 		 *
 		 * And ZRAM_WB slots simply cannot be ZRAM_IDLE.
 		 */
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		if (!zram_allocated(zram, index) ||
 		    zram_test_flag(zram, index, ZRAM_WB) ||
 		    zram_test_flag(zram, index, ZRAM_SAME)) {
-			zram_slot_unlock(zram, index);
+			zram_slot_write_unlock(zram, index);
 			continue;
 		}
 
@@ -410,7 +440,7 @@  static void mark_idle(struct zram *zram, ktime_t cutoff)
 			zram_set_flag(zram, index, ZRAM_IDLE);
 		else
 			zram_clear_flag(zram, index, ZRAM_IDLE);
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 	}
 }
 
@@ -709,7 +739,7 @@  static int scan_slots_for_writeback(struct zram *zram, u32 mode,
 
 		INIT_LIST_HEAD(&pps->entry);
 
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		if (!zram_allocated(zram, index))
 			goto next;
 
@@ -731,7 +761,7 @@  static int scan_slots_for_writeback(struct zram *zram, u32 mode,
 		place_pp_slot(zram, ctl, pps);
 		pps = NULL;
 next:
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 	}
 
 	kfree(pps);
@@ -822,7 +852,7 @@  static ssize_t writeback_store(struct device *dev,
 		}
 
 		index = pps->index;
-		zram_slot_lock(zram, index);
+		zram_slot_read_lock(zram, index);
 		/*
 		 * scan_slots() sets ZRAM_PP_SLOT and relases slot lock, so
 		 * slots can change in the meantime. If slots are accessed or
@@ -833,7 +863,7 @@  static ssize_t writeback_store(struct device *dev,
 			goto next;
 		if (zram_read_from_zspool(zram, page, index))
 			goto next;
-		zram_slot_unlock(zram, index);
+		zram_slot_read_unlock(zram, index);
 
 		bio_init(&bio, zram->bdev, &bio_vec, 1,
 			 REQ_OP_WRITE | REQ_SYNC);
@@ -860,7 +890,7 @@  static ssize_t writeback_store(struct device *dev,
 		}
 
 		atomic64_inc(&zram->stats.bd_writes);
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		/*
 		 * Same as above, we release slot lock during writeback so
 		 * slot can change under us: slot_free() or slot_free() and
@@ -882,7 +912,7 @@  static ssize_t writeback_store(struct device *dev,
 			zram->bd_wb_limit -=  1UL << (PAGE_SHIFT - 12);
 		spin_unlock(&zram->wb_limit_lock);
 next:
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 		release_pp_slot(zram, pps);
 
 		cond_resched();
@@ -1001,7 +1031,7 @@  static ssize_t read_block_state(struct file *file, char __user *buf,
 	for (index = *ppos; index < nr_pages; index++) {
 		int copied;
 
-		zram_slot_lock(zram, index);
+		zram_slot_read_lock(zram, index);
 		if (!zram_allocated(zram, index))
 			goto next;
 
@@ -1019,13 +1049,13 @@  static ssize_t read_block_state(struct file *file, char __user *buf,
 				       ZRAM_INCOMPRESSIBLE) ? 'n' : '.');
 
 		if (count <= copied) {
-			zram_slot_unlock(zram, index);
+			zram_slot_read_unlock(zram, index);
 			break;
 		}
 		written += copied;
 		count -= copied;
 next:
-		zram_slot_unlock(zram, index);
+		zram_slot_read_unlock(zram, index);
 		*ppos += 1;
 	}
 
@@ -1473,15 +1503,11 @@  static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		huge_class_size = zs_huge_class_size(zram->mem_pool);
 
 	for (index = 0; index < num_pages; index++)
-		spin_lock_init(&zram->table[index].lock);
+		atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+
 	return true;
 }
 
-/*
- * To protect concurrent access to the same index entry,
- * caller should hold this table index entry's bit_spinlock to
- * indicate this index entry is accessing.
- */
 static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
@@ -1602,17 +1628,17 @@  static int zram_read_page(struct zram *zram, struct page *page, u32 index,
 {
 	int ret;
 
-	zram_slot_lock(zram, index);
+	zram_slot_read_lock(zram, index);
 	if (!zram_test_flag(zram, index, ZRAM_WB)) {
 		/* Slot should be locked through out the function call */
 		ret = zram_read_from_zspool(zram, page, index);
-		zram_slot_unlock(zram, index);
+		zram_slot_read_unlock(zram, index);
 	} else {
 		/*
 		 * The slot should be unlocked before reading from the backing
 		 * device.
 		 */
-		zram_slot_unlock(zram, index);
+		zram_slot_read_unlock(zram, index);
 
 		ret = read_from_bdev(zram, page, zram_get_handle(zram, index),
 				     parent);
@@ -1655,10 +1681,10 @@  static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
 static int write_same_filled_page(struct zram *zram, unsigned long fill,
 				  u32 index)
 {
-	zram_slot_lock(zram, index);
+	zram_slot_write_lock(zram, index);
 	zram_set_flag(zram, index, ZRAM_SAME);
 	zram_set_handle(zram, index, fill);
-	zram_slot_unlock(zram, index);
+	zram_slot_write_unlock(zram, index);
 
 	atomic64_inc(&zram->stats.same_pages);
 	atomic64_inc(&zram->stats.pages_stored);
@@ -1693,11 +1719,11 @@  static int write_incompressible_page(struct zram *zram, struct page *page,
 	kunmap_local(src);
 	zs_unmap_object(zram->mem_pool, handle);
 
-	zram_slot_lock(zram, index);
+	zram_slot_write_lock(zram, index);
 	zram_set_flag(zram, index, ZRAM_HUGE);
 	zram_set_handle(zram, index, handle);
 	zram_set_obj_size(zram, index, PAGE_SIZE);
-	zram_slot_unlock(zram, index);
+	zram_slot_write_unlock(zram, index);
 
 	atomic64_add(PAGE_SIZE, &zram->stats.compr_data_size);
 	atomic64_inc(&zram->stats.huge_pages);
@@ -1718,9 +1744,9 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	bool same_filled;
 
 	/* First, free memory allocated to this slot (if any) */
-	zram_slot_lock(zram, index);
+	zram_slot_write_lock(zram, index);
 	zram_free_page(zram, index);
-	zram_slot_unlock(zram, index);
+	zram_slot_write_unlock(zram, index);
 
 	mem = kmap_local_page(page);
 	same_filled = page_same_filled(mem, &element);
@@ -1790,10 +1816,10 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 	zs_unmap_object(zram->mem_pool, handle);
 
-	zram_slot_lock(zram, index);
+	zram_slot_write_lock(zram, index);
 	zram_set_handle(zram, index, handle);
 	zram_set_obj_size(zram, index, comp_len);
-	zram_slot_unlock(zram, index);
+	zram_slot_write_unlock(zram, index);
 
 	/* Update stats */
 	atomic64_inc(&zram->stats.pages_stored);
@@ -1850,7 +1876,7 @@  static int scan_slots_for_recompress(struct zram *zram, u32 mode,
 
 		INIT_LIST_HEAD(&pps->entry);
 
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		if (!zram_allocated(zram, index))
 			goto next;
 
@@ -1871,7 +1897,7 @@  static int scan_slots_for_recompress(struct zram *zram, u32 mode,
 		place_pp_slot(zram, ctl, pps);
 		pps = NULL;
 next:
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 	}
 
 	kfree(pps);
@@ -2162,7 +2188,7 @@  static ssize_t recompress_store(struct device *dev,
 		if (!num_recomp_pages)
 			break;
 
-		zram_slot_lock(zram, pps->index);
+		zram_slot_write_lock(zram, pps->index);
 		if (!zram_test_flag(zram, pps->index, ZRAM_PP_SLOT))
 			goto next;
 
@@ -2170,7 +2196,7 @@  static ssize_t recompress_store(struct device *dev,
 				      &num_recomp_pages, threshold,
 				      prio, prio_max);
 next:
-		zram_slot_unlock(zram, pps->index);
+		zram_slot_write_unlock(zram, pps->index);
 		release_pp_slot(zram, pps);
 
 		if (err) {
@@ -2217,9 +2243,9 @@  static void zram_bio_discard(struct zram *zram, struct bio *bio)
 	}
 
 	while (n >= PAGE_SIZE) {
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		zram_free_page(zram, index);
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 		atomic64_inc(&zram->stats.notify_free);
 		index++;
 		n -= PAGE_SIZE;
@@ -2248,9 +2274,9 @@  static void zram_bio_read(struct zram *zram, struct bio *bio)
 		}
 		flush_dcache_page(bv.bv_page);
 
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		zram_accessed(zram, index);
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 
 		bio_advance_iter_single(bio, &iter, bv.bv_len);
 	} while (iter.bi_size);
@@ -2278,9 +2304,9 @@  static void zram_bio_write(struct zram *zram, struct bio *bio)
 			break;
 		}
 
-		zram_slot_lock(zram, index);
+		zram_slot_write_lock(zram, index);
 		zram_accessed(zram, index);
-		zram_slot_unlock(zram, index);
+		zram_slot_write_unlock(zram, index);
 
 		bio_advance_iter_single(bio, &iter, bv.bv_len);
 	} while (iter.bi_size);
@@ -2321,13 +2347,13 @@  static void zram_slot_free_notify(struct block_device *bdev,
 	zram = bdev->bd_disk->private_data;
 
 	atomic64_inc(&zram->stats.notify_free);
-	if (!zram_slot_trylock(zram, index)) {
+	if (!zram_slot_try_write_lock(zram, index)) {
 		atomic64_inc(&zram->stats.miss_free);
 		return;
 	}
 
 	zram_free_page(zram, index);
-	zram_slot_unlock(zram, index);
+	zram_slot_write_unlock(zram, index);
 }
 
 static void zram_comp_params_reset(struct zram *zram)
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index db78d7c01b9a..e20538cdf565 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,7 +28,6 @@ 
 #define ZRAM_SECTOR_PER_LOGICAL_BLOCK	\
 	(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
 
-
 /*
  * ZRAM is mainly used for memory efficiency so we want to keep memory
  * footprint small and thus squeeze size and zram pageflags into a flags
@@ -58,13 +57,14 @@  enum zram_pageflags {
 	__NR_ZRAM_PAGEFLAGS,
 };
 
-/*-- Data structures */
+#define ZRAM_ENTRY_UNLOCKED	0
+#define ZRAM_ENTRY_WRLOCKED	(-1)
 
 /* Allocated for each disk page */
 struct zram_table_entry {
 	unsigned long handle;
 	unsigned int flags;
-	spinlock_t lock;
+	atomic_t lock;
 #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	ktime_t ac_time;
 #endif
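
To recap the encoding the patch introduces: table[index].lock holds
ZRAM_ENTRY_UNLOCKED (0) when free, ZRAM_ENTRY_WRLOCKED (-1) while
write-locked, and a positive reader count otherwise. A minimal usage
sketch of the new helpers (the critical-section bodies are placeholders
modelled on typical call sites in the patch):

	/* read side: multiple readers can hold the slot concurrently */
	zram_slot_read_lock(zram, index);
	/* e.g. zram_read_from_zspool(zram, page, index); */
	zram_slot_read_unlock(zram, index);

	/* write side: exclusive; loops with cond_resched() until the slot is free */
	zram_slot_write_lock(zram, index);
	/* e.g. zram_free_page(zram, index); */
	zram_slot_write_unlock(zram, index);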