
x86/mce: Dynamically size space for machine check records

Message ID Zd--PJp-NbXGrb39@agluck-desk3 (mailing list archive)
State New
Series x86/mce: Dynamically size space for machine check records

Commit Message

Luck, Tony Feb. 28, 2024, 11:14 p.m. UTC
Systems with a large number of CPUs may generate a large
number of machine check records when things go seriously
wrong. But Linux has a fixed buffer that can only capture
a few dozen errors.

Allocate space based on the number of CPUs (with a minimum
value based on the historical fixed buffer that could store
80 records).

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Discussion earlier concluded with the realization that it is
safe to dynamically allocate the mce_evt_pool at boot time.
So here's a patch to do that. Scaling algorithm here is a
simple linear "4 records per possible CPU" with a minimum
of 80 to match the legacy behavior. I'm open to other
suggestions.

Note that I threw in a "+1" to the return from ilog2() when
calling gen_pool_create(). From reading code, and running
some tests, it appears that the min_alloc_order argument
needs to be large enough to allocate one of the mce_evt_llist
structures.

Some other gen_pool users in Linux may also need this "+1".

 arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

Comments

Sohil Mehta Feb. 29, 2024, 12:39 a.m. UTC | #1
On 2/28/2024 3:14 PM, Tony Luck wrote:

>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;
>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>  
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>  
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);
> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> +	if (!mce_pool) {
> +		gen_pool_destroy(tmpp);
> +		goto out;
> +	}
> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>  	if (ret) {
>  		gen_pool_destroy(tmpp);

Is this missing a kfree(mce_pool) here?

>  		goto out;
Luck, Tony Feb. 29, 2024, 12:44 a.m. UTC | #2
>> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>>  	if (ret) {
>>  		gen_pool_destroy(tmpp);
>
> Is this missing a kfree(mce_pool) here?
>
>>  		goto out;

Yes. Will add.

Thanks

-Tony
Sohil Mehta Feb. 29, 2024, 1:56 a.m. UTC | #3
A few other nits.

On 2/28/2024 3:14 PM, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
>   * used to save error information organized in a lock-less list.
>   *
>   * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
>   */
> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES	80
>  
>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>  
>  /*
>   * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>  
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;

Should order also be declared on this line? That way we can have all the
uninitialized 'int's together.

>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>  
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;

I didn't exactly understand why a +1 is needed here. Do you have a
pointer to somewhere to help understand this?

Also, I think, a comment on top might be useful since this isn't obvious.

> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>  
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);

How about using MCE_MIN_ENTRIES here?

> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> +	if (!mce_pool) {
> +		gen_pool_destroy(tmpp);
> +		goto out;
> +	}
> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>  	if (ret) {
>  		gen_pool_destroy(tmpp);
>  		goto out;
Naik, Avadhut Feb. 29, 2024, 6:42 a.m. UTC | #4
Hi,

On 2/28/2024 17:14, Tony Luck wrote:
> Systems with a large number of CPUs may generate a large
> number of machine check records when things go seriously
> wrong. But Linux has a fixed buffer that can only capture
> a few dozen errors.
> 
> Allocate space based on the number of CPUs (with a minimum
> value based on the historical fixed buffer that could store
> 80 records).
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> Discussion earlier concluded with the realization that it is
> safe to dynamically allocate the mce_evt_pool at boot time.
> So here's a patch to do that. Scaling algorithm here is a
> simple linear "4 records per possible CPU" with a minimum
> of 80 to match the legacy behavior. I'm open to other
> suggestions.
> 
> Note that I threw in a "+1" to the return from ilog2() when
> calling gen_pool_create(). From reading code, and running
> some tests, it appears that the min_alloc_order argument
> needs to be large enough to allocate one of the mce_evt_llist
> structures.
> 
> Some other gen_pool users in Linux may also need this "+1".
> 

Somewhat confused here. Weren't we also exploring ways to avoid
duplicate records from being added to the genpool? Has something
changed in that regard?

>  arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
>   * used to save error information organized in a lock-less list.
>   *
>   * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
>   */
> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES	80
>  
>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>  
>  /*
>   * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>  
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;
>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>  
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>  
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);
> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);

To err on the side of caution, wouldn't kzalloc() be a safer choice here?
Borislav Petkov Feb. 29, 2024, 8:39 a.m. UTC | #5
On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

You can always send patches proposing how *you* think this duplicate
elimination should look and we can talk. :)

I don't think anyone would mind it if done properly but first you'd need
a real-life use case. As in, do we log sooo many duplicates such that
we'd want to dedup?

Hmmm.
Yazen Ghannam Feb. 29, 2024, 3:49 p.m. UTC | #6
On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> A few other nits.
> 
> On 2/28/2024 3:14 PM, Tony Luck wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..a1f0a8f29cf5 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -16,14 +16,13 @@
>>    * used to save error information organized in a lock-less list.
>>    *
>>    * This memory pool is only to be used to save MCE records in MCE context.
>> - * MCE events are rare, so a fixed size memory pool should be enough. Use
>> - * 2 pages to save MCE events for now (~80 MCE records at most).
>> + * MCE events are rare, so a fixed size memory pool should be enough.
>> + * Allocate on a sliding scale based on number of CPUs.
>>    */
>> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
>> +#define MCE_MIN_ENTRIES	80
>>   
>>   static struct gen_pool *mce_evt_pool;
>>   static LLIST_HEAD(mce_event_llist);
>> -static char gen_pool_buf[MCE_POOLSZ];
>>   
>>   /*
>>    * Compare the record "t" with each of the records on list "l" to see if
>> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>>   
>>   static int mce_gen_pool_create(void)
>>   {
>> +	int mce_numrecords, mce_poolsz;
> 
> Should order also be declared on this line? That way we can have all the
> uninitialized 'int's together.
> 
>>   	struct gen_pool *tmpp;
>>   	int ret = -ENOMEM;
>> +	void *mce_pool;
>> +	int order;
>>   
>> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
>> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> 
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
> 
> Also, I think, a comment on top might be useful since this isn't obvious.
> 

Would order_base_2() work here? It automatically rounds up to the next power.

Thanks,
Yazen
Luck, Tony Feb. 29, 2024, 5:21 p.m. UTC | #7
On Wed, Feb 28, 2024 at 05:56:26PM -0800, Sohil Mehta wrote:
> A few other nits.
> 
> On 2/28/2024 3:14 PM, Tony Luck wrote:
> > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> > index fbe8b61c3413..a1f0a8f29cf5 100644
> > --- a/arch/x86/kernel/cpu/mce/genpool.c
> > +++ b/arch/x86/kernel/cpu/mce/genpool.c
> > @@ -16,14 +16,13 @@
> >   * used to save error information organized in a lock-less list.
> >   *
> >   * This memory pool is only to be used to save MCE records in MCE context.
> > - * MCE events are rare, so a fixed size memory pool should be enough. Use
> > - * 2 pages to save MCE events for now (~80 MCE records at most).
> > + * MCE events are rare, so a fixed size memory pool should be enough.
> > + * Allocate on a sliding scale based on number of CPUs.
> >   */
> > -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> > +#define MCE_MIN_ENTRIES	80
> >  
> >  static struct gen_pool *mce_evt_pool;
> >  static LLIST_HEAD(mce_event_llist);
> > -static char gen_pool_buf[MCE_POOLSZ];
> >  
> >  /*
> >   * Compare the record "t" with each of the records on list "l" to see if
> > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
> >  
> >  static int mce_gen_pool_create(void)
> >  {
> > +	int mce_numrecords, mce_poolsz;
> 
> Should order also be declared on this line? That way we can have all the
> uninitialized 'int's together.

Sure. Even with the addition of "order" the line is still short enough.

> >  	struct gen_pool *tmpp;
> >  	int ret = -ENOMEM;
> > +	void *mce_pool;
> > +	int order;
> >  
> > -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> > +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> 
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
> 
> Also, I think, a comment on top might be useful since this isn't obvious.

gen_pool works in power-of-two blocks. The user must pick a specific
size with the gen_pool_create() call. Internally the gen_pool code uses
a bitmap to track which blocks in the pool are allocated/free.

Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
the original version of this code picks order 7 to allocate in 128 byte
units. But this means that every allocation of a mce_evt_llist will take
two 128-byte blocks.

Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
that two pages are enough for ~80 records was wrong when written. At
that point struct mce_evt_llist was below 128, so order was 6, and each
allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
results in 64 possible allocations.

But over time Intel and AMD added to the structure. So the current math
comes out at just 32 allocations before the pool is out of space.

Yazen provided the right answer for this. Change to use order_base_2()
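
To make the rounding concrete, here is a small userspace sketch (not kernel
code, it just mimics what ilog2() and order_base_2() compute) using the
136-byte struct size quoted above:

#include <stdio.h>

/* round down to the containing power of two, like the kernel's ilog2() */
static int ilog2_down(unsigned int v)
{
	int order = -1;

	while (v) {
		v >>= 1;
		order++;
	}
	return order;
}

/* round up to the covering power of two, like the kernel's order_base_2() */
static int order_up(unsigned int v)
{
	return v <= 1 ? 0 : ilog2_down(v - 1) + 1;
}

int main(void)
{
	unsigned int size = 136;	/* sizeof(struct mce_evt_llist) today */
	int down = ilog2_down(size);	/* 7: 128-byte blocks, two per record */
	int up = order_up(size);	/* 8: 256-byte blocks, one per record */

	printf("ilog2:        order %d, one record consumes %u pool bytes\n",
	       down, 2u << down);
	printf("order_base_2: order %d, one record consumes %u pool bytes\n",
	       up, 1u << up);
	return 0;
}

Either way a 136-byte record ends up consuming 256 bytes of pool (hence the
32 records in two pages), but rounding up means min_alloc_order matches what
one allocation really costs, so sizing the pool as records * (1 << order)
comes out right.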

> > +	tmpp = gen_pool_create(order, -1);
> >  	if (!tmpp)
> >  		goto out;
> >  
> > -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> > +	mce_numrecords = max(80, num_possible_cpus() * 4);
> 
> How about using MCE_MIN_ENTRIES here?

Oops. I meant to do that when I added the #define!

I've also added a "#define MCE_PER_CPU 4" instead of
a raw "4" in that expression.

-Tony
Luck, Tony Feb. 29, 2024, 5:22 p.m. UTC | #8
On Thu, Feb 29, 2024 at 10:49:18AM -0500, Yazen Ghannam wrote:
> On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> > > +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> > 
> > I didn't exactly understand why a +1 is needed here. Do you have a
> > pointer to somewhere to help understand this?
> > 
> > Also, I think, a comment on top might be useful since this isn't obvious.
> > 
> 
> Would order_base_2() work here? It automatically rounds up to the next power.

Yes!  That's exactly what is needed here. Thanks!

-Tony
Luck, Tony Feb. 29, 2024, 5:26 p.m. UTC | #9
On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Hi,
> 
> On 2/28/2024 17:14, Tony Luck wrote:
> > Systems with a large number of CPUs may generate a large
> > number of machine check records when things go seriously
> > wrong. But Linux has a fixed buffer that can only capture
> > a few dozen errors.
> > 
> > Allocate space based on the number of CPUs (with a minimum
> > value based on the historical fixed buffer that could store
> > 80 records).
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > 
> > Discussion earlier concluded with the realization that it is
> > safe to dynamically allocate the mce_evt_pool at boot time.
> > So here's a patch to do that. Scaling algorithm here is a
> > simple linear "4 records per possible CPU" with a minimum
> > of 80 to match the legacy behavior. I'm open to other
> > suggestions.
> > 
> > Note that I threw in a "+1" to the return from ilog2() when
> > calling gen_pool_create(). From reading code, and running
> > some tests, it appears that the min_alloc_order argument
> > needs to be large enough to allocate one of the mce_evt_llist
> > structures.
> > 
> > Some other gen_pool users in Linux may also need this "+1".
> > 
> 
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

I'm going to cover this in the reply to Boris.

> > +	mce_numrecords = max(80, num_possible_cpus() * 4);
> > +	mce_poolsz = mce_numrecords * (1 << order);
> > +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> 
> To err on the side of caution, wouldn't kzalloc() be a safer choice here?

Seems like too much caution. When mce_gen_pool_add() allocates
an entry from the pool it does:

	memcpy(&node->mce, mce, sizeof(*mce));
	llist_add(&node->llnode, &mce_event_llist);

Between them, those two lines write every field in the struct mce_evt_llist
(including any holes in the struct mce part of the structure).

-Tony
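
For context, here is a simplified sketch of the mce_gen_pool_add() path being
described above (error checks such as the pool-full warning are elided). The
freshly allocated node has only the llnode and mce members, and both are fully
written before the record is published on the list:

int mce_gen_pool_add(struct mce *mce)
{
	struct mce_evt_llist *node;

	node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
	if (!node)		/* this is where "MCE records pool full!" gets logged */
		return -ENOMEM;

	memcpy(&node->mce, mce, sizeof(*mce));		/* writes all of node->mce */
	llist_add(&node->llnode, &mce_event_llist);	/* writes node->llnode */

	return 0;
}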
Luck, Tony Feb. 29, 2024, 5:47 p.m. UTC | #10
On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> > Somewhat confused here. Weren't we also exploring ways to avoid
> > duplicate records from being added to the genpool? Has something
> > changed in that regard?
> 
> You can always send patches proposing how *you* think this duplicate
> elimination should look and we can talk. :)
> 
> I don't think anyone would mind it if done properly but first you'd need
> a real-life use case. As in, do we log sooo many duplicates such that
> we'd want to dedup?

There are definitely cases where dedup will not help. If a row fails in a
DIMM there will be a flood of correctable errors with different addresses
(depending on the number of channels in the interleave scheme for a system,
this may be dozens or hundreds of distinct addresses).

Same for other failures in structures like column and rank.

As to "real-life" use cases: a search on Lore for "MCE records pool
full!" only finds threads about modifications to this code. So the
general population of Linux developers isn't seeing this.

But a search in my internal e-mail box kicks up a dozen or so distinct
hits from internal validation teams in just the past year. But those
folks are super-dedicated to finding corner cases. Just this morning I
got a triumphant e-mail from someone who reproduced an issue "after 1.6
million error injections".

I'd bet that large cloud providers with systems numbered in the hundreds
of thousands see the MCE pool exhausted from time to time.

Open question is "How many error records do you need to diagnose the
cause of various floods of errors?"

I'm going to say that the current 32 (see earlier e-mail today to
Sohil https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ )
isn't enough for big systems. It may be hard to distinguish between
the various bulk fail modes with just that many error logs.

-Tony
Naik, Avadhut Feb. 29, 2024, 6:28 p.m. UTC | #11
On 2/29/2024 11:47, Tony Luck wrote:
> On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
>> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
>>> Somewhat confused here. Weren't we also exploring ways to avoid
>>> duplicate records from being added to the genpool? Has something
>>> changed in that regard?
>>
>> You can always send patches proposing how *you* think this duplicate
>> elimination should look and we can talk. :)
>>
>> I don't think anyone would mind it if done properly but first you'd need
>> a real-life use case. As in, do we log sooo many duplicates such that
>> we'd want to dedup?
> 
> There are definitely cases where dedup will not help. If a row fails in a
> DIMM there will be a flood of correctable errors with different addresses
> (depending on the number of channels in the interleave scheme for a system,
> this may be dozens or hundreds of distinct addresses).
> 
> Same for other failures in structures like column and rank.
> 

Wouldn't having dedup actually increase the time we spend in #MC context,
comparing the new MCE record against each existing record in the
genpool?

AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

In a flood situation, like the one described above, that is exactly
what may happen: An MCE coming in while the dedup mechanism is
underway (in #MC context).
Luck, Tony Feb. 29, 2024, 6:38 p.m. UTC | #12
> Wouldn't having dedup actually increase the time we spend in #MC context,
> comparing the new MCE record against each existing record in the
> genpool?

Yes, dedup would take extra time (increasing linearly with the number
of pending errors that were not filtered out by the dedup process).

> AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

Can't be nested on the same CPU. But multiple CPUs may take
a local machine check simultaneously. Local machine check is
opt-in on Intel; I believe it is the default on AMD.

Errors can also be signaled with CMCI.

> In a flood situation, like the one described above, that is exactly
> what may happen: An MCE coming in while the dedup mechanism is
> underway (in #MC context).

In a flood of errors it would be complicated to synchronize dedup filtering
on multiple CPUs. The trade-off between trying to get that code right,
and just allocating a few extra Kbytes of memory would seem to favor
allocating more memory.
Sohil Mehta Feb. 29, 2024, 11:56 p.m. UTC | #13
On 2/29/2024 9:21 AM, Tony Luck wrote:

> Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
> the original version of this code picks order 7 to allocate in 128 byte
> units. But this means that every allocation of a mce_evt_llist will take
> two 128-byte blocks.
> 
> Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
> that two pages are enough for ~80 records was wrong when written. At
> that point struct mce_evt_llist was below 128, so order was 6, and each
> allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
> results in 64 possible allocations.
> 

Thanks for the explanation. The part that got me is that I somehow
expected ilog2() to round-up and not round-down.

> But over time Intel and AMD added to the structure. So the current math
> comes out at just 32 allocations before the pool is out of space.
> 
> Yazen provided the right answer for this. Change to use order_base_2()
> 

Yes, I agree. order_base_2() is better than doing ilog2(struct size) +
1. In the rare scenario of the size exactly being a power of 2 we don't
need to add the +1.
Naik, Avadhut March 6, 2024, 9:52 p.m. UTC | #14
Hi,

On 2/28/2024 17:14, Tony Luck wrote:
> Systems with a large number of CPUs may generate a large
> number of machine check records when things go seriously
> wrong. But Linux has a fixed buffer that can only capture
> a few dozen errors.
> 
> Allocate space based on the number of CPUs (with a minimum
> value based on the historical fixed buffer that could store
> 80 records).
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> Discussion earlier concluded with the realization that it is
> safe to dynamically allocate the mce_evt_pool at boot time.
> So here's a patch to do that. Scaling algorithm here is a
> simple linear "4 records per possible CPU" with a minimum
> of 80 to match the legacy behavior. I'm open to other
> suggestions.
> 
> Note that I threw in a "+1" to the return from ilog2() when
> calling gen_pool_create(). From reading code, and running
> some tests, it appears that the min_alloc_order argument
> needs to be large enough to allocate one of the mce_evt_llist
> structures.
> 
> Some other gen_pool users in Linux may also need this "+1".
> 
>  arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
>   * used to save error information organized in a lock-less list.
>   *
>   * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
>   */
> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES	80
>  
>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>  
>  /*
>   * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>  
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;
>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>  
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>  
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);

Per Boris's below suggestion, shouldn't this be:
	mce_numrecords = max(80, num_possible_cpus() * 16);

>> 	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
> 
> max() ofc.
> 
>> There's a sane minimum and one page pro logical CPU should be fine on
>> pretty much every configuration...

4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
behavior you explained in the other subthread.

Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
issues.

> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> +	if (!mce_pool) {
> +		gen_pool_destroy(tmpp);
> +		goto out;
> +	}
> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>  	if (ret) {
>  		gen_pool_destroy(tmpp);
>  		goto out;
Luck, Tony March 6, 2024, 10:07 p.m. UTC | #15
> > +   mce_numrecords = max(80, num_possible_cpus() * 4);
>
> Per Boris's below suggestion, shouldn't this be:
>       mce_numrecords = max(80, num_possible_cpus() * 16);
>
> >>    min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
> >
> > max() ofc.
> >
> >> There's a sane minimum and one page pro logical CPU should be fine on
> >> pretty much every configuration...
>
> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
> behavior you explained in the other subthread.

Picking a good number of records-per-core may be more art than science. Boris
is right that a page per CPU shouldn't cause any significant issue to systems with
many CPUs, because they should have copious amounts of memory to make a
balanced configuration. But 16 records per CPU feels way too high to me. The
theoretical limit in a single scan of machine check banks on Intel is 32 (since
Intel never has more than 32 banks). But those banks cover diverse h/w devices
and it seems improbable that all, or even most, of them would log errors at the
same time, with all CPUs on all sockets doing the same.

After I posted the version with num_possible_cpus() * 4 I began to wonder whether
"2" would be enough.

> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
> issues.

Thanks very much for testing.

-Tony
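
For scale, here is a back-of-the-envelope sketch comparing the per-CPU counts
discussed in this thread, assuming 256-byte pool entries (order 8) and a
hypothetical 512 possible CPUs:

#include <stdio.h>

int main(void)
{
	unsigned int cpus = 512, entry = 256;	/* illustrative assumptions */
	unsigned int per_cpu[] = { 2, 4, 16 };

	for (int i = 0; i < 3; i++) {
		unsigned int records = per_cpu[i] * cpus;

		if (records < 80)	/* MCE_MIN_ENTRIES floor */
			records = 80;
		printf("%2u per CPU: %5u records, %4u KiB\n",
		       per_cpu[i], records, records * entry / 1024);
	}
	return 0;
}

That works out to 256 KiB, 512 KiB and 2048 KiB respectively, so even the
largest proposal is small change on a machine of that size.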
Naik, Avadhut March 6, 2024, 11:21 p.m. UTC | #16
On 3/6/2024 16:07, Luck, Tony wrote:
>>> +   mce_numrecords = max(80, num_possible_cpus() * 4);
>>
>> Per Boris's below suggestion, shouldn't this be:
>>       mce_numrecords = max(80, num_possible_cpus() * 16);
>>
>>>>    min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
>>>
>>> max() ofc.
>>>
>>>> There's a sane minimum and one page pro logical CPU should be fine on
>>>> pretty much every configuration...
>>
>> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
>> behavior you explained in the other subthread.
> 
> Picking a good number of records-per-core may be more art than science. Boris
> is right that a page per CPU shouldn't cause any significant issue to systems with
> many CPUs, because they should have copious amounts of memory to make a
> balanced configuration. But 16 records per CPU feels way too high to me. The
> theoretical limit in a single scan of machine check banks on Intel is 32 (since
> Intel never has more than 32 banks). But those banks cover diverse h/w devices
> and it seems improbable that all, or even most, of them would log errors at the
> same time, with all CPUs on all sockets doing the same.
> 
> After I posted the version with num_possible_cpus() * 4 I began to wonder whether
> "2" would be enough.
> 

I was thinking along the same lines, that 16 MCE records per thread might be too high.
But since Boris made the suggestion, I thought there might be a use case that I am
unaware of, perhaps some issue that had been debugged in the past. Hence my
earlier question about whether it should be 16 instead of 4.
I think 2 records should also be good. IIRC, the patch that I submitted reserved
space for 2 records per logical CPU in the genpool.

>> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
>> issues.
> 
> Thanks very much for testing.
> 
> -Tony
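
Pulling the review feedback together, here is a hedged sketch of how
mce_gen_pool_create() might look in a respin: order_base_2() instead of
ilog2()+1, the MCE_MIN_ENTRIES and MCE_PER_CPU defines, and the kfree()
that was missing on the gen_pool_add() failure path. This is illustrative
only, not the actual v2 patch:

#define MCE_MIN_ENTRIES	80
#define MCE_PER_CPU	4	/* per-CPU count still being debated above */

static int mce_gen_pool_create(void)
{
	int mce_numrecords, mce_poolsz, order;
	struct gen_pool *tmpp;
	int ret = -ENOMEM;
	void *mce_pool;

	/* round up so one struct mce_evt_llist fits in a single pool block */
	order = order_base_2(sizeof(struct mce_evt_llist));
	tmpp = gen_pool_create(order, -1);
	if (!tmpp)
		goto out;

	mce_numrecords = max(MCE_MIN_ENTRIES, num_possible_cpus() * MCE_PER_CPU);
	mce_poolsz = mce_numrecords * (1 << order);
	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
	if (!mce_pool) {
		gen_pool_destroy(tmpp);
		goto out;
	}
	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
	if (ret) {
		gen_pool_destroy(tmpp);
		kfree(mce_pool);	/* the leak spotted in review */
		goto out;
	}

	mce_evt_pool = tmpp;
out:
	return ret;
}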

Patch

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..a1f0a8f29cf5 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,13 @@ 
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
  */
-#define MCE_POOLSZ	(2 * PAGE_SIZE)
+#define MCE_MIN_ENTRIES	80
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];
 
 /*
  * Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +117,25 @@  int mce_gen_pool_add(struct mce *mce)
 
 static int mce_gen_pool_create(void)
 {
+	int mce_numrecords, mce_poolsz;
 	struct gen_pool *tmpp;
 	int ret = -ENOMEM;
+	void *mce_pool;
+	int order;
 
-	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
+	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
+	tmpp = gen_pool_create(order, -1);
 	if (!tmpp)
 		goto out;
 
-	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+	mce_numrecords = max(80, num_possible_cpus() * 4);
+	mce_poolsz = mce_numrecords * (1 << order);
+	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+	if (!mce_pool) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
 	if (ret) {
 		gen_pool_destroy(tmpp);
 		goto out;