diff mbox

[2/3] mm, x86: Remove region_is_ram() call from ioremap

Message ID 1434750245-6304-3-git-send-email-toshi.kani@hp.com (mailing list archive)
State Superseded
Headers show

Commit Message

Toshi Kani June 19, 2015, 9:44 p.m. UTC
__ioremap_caller() calls region_is_ram() to look up the resource
to check if a target range is RAM, which was added as an additinal
check to improve the lookup performance over page_is_ram() (commit
906e36c5c717 "x86: use optimized ioresource lookup in ioremap
function").

__ioremap_caller() then calls walk_system_ram_range(), which had
replaced page_is_ram() to improve the lookup performance (commit
c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").

Since both functions walk through the resource table, there is
no need to call the two functions.  Furthermore, region_is_ram()
has bugs and always returns with -1.  This makes
walk_system_ram_range() as the only check being used.

Hence, remove the call to region_is_ram() from __ioremap_caller().

Note, removing the call to region_is_ram() is also necessary
to fix the bugs in region_is_ram().  walk_system_ram_range()
requires RAM ranges aligned by the page size in the resource
table.  e820_reserve_setup_data() updates the e820 table by
allocating a separate entry to each data region in setup_data,
which is not page-aligned.  Therefore, walk_system_ram_range()
is unable to detect the RAM ranges in setup_data.  This
restriction has allowed multiple uses of ioremap() to map
setup_data.  Using fixed region_is_ram() will cause these callers
to start failing.  After all ioremap to setup_data are converted,
__ioremap_caller() may call region_is_ram() instead.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/mm/ioremap.c |   24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

Comments

Mike Travis June 22, 2015, 4:21 p.m. UTC | #1
On 6/19/2015 2:44 PM, Toshi Kani wrote:
> __ioremap_caller() calls region_is_ram() to look up the resource
> to check if a target range is RAM, which was added as an additinal
> check to improve the lookup performance over page_is_ram() (commit
> 906e36c5c717 "x86: use optimized ioresource lookup in ioremap
> function").
> 
> __ioremap_caller() then calls walk_system_ram_range(), which had
> replaced page_is_ram() to improve the lookup performance (commit
> c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").
> 
> Since both functions walk through the resource table, there is
> no need to call the two functions.  Furthermore, region_is_ram()
> has bugs and always returns with -1.  This makes
> walk_system_ram_range() as the only check being used.

Do you have an example of a failing case?  Also, I didn't know that
IOREMAP'd addresses were allowed to be on non-page boundaries?

Here's the comment and reason for the patches from Patch 0:

<<<
We have a large university system in the UK that is experiencing
very long delays modprobing the driver for a specific I/O device.
The delay is from 8-10 minutes per device and there are 31 devices
in the system.  This 4 to 5 hour delay in starting up those I/O
devices is very much a burden on the customer.
...
The problem was tracked down to a very slow IOREMAP operation and
the excessively long ioresource lookup to insure that the user is
not attempting to ioremap RAM.  These patches provide a speed up
to that function.
>>>

The speed up was pretty dramatic, I think to about 15-20 minutes
(the test was done by our local CS person in the UK).  I think this
would prove the function was working since it would have fallen
back to the previous page_is_ram function and the 4 to 5 hour
startup.

If there is a failure, it would be better for all to fix the specific
bug and not re-introduce the original problem.  Perhaps drop to
page is ram if the address is not page aligned?

> Hence, remove the call to region_is_ram() from __ioremap_caller().
> 
> Note, removing the call to region_is_ram() is also necessary
> to fix the bugs in region_is_ram().  walk_system_ram_range()
> requires RAM ranges aligned by the page size in the resource
> table.  e820_reserve_setup_data() updates the e820 table by
> allocating a separate entry to each data region in setup_data,
> which is not page-aligned.  Therefore, walk_system_ram_range()
> is unable to detect the RAM ranges in setup_data.  This
> restriction has allowed multiple uses of ioremap() to map
> setup_data.  Using fixed region_is_ram() will cause these callers
> to start failing.  After all ioremap to setup_data are converted,
> __ioremap_caller() may call region_is_ram() instead.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  arch/x86/mm/ioremap.c |   24 ++++++------------------
>  1 file changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 56f8af7..928867e 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -89,7 +89,6 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>  	pgprot_t prot;
>  	int retval;
>  	void __iomem *ret_addr;
> -	int ram_region;
>  
>  	/* Don't allow wraparound or zero size */
>  	last_addr = phys_addr + size - 1;
> @@ -112,26 +111,15 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>  	/*
>  	 * Don't allow anybody to remap normal RAM that we're using..
>  	 */
> -	/* First check if whole region can be identified as RAM or not */
> -	ram_region = region_is_ram(phys_addr, size);
> -	if (ram_region > 0) {
> -		WARN_ONCE(1, "ioremap on RAM at 0x%lx - 0x%lx\n",
> -				(unsigned long int)phys_addr,
> -				(unsigned long int)last_addr);
> -		return NULL;
> -	}
> -
> -	/* If could not be identified(-1), check page by page */
> -	if (ram_region < 0) {
> -		pfn      = phys_addr >> PAGE_SHIFT;
> -		last_pfn = last_addr >> PAGE_SHIFT;
> -		if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
> +	pfn      = phys_addr >> PAGE_SHIFT;
> +	last_pfn = last_addr >> PAGE_SHIFT;
> +	if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
>  					  __ioremap_check_ram) == 1) {
> -			WARN_ONCE(1, "ioremap on RAM at 0x%llx - 0x%llx\n",
> +		WARN_ONCE(1, "ioremap on RAM at 0x%llx - 0x%llx\n",
>  					phys_addr, last_addr);
> -			return NULL;
> -		}
> +		return NULL;
>  	}
> +
>  	/*
>  	 * Mappings have to be page-aligned
>  	 */
>
Toshi Kani June 22, 2015, 5:23 p.m. UTC | #2
On Mon, 2015-06-22 at 09:21 -0700, Mike Travis wrote:
> 
> On 6/19/2015 2:44 PM, Toshi Kani wrote:
> > __ioremap_caller() calls region_is_ram() to look up the resource
> > to check if a target range is RAM, which was added as an additinal
> > check to improve the lookup performance over page_is_ram() (commit
> > 906e36c5c717 "x86: use optimized ioresource lookup in ioremap
> > function").
> > 
> > __ioremap_caller() then calls walk_system_ram_range(), which had
> > replaced page_is_ram() to improve the lookup performance (commit
> > c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").
> > 
> > Since both functions walk through the resource table, there is
> > no need to call the two functions.  Furthermore, region_is_ram()
> > has bugs and always returns with -1.  This makes
> > walk_system_ram_range() as the only check being used.
> 
> Do you have an example of a failing case?  

Well, region_is_ram() fails with -1 for every case... This is because it
breaks the for-loop at 'if (p->end < start)' in the first entry (i.e.
the lowest address range) of the resource table.  For example, the first
entry of 'p->end' is 0xfff on my system, which is certainly smaller than
'start'.

  # cat /proc/iomem
  00000000-00000fff : reserved
  00001000-00092fff : System RAM
  00093000-00093fff : reserved
	:

> Also, I didn't know that
> IOREMAP'd addresses were allowed to be on non-page boundaries?

Yes, that is allowed.  Here is a comment in __ioremap_caller(). 

 * NOTE! We need to allow non-page-aligned mappings too: we will
obviously
 * have to convert them into an offset in a page-aligned mapping, but
the
 * caller shouldn't need to know that small detail.

> Here's the comment and reason for the patches from Patch 0:
> 
> <<<
> We have a large university system in the UK that is experiencing
> very long delays modprobing the driver for a specific I/O device.
> The delay is from 8-10 minutes per device and there are 31 devices
> in the system.  This 4 to 5 hour delay in starting up those I/O
> devices is very much a burden on the customer.
> ...
> The problem was tracked down to a very slow IOREMAP operation and
> the excessively long ioresource lookup to insure that the user is
> not attempting to ioremap RAM.  These patches provide a speed up
> to that function.
> >>>
> 
> The speed up was pretty dramatic, I think to about 15-20 minutes
> (the test was done by our local CS person in the UK).  I think this
> would prove the function was working since it would have fallen
> back to the previous page_is_ram function and the 4 to 5 hour
> startup.

I wonder how this test was conducted.  When the region_is_ram() change
got in, the kernel already had the other speed up change (c81c8a1eeede),
which had replaced page_is_ram().

> If there is a failure, it would be better for all to fix the specific
> bug and not re-introduce the original problem.  Perhaps drop to
> page is ram if the address is not page aligned?

walk_system_ram_range() is the one that has the issue with
page-unaligned address.  region_is_ram() after fixed by patch 3/3 does
not have this issue.  ioremap() does not call page_is_ram() anymore.  

pcibios_add_device(), ksysfs.c, kdebugfs.c (and possibly more) call
ioremap for setup_data, which is page-unaligned.  If we are going to
disallow such callers, they all need to be converted with a different
mapping interface, such as vmap().  But vmap() takes an array of page
pointers as an argument, and is not convenient for them to use.  Since
setup_data is a special range, if they need to be changed may be
arguable.  I think issue 3 described in patch 0/3 needs further
discussion.  For now, I'd like to fix issue 1 & 2.

Thanks,
-Toshi
Mike Travis June 22, 2015, 6:22 p.m. UTC | #3
On 6/22/2015 10:23 AM, Toshi Kani wrote:
> On Mon, 2015-06-22 at 09:21 -0700, Mike Travis wrote:
>>
>> On 6/19/2015 2:44 PM, Toshi Kani wrote:
>>> __ioremap_caller() calls region_is_ram() to look up the resource
>>> to check if a target range is RAM, which was added as an additinal
>>> check to improve the lookup performance over page_is_ram() (commit
>>> 906e36c5c717 "x86: use optimized ioresource lookup in ioremap
>>> function").
>>>
>>> __ioremap_caller() then calls walk_system_ram_range(), which had
>>> replaced page_is_ram() to improve the lookup performance (commit
>>> c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").
>>>
>>> Since both functions walk through the resource table, there is
>>> no need to call the two functions.  Furthermore, region_is_ram()
>>> has bugs and always returns with -1.  This makes
>>> walk_system_ram_range() as the only check being used.
>>
>> Do you have an example of a failing case?  
> 
> Well, region_is_ram() fails with -1 for every case... This is because it
> breaks the for-loop at 'if (p->end < start)' in the first entry (i.e.
> the lowest address range) of the resource table.  For example, the first
> entry of 'p->end' is 0xfff on my system, which is certainly smaller than
> 'start'.
> 
>   # cat /proc/iomem
>   00000000-00000fff : reserved
>   00001000-00092fff : System RAM
>   00093000-00093fff : reserved
> 	:

That is a small entry.  But I don't understand that when it
returned the -1, the page_is_ram function was not used instead?
Or am I missing something?

-	/* If could not be identified(-1), check page by page */
-	if (ram_region < 0) {
-		pfn      = phys_addr >> PAGE_SHIFT;
-		last_pfn = last_addr >> PAGE_SHIFT;
-		if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,

>> Also, I didn't know that
>> IOREMAP'd addresses were allowed to be on non-page boundaries?
> 
> Yes, that is allowed.  Here is a comment in __ioremap_caller(). 
> 
>  * NOTE! We need to allow non-page-aligned mappings too: we will
> obviously
>  * have to convert them into an offset in a page-aligned mapping, but
> the
>  * caller shouldn't need to know that small detail.

You're right, I forgot about that realignment.  The original
intent was to try to optimize and if that failed, fall back
to the original method.  I think this case would have been 
caught as well?

> 
>> Here's the comment and reason for the patches from Patch 0:
>>
>> <<<
>> We have a large university system in the UK that is experiencing
>> very long delays modprobing the driver for a specific I/O device.
>> The delay is from 8-10 minutes per device and there are 31 devices
>> in the system.  This 4 to 5 hour delay in starting up those I/O
>> devices is very much a burden on the customer.
>> ...
>> The problem was tracked down to a very slow IOREMAP operation and
>> the excessively long ioresource lookup to insure that the user is
>> not attempting to ioremap RAM.  These patches provide a speed up
>> to that function.
>>>>>
>>
>> The speed up was pretty dramatic, I think to about 15-20 minutes
>> (the test was done by our local CS person in the UK).  I think this
>> would prove the function was working since it would have fallen
>> back to the previous page_is_ram function and the 4 to 5 hour
>> startup.
> 
> I wonder how this test was conducted.  When the region_is_ram() change
> got in, the kernel already had the other speed up change (c81c8a1eeede),
> which had replaced page_is_ram().

The onsite system was not running the latest kernel (these large
system customers are very reluctant to disrupt their running
environments.)

> 
>> If there is a failure, it would be better for all to fix the specific
>> bug and not re-introduce the original problem.  Perhaps drop to
>> page is ram if the address is not page aligned?
> 
> walk_system_ram_range() is the one that has the issue with
> page-unaligned address.  region_is_ram() after fixed by patch 3/3 does
> not have this issue.  ioremap() does not call page_is_ram() anymore.

It appears ioremap does not call region_is_ram either.

-       /* First check if whole region can be identified as RAM or not */
-       ram_region = region_is_ram(phys_addr, size);
-       if (ram_region > 0) {
...
+       pfn      = phys_addr >> PAGE_SHIFT;
+       last_pfn = last_addr >> PAGE_SHIFT;
+       if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,

It appears that the walk_system_ram_range() patch does supersede
the region_is_ram patch.  Perhaps we can add a caveat to this
patch that you should not use this patch without also using
the patch from c81c8a1eeede?  Otherwise the excessive slowdown
in ioremap will be reintroduced?

(I'm thinking about back ports to distro kernels that customers use.)
 
> 
> pcibios_add_device(), ksysfs.c, kdebugfs.c (and possibly more) call
> ioremap for setup_data, which is page-unaligned.  If we are going to
> disallow such callers, they all need to be converted with a different
> mapping interface, such as vmap().  But vmap() takes an array of page
> pointers as an argument, and is not convenient for them to use.  Since
> setup_data is a special range, if they need to be changed may be
> arguable.  I think issue 3 described in patch 0/3 needs further
> discussion.  For now, I'd like to fix issue 1 & 2.

Since __ioremap_caller() was the only caller of region_is_ram, then
the function can be removed instead of being fixed.

Thanks,
Mike
Toshi Kani June 22, 2015, 7:06 p.m. UTC | #4
On Mon, 2015-06-22 at 11:22 -0700, Mike Travis wrote:
> 
> On 6/22/2015 10:23 AM, Toshi Kani wrote:
> > On Mon, 2015-06-22 at 09:21 -0700, Mike Travis wrote:
> >>
> >> On 6/19/2015 2:44 PM, Toshi Kani wrote:
> >>> __ioremap_caller() calls region_is_ram() to look up the resource
> >>> to check if a target range is RAM, which was added as an additinal
> >>> check to improve the lookup performance over page_is_ram() (commit
> >>> 906e36c5c717 "x86: use optimized ioresource lookup in ioremap
> >>> function").
> >>>
> >>> __ioremap_caller() then calls walk_system_ram_range(), which had
> >>> replaced page_is_ram() to improve the lookup performance (commit
> >>> c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").
> >>>
> >>> Since both functions walk through the resource table, there is
> >>> no need to call the two functions.  Furthermore, region_is_ram()
> >>> has bugs and always returns with -1.  This makes
> >>> walk_system_ram_range() as the only check being used.
> >>
> >> Do you have an example of a failing case?  
> > 
> > Well, region_is_ram() fails with -1 for every case... This is because it
> > breaks the for-loop at 'if (p->end < start)' in the first entry (i.e.
> > the lowest address range) of the resource table.  For example, the first
> > entry of 'p->end' is 0xfff on my system, which is certainly smaller than
> > 'start'.
> > 
> >   # cat /proc/iomem
> >   00000000-00000fff : reserved
> >   00001000-00092fff : System RAM
> >   00093000-00093fff : reserved
> > 	:
> 
> That is a small entry.  But I don't understand that when it
> returned the -1, the page_is_ram function was not used instead?
> Or am I missing something?
> 
> -	/* If could not be identified(-1), check page by page */
> -	if (ram_region < 0) {
> -		pfn      = phys_addr >> PAGE_SHIFT;
> -		last_pfn = last_addr >> PAGE_SHIFT;
> -		if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,

I am afraid that you are missing something...  After region_is_ram()
failed, __ioremap_call() calls walk_system_ram_range(), not
page_is_ram().  This patch 2/3 removes the call to region_is_ram(), and
calls walk_system_ram_range() unconditionally without checking
'ram_region', which is set by region_is_ram().  Please note that ioremap
does not call page_is_ram() any more.  It had been removed by
c81c8a1eeede.

> >> Also, I didn't know that
> >> IOREMAP'd addresses were allowed to be on non-page boundaries?
> > 
> > Yes, that is allowed.  Here is a comment in __ioremap_caller(). 
> > 
> >  * NOTE! We need to allow non-page-aligned mappings too: we will
> > obviously
> >  * have to convert them into an offset in a page-aligned mapping, but
> > the
> >  * caller shouldn't need to know that small detail.
> 
> You're right, I forgot about that realignment.  The original
> intent was to try to optimize and if that failed, fall back
> to the original method.  I think this case would have been 
> caught as well?

Yes, this case would have been caught if region_is_ram() would have
worked.
 
> >> Here's the comment and reason for the patches from Patch 0:
> >>
> >> <<<
> >> We have a large university system in the UK that is experiencing
> >> very long delays modprobing the driver for a specific I/O device.
> >> The delay is from 8-10 minutes per device and there are 31 devices
> >> in the system.  This 4 to 5 hour delay in starting up those I/O
> >> devices is very much a burden on the customer.
> >> ...
> >> The problem was tracked down to a very slow IOREMAP operation and
> >> the excessively long ioresource lookup to insure that the user is
> >> not attempting to ioremap RAM.  These patches provide a speed up
> >> to that function.
> >>>>>
> >>
> >> The speed up was pretty dramatic, I think to about 15-20 minutes
> >> (the test was done by our local CS person in the UK).  I think this
> >> would prove the function was working since it would have fallen
> >> back to the previous page_is_ram function and the 4 to 5 hour
> >> startup.
> > 
> > I wonder how this test was conducted.  When the region_is_ram() change
> > got in, the kernel already had the other speed up change (c81c8a1eeede),
> > which had replaced page_is_ram().
> 
> The onsite system was not running the latest kernel (these large
> system customers are very reluctant to disrupt their running
> environments.)

Then you had probably replaced page_is_ram() with region_is_ram() and
ignored the error in this test...
 
> >> If there is a failure, it would be better for all to fix the specific
> >> bug and not re-introduce the original problem.  Perhaps drop to
> >> page is ram if the address is not page aligned?
> > 
> > walk_system_ram_range() is the one that has the issue with
> > page-unaligned address.  region_is_ram() after fixed by patch 3/3 does
> > not have this issue.  ioremap() does not call page_is_ram() anymore.
> 
> It appears ioremap does not call region_is_ram either.
> 
> -       /* First check if whole region can be identified as RAM or not */
> -       ram_region = region_is_ram(phys_addr, size);
> -       if (ram_region > 0) {
> ...
> +       pfn      = phys_addr >> PAGE_SHIFT;
> +       last_pfn = last_addr >> PAGE_SHIFT;
> +       if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,

That's the case after this patch 2/3 is applied.  ioremap calls
region_is_ram() today.

> It appears that the walk_system_ram_range() patch does supersede
> the region_is_ram patch.  Perhaps we can add a caveat to this
> patch that you should not use this patch without also using
> the patch from c81c8a1eeede?  Otherwise the excessive slowdown
> in ioremap will be reintroduced?
> 
> (I'm thinking about back ports to distro kernels that customers use.)

This patch changes ioremap() to call walk_system_ram_range() only.
Since walk_system_ram_range() walks through the resource table, there is
no such slowdown issue.  In other words, there is no behavioral change
in this patch, except it'd be slightly faster.
 
> > pcibios_add_device(), ksysfs.c, kdebugfs.c (and possibly more) call
> > ioremap for setup_data, which is page-unaligned.  If we are going to
> > disallow such callers, they all need to be converted with a different
> > mapping interface, such as vmap().  But vmap() takes an array of page
> > pointers as an argument, and is not convenient for them to use.  Since
> > setup_data is a special range, if they need to be changed may be
> > arguable.  I think issue 3 described in patch 0/3 needs further
> > discussion.  For now, I'd like to fix issue 1 & 2.
> 
> Since __ioremap_caller() was the only caller of region_is_ram, then
> the function can be removed instead of being fixed.

Well, there is a new caller, memremap_valid(), in Dan's patchset.
https://lkml.org/lkml/2015/6/22/100

Thanks,
-Toshi
Ingo Molnar June 23, 2015, 9:01 a.m. UTC | #5
* Mike Travis <travis@sgi.com> wrote:

> <<<
> We have a large university system in the UK that is experiencing
> very long delays modprobing the driver for a specific I/O device.
> The delay is from 8-10 minutes per device and there are 31 devices
> in the system.  This 4 to 5 hour delay in starting up those I/O
> devices is very much a burden on the customer.
> ...
> The problem was tracked down to a very slow IOREMAP operation and
> the excessively long ioresource lookup to insure that the user is
> not attempting to ioremap RAM.  These patches provide a speed up
> to that function.
> >>>
> 
> The speed up was pretty dramatic, I think to about 15-20 minutes
> (the test was done by our local CS person in the UK).  I think this
> would prove the function was working since it would have fallen
> back to the previous page_is_ram function and the 4 to 5 hour
> startup.

Btw., I think even 15-20 minutes is still in the 'ridiculously slow' category.
Any chance to fix all of this properly, not just hack by hack?

Thanks,

	Ingo
Toshi Kani June 23, 2015, 3:19 p.m. UTC | #6
On Tue, 2015-06-23 at 11:01 +0200, Ingo Molnar wrote:
> * Mike Travis <travis@sgi.com> wrote:
> 
> > <<<
> > We have a large university system in the UK that is experiencing
> > very long delays modprobing the driver for a specific I/O device.
> > The delay is from 8-10 minutes per device and there are 31 devices
> > in the system.  This 4 to 5 hour delay in starting up those I/O
> > devices is very much a burden on the customer.
> > ...
> > The problem was tracked down to a very slow IOREMAP operation and
> > the excessively long ioresource lookup to insure that the user is
> > not attempting to ioremap RAM.  These patches provide a speed up
> > to that function.
> > >>>
> > 
> > The speed up was pretty dramatic, I think to about 15-20 minutes
> > (the test was done by our local CS person in the UK).  I think this
> > would prove the function was working since it would have fallen
> > back to the previous page_is_ram function and the 4 to 5 hour
> > startup.
> 
> Btw., I think even 15-20 minutes is still in the 'ridiculously slow' category.
> Any chance to fix all of this properly, not just hack by hack?

I agree that 15-20 minutes is till slow, but this slowness did not come
from this ioremap RAM check because region_is_ram() used in this test
had bugs, which led it return immediately.  The slowness came from other
places, such as page table initialization.

I do not think the number of the resource table entries increase
significantly on large systems.  So, walk_system_ram_range() and fixed
region_is_ram() should still perform fine on large systems.

Thanks,
-Toshi
Mike Travis June 23, 2015, 6:57 p.m. UTC | #7
On 6/23/2015 2:01 AM, Ingo Molnar wrote:
> 
> * Mike Travis <travis@sgi.com> wrote:
> 
>> <<<
>> We have a large university system in the UK that is experiencing
>> very long delays modprobing the driver for a specific I/O device.
>> The delay is from 8-10 minutes per device and there are 31 devices
>> in the system.  This 4 to 5 hour delay in starting up those I/O
>> devices is very much a burden on the customer.
>> ...
>> The problem was tracked down to a very slow IOREMAP operation and
>> the excessively long ioresource lookup to insure that the user is
>> not attempting to ioremap RAM.  These patches provide a speed up
>> to that function.
>>>>>
>>
>> The speed up was pretty dramatic, I think to about 15-20 minutes
>> (the test was done by our local CS person in the UK).  I think this
>> would prove the function was working since it would have fallen
>> back to the previous page_is_ram function and the 4 to 5 hour
>> startup.
> 
> Btw., I think even 15-20 minutes is still in the 'ridiculously slow' category.
> Any chance to fix all of this properly, not just hack by hack?
> 
> Thanks,
> 
> 	Ingo
> 


The current primary cause of the slow start up now lies within
the loading of the kernel and other software to 31 Co-processors
in a serial fashion.  We have suggested to the vendor that they
look at booting and starting these in parallel.

The problem is there are not a whole lot of systems that can
handle more than 4 of them let alone 32.  So it's mostly the
interaction between the customers and the vendor directing
these optimizations.

Any speed up of the kernel startup helps here as well.

[off topic]
Btw, this ~20 minutes time is just for the start up of the
co-processors.  The entire system takes much longer as this is
a huge UV system.  Most of the time is still due to memory
initialization.  Mel's "defer page init" patches help here
tremendously, though it's not clear they will trickle back
down to SLES11 which the customer is running.

Thanks,
Mike
diff mbox

Patch

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 56f8af7..928867e 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -89,7 +89,6 @@  static void __iomem *__ioremap_caller(resource_size_t phys_addr,
 	pgprot_t prot;
 	int retval;
 	void __iomem *ret_addr;
-	int ram_region;
 
 	/* Don't allow wraparound or zero size */
 	last_addr = phys_addr + size - 1;
@@ -112,26 +111,15 @@  static void __iomem *__ioremap_caller(resource_size_t phys_addr,
 	/*
 	 * Don't allow anybody to remap normal RAM that we're using..
 	 */
-	/* First check if whole region can be identified as RAM or not */
-	ram_region = region_is_ram(phys_addr, size);
-	if (ram_region > 0) {
-		WARN_ONCE(1, "ioremap on RAM at 0x%lx - 0x%lx\n",
-				(unsigned long int)phys_addr,
-				(unsigned long int)last_addr);
-		return NULL;
-	}
-
-	/* If could not be identified(-1), check page by page */
-	if (ram_region < 0) {
-		pfn      = phys_addr >> PAGE_SHIFT;
-		last_pfn = last_addr >> PAGE_SHIFT;
-		if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
+	pfn      = phys_addr >> PAGE_SHIFT;
+	last_pfn = last_addr >> PAGE_SHIFT;
+	if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
 					  __ioremap_check_ram) == 1) {
-			WARN_ONCE(1, "ioremap on RAM at 0x%llx - 0x%llx\n",
+		WARN_ONCE(1, "ioremap on RAM at 0x%llx - 0x%llx\n",
 					phys_addr, last_addr);
-			return NULL;
-		}
+		return NULL;
 	}
+
 	/*
 	 * Mappings have to be page-aligned
 	 */