diff mbox

[RFC] Reset PCIe devices to address DMA problem on kdump with iommu

Message ID 501BB4EF.7080909@jp.fujitsu.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Takao Indoh Aug. 3, 2012, 11:24 a.m. UTC
Hi all,

This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
devices at boot time to address DMA problem on kdump with iommu. When
this parameter is specified, a hot reset is triggered on each PCIe root
port and downstream port to reset its downstream endpoint.

Background:
A kdump problem about DMA has been discussed for a long time. That is,
when a kernel is switched to the kdump kernel DMA derived from first
kernel affects second kernel. Recently this problem surfaces when iommu
is used for PCI passthrough on KVM guest. In the case of the machine I
use, when intel_iommu=on is specified, DMAR error is detected in kdump
kernel and PCI SERR is also detected. Finally kdump fails because some
devices does not work correctly.

The root cause is that ongoing DMA from first kernel causes DMAR fault
because page table of DMAR is initialized while kdump kernel is booting
up. Therefore to address this problem DMA needs to be stopped before DMAR
is initialized at kdump kernel boot time. By this patch, PCIe devices
are reset by hot reset and its DMA is stopped when reset_pcie_devices is
specified. One problem of this solution is that VGA is reset and the
monitor blacks out when the link between the port and VGA controller was
reset. So this patch does not reset the port whose child endpoint is VGA
device.

Any comments would be appreciated.

Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    4 +
 drivers/pci/quirks.c                |   59 ++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Vivek Goyal Aug. 3, 2012, 11:46 a.m. UTC | #1
On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
> Hi all,
> 
> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
> devices at boot time to address DMA problem on kdump with iommu. When
> this parameter is specified, a hot reset is triggered on each PCIe root
> port and downstream port to reset its downstream endpoint.

Hi Takao,

Why not use existing "reset_devices" parameter instead of introducing
a new one?

Thanks
Vivek

> 
> Background:
> A kdump problem about DMA has been discussed for a long time. That is,
> when a kernel is switched to the kdump kernel DMA derived from first
> kernel affects second kernel. Recently this problem surfaces when iommu
> is used for PCI passthrough on KVM guest. In the case of the machine I
> use, when intel_iommu=on is specified, DMAR error is detected in kdump
> kernel and PCI SERR is also detected. Finally kdump fails because some
> devices does not work correctly.
> 
> The root cause is that ongoing DMA from first kernel causes DMAR fault
> because page table of DMAR is initialized while kdump kernel is booting
> up. Therefore to address this problem DMA needs to be stopped before DMAR
> is initialized at kdump kernel boot time. By this patch, PCIe devices
> are reset by hot reset and its DMA is stopped when reset_pcie_devices is
> specified. One problem of this solution is that VGA is reset and the
> monitor blacks out when the link between the port and VGA controller was
> reset. So this patch does not reset the port whose child endpoint is VGA
> device.
> 
> Any comments would be appreciated.
> 
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |    4 +
>  drivers/pci/quirks.c                |   59 ++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index e714a02..e694e9f 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2489,6 +2489,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  	reset_devices	[KNL] Force drivers to reset the underlying device
>  			during initialization.
>  
> +	reset_pcie_devices
> +			[PCIE] Reset PCIe endpoint at boot time by sending a
> +			hot reset to root port and downstream port
> +
>  	resume=		[SWSUSP]
>  			Specify the partition device for software suspend
>  			Format:
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 5155317..7f7fc02 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -32,6 +32,65 @@
>  #include "pci.h"
>  
>  /*
> + * Reset PCIe endpoint by sending a hot reset to root port and downstream port
> + */
> +unsigned int reset_pcie_devices;
> +EXPORT_SYMBOL(reset_pcie_devices);
> +static int __init set_reset_pcie_devices(char *str)
> +{
> +	reset_pcie_devices = 1;
> +	return 1;
> +}
> +__setup("reset_pcie_devices", set_reset_pcie_devices);
> +
> +static void __devinit quirk_pcie_device_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +	u16 ctrl;
> +
> +	if (!reset_pcie_devices || !pci_is_pcie(dev) || !dev->subordinate ||
> +	    ((dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT) &&
> +	     (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)))
> +		return;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		if ((child->pcie_type == PCI_EXP_TYPE_UPSTREAM) ||
> +		    (child->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +		/* Don't reset switch, bridge, VGA device */
> +		return;
> +	}
> +
> +	dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "save state\n");
> +		pci_save_state(child);
> +	}
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	msleep(200);
> +
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "restore state\n");
> +		pci_restore_state(child);
> +	}
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_pcie_device_reset);
> +
> +/*
>   * Decoding should be disabled for a PCI device during BAR sizing to avoid
>   * conflict. But doing so may cause problems on host bridge and perhaps other
>   * key system devices. For devices that need to have mmio decoding always-on,
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile Aug. 6, 2012, 4:09 a.m. UTC | #2
On 08/03/2012 07:24 AM, Takao Indoh wrote:
> Hi all,
> 
> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
> devices at boot time to address DMA problem on kdump with iommu. When
> this parameter is specified, a hot reset is triggered on each PCIe root
> port and downstream port to reset its downstream endpoint.
> 
> Background:
> A kdump problem about DMA has been discussed for a long time. That is,
> when a kernel is switched to the kdump kernel DMA derived from first
> kernel affects second kernel. Recently this problem surfaces when iommu
> is used for PCI passthrough on KVM guest. In the case of the machine I
> use, when intel_iommu=on is specified, DMAR error is detected in kdump
> kernel and PCI SERR is also detected. Finally kdump fails because some
> devices does not work correctly.
> 
> The root cause is that ongoing DMA from first kernel causes DMAR fault
> because page table of DMAR is initialized while kdump kernel is booting
> up. Therefore to address this problem DMA needs to be stopped before DMAR
> is initialized at kdump kernel boot time. By this patch, PCIe devices
> are reset by hot reset and its DMA is stopped when reset_pcie_devices is
> specified. One problem of this solution is that VGA is reset and the
> monitor blacks out when the link between the port and VGA controller was
> reset. So this patch does not reset the port whose child endpoint is VGA
> device.
> 
> Any comments would be appreciated.
> 
> Signed-off-by: Takao Indoh<indou.takao@jp.fujitsu.com>
> ---
Have you considered something less disruptive such as clearing the 
Master Enable in each endpoint's PCI Command Register ?
That should prevent DMA transactions from endpoints during the kdump and
kexec, and when the driver for the device gets reconfigured,
Master Enable will be set back on, but after the driver has had the
opportunity to do a device-specific reset.

- Don

>   Documentation/kernel-parameters.txt |    4 +
>   drivers/pci/quirks.c                |   59 ++++++++++++++++++++++++++
>   2 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index e714a02..e694e9f 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2489,6 +2489,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   	reset_devices	[KNL] Force drivers to reset the underlying device
>   			during initialization.
> 
> +	reset_pcie_devices
> +			[PCIE] Reset PCIe endpoint at boot time by sending a
> +			hot reset to root port and downstream port
> +
>   	resume=		[SWSUSP]
>   			Specify the partition device for software suspend
>   			Format:
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 5155317..7f7fc02 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -32,6 +32,65 @@
>   #include "pci.h"
> 
>   /*
> + * Reset PCIe endpoint by sending a hot reset to root port and downstream port
> + */
> +unsigned int reset_pcie_devices;
> +EXPORT_SYMBOL(reset_pcie_devices);
> +static int __init set_reset_pcie_devices(char *str)
> +{
> +	reset_pcie_devices = 1;
> +	return 1;
> +}
> +__setup("reset_pcie_devices", set_reset_pcie_devices);
> +
> +static void __devinit quirk_pcie_device_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +	u16 ctrl;
> +
> +	if (!reset_pcie_devices || !pci_is_pcie(dev) || !dev->subordinate ||
> +	    ((dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT)&&
> +	     (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)))
> +		return;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child,&subordinate->devices, bus_list) {
> +		if ((child->pcie_type == PCI_EXP_TYPE_UPSTREAM) ||
> +		    (child->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +		    ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
> +		/* Don't reset switch, bridge, VGA device */
> +		return;
> +	}
> +
> +	dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +	list_for_each_entry(child,&subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "save state\n");
> +		pci_save_state(child);
> +	}
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	msleep(200);
> +
> +	list_for_each_entry(child,&subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "restore state\n");
> +		pci_restore_state(child);
> +	}
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_pcie_device_reset);
> +
> +/*
>    * Decoding should be disabled for a PCI device during BAR sizing to avoid
>    * conflict. But doing so may cause problems on host bridge and perhaps other
>    * key system devices. For devices that need to have mmio decoding always-on,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Aug. 6, 2012, 4:30 a.m. UTC | #3
Hi Vivek,

(2012/08/03 20:46), Vivek Goyal wrote:
> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
>> Hi all,
>>
>> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
>> devices at boot time to address DMA problem on kdump with iommu. When
>> this parameter is specified, a hot reset is triggered on each PCIe root
>> port and downstream port to reset its downstream endpoint.
>
> Hi Takao,
>
> Why not use existing "reset_devices" parameter instead of introducing
> a new one?

"reset_devices" is used for each driver to reset their own device, and
this patch resets all devices forcibly, so I thought they were different
things.

Thanks,
Takao Indoh

>
> Thanks
> Vivek
>
>>
>> Background:
>> A kdump problem about DMA has been discussed for a long time. That is,
>> when a kernel is switched to the kdump kernel DMA derived from first
>> kernel affects second kernel. Recently this problem surfaces when iommu
>> is used for PCI passthrough on KVM guest. In the case of the machine I
>> use, when intel_iommu=on is specified, DMAR error is detected in kdump
>> kernel and PCI SERR is also detected. Finally kdump fails because some
>> devices does not work correctly.
>>
>> The root cause is that ongoing DMA from first kernel causes DMAR fault
>> because page table of DMAR is initialized while kdump kernel is booting
>> up. Therefore to address this problem DMA needs to be stopped before DMAR
>> is initialized at kdump kernel boot time. By this patch, PCIe devices
>> are reset by hot reset and its DMA is stopped when reset_pcie_devices is
>> specified. One problem of this solution is that VGA is reset and the
>> monitor blacks out when the link between the port and VGA controller was
>> reset. So this patch does not reset the port whose child endpoint is VGA
>> device.
>>
>> Any comments would be appreciated.
>>
>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>> ---
>>   Documentation/kernel-parameters.txt |    4 +
>>   drivers/pci/quirks.c                |   59 ++++++++++++++++++++++++++
>>   2 files changed, 63 insertions(+)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index e714a02..e694e9f 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2489,6 +2489,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>   	reset_devices	[KNL] Force drivers to reset the underlying device
>>   			during initialization.
>>
>> +	reset_pcie_devices
>> +			[PCIE] Reset PCIe endpoint at boot time by sending a
>> +			hot reset to root port and downstream port
>> +
>>   	resume=		[SWSUSP]
>>   			Specify the partition device for software suspend
>>   			Format:
>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> index 5155317..7f7fc02 100644
>> --- a/drivers/pci/quirks.c
>> +++ b/drivers/pci/quirks.c
>> @@ -32,6 +32,65 @@
>>   #include "pci.h"
>>
>>   /*
>> + * Reset PCIe endpoint by sending a hot reset to root port and downstream port
>> + */
>> +unsigned int reset_pcie_devices;
>> +EXPORT_SYMBOL(reset_pcie_devices);
>> +static int __init set_reset_pcie_devices(char *str)
>> +{
>> +	reset_pcie_devices = 1;
>> +	return 1;
>> +}
>> +__setup("reset_pcie_devices", set_reset_pcie_devices);
>> +
>> +static void __devinit quirk_pcie_device_reset(struct pci_dev *dev)
>> +{
>> +	struct pci_bus *subordinate;
>> +	struct pci_dev *child;
>> +	u16 ctrl;
>> +
>> +	if (!reset_pcie_devices || !pci_is_pcie(dev) || !dev->subordinate ||
>> +	    ((dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT) &&
>> +	     (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)))
>> +		return;
>> +
>> +	subordinate = dev->subordinate;
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		if ((child->pcie_type == PCI_EXP_TYPE_UPSTREAM) ||
>> +		    (child->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) ||
>> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> +		/* Don't reset switch, bridge, VGA device */
>> +		return;
>> +	}
>> +
>> +	dev_info(&dev->dev, "Reset Secondary bus\n");
>> +
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		dev_info(&child->dev, "save state\n");
>> +		pci_save_state(child);
>> +	}
>> +
>> +	/* Assert Secondary Bus Reset */
>> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	msleep(2);
>> +
>> +	/* De-assert Secondary Bus Reset */
>> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	msleep(200);
>> +
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		dev_info(&child->dev, "restore state\n");
>> +		pci_restore_state(child);
>> +	}
>> +}
>> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_pcie_device_reset);
>> +
>> +/*
>>    * Decoding should be disabled for a PCI device during BAR sizing to avoid
>>    * conflict. But doing so may cause problems on host bridge and perhaps other
>>    * key system devices. For devices that need to have mmio decoding always-on,
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal Aug. 6, 2012, 8:39 p.m. UTC | #4
On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
> Hi Vivek,
> 
> (2012/08/03 20:46), Vivek Goyal wrote:
> > On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
> >> Hi all,
> >>
> >> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
> >> devices at boot time to address DMA problem on kdump with iommu. When
> >> this parameter is specified, a hot reset is triggered on each PCIe root
> >> port and downstream port to reset its downstream endpoint.
> >
> > Hi Takao,
> >
> > Why not use existing "reset_devices" parameter instead of introducing
> > a new one?
> 
> "reset_devices" is used for each driver to reset their own device, and
> this patch resets all devices forcibly, so I thought they were different
> things.

Yes reset_devices currently is used for driver to reset its device. I
thought one could very well extend its reach to reset pci express devices
at bus level.

Having them separate is not going to be much useful from kdump
perspective. We will end up passing both reset_devices and
reset_pcie_devices to second kernel whill lead to bus level reset as well
as device level reset.

Ideal situation would be that somehow detect that bus level reset has been
done and skip device level reset (assuming bus level reset obviates the
need of device level reset, please correct me if that's not the case). 

After pcie reset, can we store the state in a variable and drivers can
use that variable to check if PCIe level reset was done or not. If yes,
skip device level reset (Assuming driver knows that device is on a
PCIe slot).

In that case we will not have to introduce new kernel command line, and
also avoid double reset?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Aug. 7, 2012, 9:02 a.m. UTC | #5
(2012/08/07 5:39), Vivek Goyal wrote:
> On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
>> Hi Vivek,
>>
>> (2012/08/03 20:46), Vivek Goyal wrote:
>>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
>>>> Hi all,
>>>>
>>>> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
>>>> devices at boot time to address DMA problem on kdump with iommu. When
>>>> this parameter is specified, a hot reset is triggered on each PCIe root
>>>> port and downstream port to reset its downstream endpoint.
>>>
>>> Hi Takao,
>>>
>>> Why not use existing "reset_devices" parameter instead of introducing
>>> a new one?
>>
>> "reset_devices" is used for each driver to reset their own device, and
>> this patch resets all devices forcibly, so I thought they were different
>> things.
>
> Yes reset_devices currently is used for driver to reset its device. I
> thought one could very well extend its reach to reset pci express devices
> at bus level.
>
> Having them separate is not going to be much useful from kdump
> perspective. We will end up passing both reset_devices and
> reset_pcie_devices to second kernel whill lead to bus level reset as well
> as device level reset.
>
> Ideal situation would be that somehow detect that bus level reset has been
> done and skip device level reset (assuming bus level reset obviates the
> need of device level reset, please correct me if that's not the case).
>
> After pcie reset, can we store the state in a variable and drivers can
> use that variable to check if PCIe level reset was done or not. If yes,
> skip device level reset (Assuming driver knows that device is on a
> PCIe slot).
>
> In that case we will not have to introduce new kernel command line, and
> also avoid double reset?

Actually I'm not sure whether the driver does not need to do their reset after
bus level reset, but I agree with you, now I'm thinking that using reset_devices
is better rather than adding narrow one which is limited to PCI express, otherwise
we may have to add new parameter every time when adding new reset method, such as
reset_pcie_devices, reset_pci_legacy_devices, etc.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 5, 2012, 11:09 a.m. UTC | #6
(2012/08/07 5:39), Vivek Goyal wrote:
> On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
>> Hi Vivek,
>>
>> (2012/08/03 20:46), Vivek Goyal wrote:
>>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
>>>> Hi all,
>>>>
>>>> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
>>>> devices at boot time to address DMA problem on kdump with iommu. When
>>>> this parameter is specified, a hot reset is triggered on each PCIe root
>>>> port and downstream port to reset its downstream endpoint.
>>>
>>> Hi Takao,
>>>
>>> Why not use existing "reset_devices" parameter instead of introducing
>>> a new one?
>>
>> "reset_devices" is used for each driver to reset their own device, and
>> this patch resets all devices forcibly, so I thought they were different
>> things.
>
> Yes reset_devices currently is used for driver to reset its device. I
> thought one could very well extend its reach to reset pci express devices
> at bus level.
>
> Having them separate is not going to be much useful from kdump
> perspective. We will end up passing both reset_devices and
> reset_pcie_devices to second kernel whill lead to bus level reset as well
> as device level reset.
>
> Ideal situation would be that somehow detect that bus level reset has been
> done and skip device level reset (assuming bus level reset obviates the
> need of device level reset, please correct me if that's not the case).
>
> After pcie reset, can we store the state in a variable and drivers can
> use that variable to check if PCIe level reset was done or not. If yes,
> skip device level reset (Assuming driver knows that device is on a
> PCIe slot).
>
> In that case we will not have to introduce new kernel command line, and
> also avoid double reset?
  
I found a problem when testing my patch on some machines.

Originally there are two problems in kdump kernel when iommu is enabled;
DMAR error and PCI SERR. I thought they are fixed by my patch, but I
noticed that PCI SERR is still detected after applying the patch. It
seems that something happens when Interrupt Remapping is initialized in
kdump kernel.

Therefore resetting devices has to be done before enable_IR() is
called. I have three ideas for it.

  (i) Resetting devices in 1st kernel(panic kernel)
  We can reset devices before jumping into 2nd kernel. Of course it may
  be dangerous to scan pci device tree and call PCI functions in panic'd
  kernel. Beforehand we need to collect device information so that only
  minimal code could run on panic.

  (ii) Resetting devices in purgatory
  It seems to be be appropriate place to do this, but I'm not sure
  where I can save/restore PCI config when resetting devices in
  purgatory.

  (iii) Resetting devices in 2nd kernel(kdump kernel)
  Important point is to do reset before enable_IR() is called as I wrote
  above. I think I should add new function to do reset into
  arch/x86/pci/early.c and call it in setup_arch like
  early_dump_pci_devices() or early_quirks().

Any comments?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenji Kaneshige Sept. 10, 2012, 2:34 a.m. UTC | #7
> -----Original Message-----
> From: linux-pci-owner@vger.kernel.org
> [mailto:linux-pci-owner@vger.kernel.org] On Behalf Of Takao Indoh
> Sent: Wednesday, September 05, 2012 8:10 PM
> To: vgoyal@redhat.com
> Cc: kexec@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-pci@vger.kernel.org; bhelgaas@google.com; hbabu@us.ibm.com; Ishii,
> Hironobu/?? ??; martin.wilck@ts.fujitsu.com
> Subject: Re: [RFC][PATCH] Reset PCIe devices to address DMA problem on kdump
> with iommu
> 
> (2012/08/07 5:39), Vivek Goyal wrote:
> > On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
> >> Hi Vivek,
> >>
> >> (2012/08/03 20:46), Vivek Goyal wrote:
> >>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
> >>>> Hi all,
> >>>>
> >>>> This patch adds kernel parameter "reset_pcie_devices" which resets
> PCIe
> >>>> devices at boot time to address DMA problem on kdump with iommu. When
> >>>> this parameter is specified, a hot reset is triggered on each PCIe
> root
> >>>> port and downstream port to reset its downstream endpoint.
> >>>
> >>> Hi Takao,
> >>>
> >>> Why not use existing "reset_devices" parameter instead of introducing
> >>> a new one?
> >>
> >> "reset_devices" is used for each driver to reset their own device, and
> >> this patch resets all devices forcibly, so I thought they were different
> >> things.
> >
> > Yes reset_devices currently is used for driver to reset its device. I
> > thought one could very well extend its reach to reset pci express devices
> > at bus level.
> >
> > Having them separate is not going to be much useful from kdump
> > perspective. We will end up passing both reset_devices and
> > reset_pcie_devices to second kernel whill lead to bus level reset as well
> > as device level reset.
> >
> > Ideal situation would be that somehow detect that bus level reset has
> been
> > done and skip device level reset (assuming bus level reset obviates the
> > need of device level reset, please correct me if that's not the case).
> >
> > After pcie reset, can we store the state in a variable and drivers can
> > use that variable to check if PCIe level reset was done or not. If yes,
> > skip device level reset (Assuming driver knows that device is on a
> > PCIe slot).
> >
> > In that case we will not have to introduce new kernel command line, and
> > also avoid double reset?
> 
> I found a problem when testing my patch on some machines.
> 
> Originally there are two problems in kdump kernel when iommu is enabled;
> DMAR error and PCI SERR. I thought they are fixed by my patch, but I
> noticed that PCI SERR is still detected after applying the patch. It
> seems that something happens when Interrupt Remapping is initialized in
> kdump kernel.

I'm not sure, but I guess the PCI SERR might be caused as follows.

- The 1st kernel enables interrupt remapping. The MSI(-X) address and
  data registers of PCI devices are programmed in remappable format.

- At the beginning of 2nd kernel, interrupt remapping is still active.
  And then it is disabled by enable_IR() function for initialization.

- PCI device generate an interrupt. At this moment, interrupt remapping
  is not enabled yet. On the other hand, MSI(-X) address and data of this
  interrupt is in remappable format because those are programmed by 1st
  kernel. I guess this might be a cause of PCI SERR.

I guess clearing command register or disabling MSI before interrupt
remapping initialization in 2nd kernel might solve the PCI SERR problem.

Regards,
Kenji Kaneshige


> 
> Therefore resetting devices has to be done before enable_IR() is
> called. I have three ideas for it.
> 
>   (i) Resetting devices in 1st kernel(panic kernel)
>   We can reset devices before jumping into 2nd kernel. Of course it may
>   be dangerous to scan pci device tree and call PCI functions in panic'd
>   kernel. Beforehand we need to collect device information so that only
>   minimal code could run on panic.
> 
>   (ii) Resetting devices in purgatory
>   It seems to be be appropriate place to do this, but I'm not sure
>   where I can save/restore PCI config when resetting devices in
>   purgatory.
> 
>   (iii) Resetting devices in 2nd kernel(kdump kernel)
>   Important point is to do reset before enable_IR() is called as I wrote
>   above. I think I should add new function to do reset into
>   arch/x86/pci/early.c and call it in setup_arch like
>   early_dump_pci_devices() or early_quirks().
> 
> Any comments?
> 
> Thanks,
> Takao Indoh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 10, 2012, 6:35 a.m. UTC | #8
(2012/09/10 11:34), Kaneshige, Kenji wrote:
>> -----Original Message-----
>> From: linux-pci-owner@vger.kernel.org
>> [mailto:linux-pci-owner@vger.kernel.org] On Behalf Of Takao Indoh
>> Sent: Wednesday, September 05, 2012 8:10 PM
>> To: vgoyal@redhat.com
>> Cc: kexec@lists.infradead.org; linux-kernel@vger.kernel.org;
>> linux-pci@vger.kernel.org; bhelgaas@google.com; hbabu@us.ibm.com; Ishii,
>> Hironobu/?? ??; martin.wilck@ts.fujitsu.com
>> Subject: Re: [RFC][PATCH] Reset PCIe devices to address DMA problem on kdump
>> with iommu
>>
>> (2012/08/07 5:39), Vivek Goyal wrote:
>>> On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
>>>> Hi Vivek,
>>>>
>>>> (2012/08/03 20:46), Vivek Goyal wrote:
>>>>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> This patch adds kernel parameter "reset_pcie_devices" which resets
>> PCIe
>>>>>> devices at boot time to address DMA problem on kdump with iommu. When
>>>>>> this parameter is specified, a hot reset is triggered on each PCIe
>> root
>>>>>> port and downstream port to reset its downstream endpoint.
>>>>>
>>>>> Hi Takao,
>>>>>
>>>>> Why not use existing "reset_devices" parameter instead of introducing
>>>>> a new one?
>>>>
>>>> "reset_devices" is used for each driver to reset their own device, and
>>>> this patch resets all devices forcibly, so I thought they were different
>>>> things.
>>>
>>> Yes reset_devices currently is used for driver to reset its device. I
>>> thought one could very well extend its reach to reset pci express devices
>>> at bus level.
>>>
>>> Having them separate is not going to be much useful from kdump
>>> perspective. We will end up passing both reset_devices and
>>> reset_pcie_devices to second kernel whill lead to bus level reset as well
>>> as device level reset.
>>>
>>> Ideal situation would be that somehow detect that bus level reset has
>> been
>>> done and skip device level reset (assuming bus level reset obviates the
>>> need of device level reset, please correct me if that's not the case).
>>>
>>> After pcie reset, can we store the state in a variable and drivers can
>>> use that variable to check if PCIe level reset was done or not. If yes,
>>> skip device level reset (Assuming driver knows that device is on a
>>> PCIe slot).
>>>
>>> In that case we will not have to introduce new kernel command line, and
>>> also avoid double reset?
>>
>> I found a problem when testing my patch on some machines.
>>
>> Originally there are two problems in kdump kernel when iommu is enabled;
>> DMAR error and PCI SERR. I thought they are fixed by my patch, but I
>> noticed that PCI SERR is still detected after applying the patch. It
>> seems that something happens when Interrupt Remapping is initialized in
>> kdump kernel.
> 
> I'm not sure, but I guess the PCI SERR might be caused as follows.
> 
> - The 1st kernel enables interrupt remapping. The MSI(-X) address and
>    data registers of PCI devices are programmed in remappable format.
> 
> - At the beginning of 2nd kernel, interrupt remapping is still active.
>    And then it is disabled by enable_IR() function for initialization.
> 
> - PCI device generate an interrupt. At this moment, interrupt remapping
>    is not enabled yet. On the other hand, MSI(-X) address and data of this
>    interrupt is in remappable format because those are programmed by 1st
>    kernel. I guess this might be a cause of PCI SERR.
> 
> I guess clearing command register or disabling MSI before interrupt
> remapping initialization in 2nd kernel might solve the PCI SERR problem.

Thank you for your comment. That makes sense.

Though I think clearing bus master bit is enough, do you think I need to clear
all command register bit not only bus master? 

Thanks,
Takao Indoh


> Regards,
> Kenji Kaneshige
> 
> 
>>
>> Therefore resetting devices has to be done before enable_IR() is
>> called. I have three ideas for it.
>>
>>    (i) Resetting devices in 1st kernel(panic kernel)
>>    We can reset devices before jumping into 2nd kernel. Of course it may
>>    be dangerous to scan pci device tree and call PCI functions in panic'd
>>    kernel. Beforehand we need to collect device information so that only
>>    minimal code could run on panic.
>>
>>    (ii) Resetting devices in purgatory
>>    It seems to be be appropriate place to do this, but I'm not sure
>>    where I can save/restore PCI config when resetting devices in
>>    purgatory.
>>
>>    (iii) Resetting devices in 2nd kernel(kdump kernel)
>>    Important point is to do reset before enable_IR() is called as I wrote
>>    above. I think I should add new function to do reset into
>>    arch/x86/pci/early.c and call it in setup_arch like
>>    early_dump_pci_devices() or early_quirks().
>>
>> Any comments?
>>
>> Thanks,
>> Takao Indoh
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal Sept. 10, 2012, 2:36 p.m. UTC | #9
On Wed, Sep 05, 2012 at 08:09:58PM +0900, Takao Indoh wrote:
> (2012/08/07 5:39), Vivek Goyal wrote:
> >On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
> >>Hi Vivek,
> >>
> >>(2012/08/03 20:46), Vivek Goyal wrote:
> >>>On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
> >>>>Hi all,
> >>>>
> >>>>This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
> >>>>devices at boot time to address DMA problem on kdump with iommu. When
> >>>>this parameter is specified, a hot reset is triggered on each PCIe root
> >>>>port and downstream port to reset its downstream endpoint.
> >>>
> >>>Hi Takao,
> >>>
> >>>Why not use existing "reset_devices" parameter instead of introducing
> >>>a new one?
> >>
> >>"reset_devices" is used for each driver to reset their own device, and
> >>this patch resets all devices forcibly, so I thought they were different
> >>things.
> >
> >Yes reset_devices currently is used for driver to reset its device. I
> >thought one could very well extend its reach to reset pci express devices
> >at bus level.
> >
> >Having them separate is not going to be much useful from kdump
> >perspective. We will end up passing both reset_devices and
> >reset_pcie_devices to second kernel whill lead to bus level reset as well
> >as device level reset.
> >
> >Ideal situation would be that somehow detect that bus level reset has been
> >done and skip device level reset (assuming bus level reset obviates the
> >need of device level reset, please correct me if that's not the case).
> >
> >After pcie reset, can we store the state in a variable and drivers can
> >use that variable to check if PCIe level reset was done or not. If yes,
> >skip device level reset (Assuming driver knows that device is on a
> >PCIe slot).
> >
> >In that case we will not have to introduce new kernel command line, and
> >also avoid double reset?
> I found a problem when testing my patch on some machines.
> 
> Originally there are two problems in kdump kernel when iommu is enabled;
> DMAR error and PCI SERR. I thought they are fixed by my patch, but I
> noticed that PCI SERR is still detected after applying the patch. It
> seems that something happens when Interrupt Remapping is initialized in
> kdump kernel.
> 
> Therefore resetting devices has to be done before enable_IR() is
> called. I have three ideas for it.
> 
>  (i) Resetting devices in 1st kernel(panic kernel)
>  We can reset devices before jumping into 2nd kernel. Of course it may
>  be dangerous to scan pci device tree and call PCI functions in panic'd
>  kernel. Beforehand we need to collect device information so that only
>  minimal code could run on panic.
> 
>  (ii) Resetting devices in purgatory
>  It seems to be be appropriate place to do this, but I'm not sure
>  where I can save/restore PCI config when resetting devices in
>  purgatory.
> 
>  (iii) Resetting devices in 2nd kernel(kdump kernel)
>  Important point is to do reset before enable_IR() is called as I wrote
>  above. I think I should add new function to do reset into
>  arch/x86/pci/early.c and call it in setup_arch like
>  early_dump_pci_devices() or early_quirks().

I would not claim that I understand hte PCI SERR issue. But whatever
resettings needs to happen, should happen early in second kernel.

Doing it in first kernel is not a good idea as it is crashed kernel and
we want to as little as possible.

Doing it in purgatory is not a good idea either as purgatory does not
konw anything about kernel as such. We don't want to bloat purgatory
with reset code and embedding the device knowledge there.

Keeping it in second kernel makes sense so that code remains with kernel
and can be maintained there.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 11, 2012, 10:32 a.m. UTC | #10
(2012/09/10 23:36), Vivek Goyal wrote:
> On Wed, Sep 05, 2012 at 08:09:58PM +0900, Takao Indoh wrote:
>> (2012/08/07 5:39), Vivek Goyal wrote:
>>> On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
>>>> Hi Vivek,
>>>>
>>>> (2012/08/03 20:46), Vivek Goyal wrote:
>>>>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> This patch adds kernel parameter "reset_pcie_devices" which resets PCIe
>>>>>> devices at boot time to address DMA problem on kdump with iommu. When
>>>>>> this parameter is specified, a hot reset is triggered on each PCIe root
>>>>>> port and downstream port to reset its downstream endpoint.
>>>>>
>>>>> Hi Takao,
>>>>>
>>>>> Why not use existing "reset_devices" parameter instead of introducing
>>>>> a new one?
>>>>
>>>> "reset_devices" is used for each driver to reset their own device, and
>>>> this patch resets all devices forcibly, so I thought they were different
>>>> things.
>>>
>>> Yes reset_devices currently is used for driver to reset its device. I
>>> thought one could very well extend its reach to reset pci express devices
>>> at bus level.
>>>
>>> Having them separate is not going to be much useful from kdump
>>> perspective. We will end up passing both reset_devices and
>>> reset_pcie_devices to second kernel whill lead to bus level reset as well
>>> as device level reset.
>>>
>>> Ideal situation would be that somehow detect that bus level reset has been
>>> done and skip device level reset (assuming bus level reset obviates the
>>> need of device level reset, please correct me if that's not the case).
>>>
>>> After pcie reset, can we store the state in a variable and drivers can
>>> use that variable to check if PCIe level reset was done or not. If yes,
>>> skip device level reset (Assuming driver knows that device is on a
>>> PCIe slot).
>>>
>>> In that case we will not have to introduce new kernel command line, and
>>> also avoid double reset?
>> I found a problem when testing my patch on some machines.
>>
>> Originally there are two problems in kdump kernel when iommu is enabled;
>> DMAR error and PCI SERR. I thought they are fixed by my patch, but I
>> noticed that PCI SERR is still detected after applying the patch. It
>> seems that something happens when Interrupt Remapping is initialized in
>> kdump kernel.
>>
>> Therefore resetting devices has to be done before enable_IR() is
>> called. I have three ideas for it.
>>
>>   (i) Resetting devices in 1st kernel(panic kernel)
>>   We can reset devices before jumping into 2nd kernel. Of course it may
>>   be dangerous to scan pci device tree and call PCI functions in panic'd
>>   kernel. Beforehand we need to collect device information so that only
>>   minimal code could run on panic.
>>
>>   (ii) Resetting devices in purgatory
>>   It seems to be be appropriate place to do this, but I'm not sure
>>   where I can save/restore PCI config when resetting devices in
>>   purgatory.
>>
>>   (iii) Resetting devices in 2nd kernel(kdump kernel)
>>   Important point is to do reset before enable_IR() is called as I wrote
>>   above. I think I should add new function to do reset into
>>   arch/x86/pci/early.c and call it in setup_arch like
>>   early_dump_pci_devices() or early_quirks().
>
> I would not claim that I understand hte PCI SERR issue. But whatever
> resettings needs to happen, should happen early in second kernel.
>
> Doing it in first kernel is not a good idea as it is crashed kernel and
> we want to as little as possible.
>
> Doing it in purgatory is not a good idea either as purgatory does not
> konw anything about kernel as such. We don't want to bloat purgatory
> with reset code and embedding the device knowledge there.
>
> Keeping it in second kernel makes sense so that code remains with kernel
> and can be maintained there.

I'll post new patch which clears bus master bit and resets devices in
second kernel.

As to the boot parameter to enable this function, you suggested using
reset_devices. I found that on a certain platform resetting devices
caused PCIe error due to a hardware bug. Therefore I think we need
new parameter apart from reset_devices to disable this function on
such a machine.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenji Kaneshige Sept. 11, 2012, 11:52 a.m. UTC | #11
> -----Original Message-----
> From: Takao Indoh [mailto:indou.takao@jp.fujitsu.com]
> Sent: Monday, September 10, 2012 3:35 PM
> To: Kaneshige, Kenji/?? ??
> Cc: vgoyal@redhat.com; kexec@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> bhelgaas@google.com; hbabu@us.ibm.com; Ishii, Hironobu/?? ??;
> martin.wilck@ts.fujitsu.com
> Subject: Re: [RFC][PATCH] Reset PCIe devices to address DMA problem on kdump
> with iommu
> 
> (2012/09/10 11:34), Kaneshige, Kenji wrote:
> >> -----Original Message-----
> >> From: linux-pci-owner@vger.kernel.org
> >> [mailto:linux-pci-owner@vger.kernel.org] On Behalf Of Takao Indoh
> >> Sent: Wednesday, September 05, 2012 8:10 PM
> >> To: vgoyal@redhat.com
> >> Cc: kexec@lists.infradead.org; linux-kernel@vger.kernel.org;
> >> linux-pci@vger.kernel.org; bhelgaas@google.com; hbabu@us.ibm.com;
> Ishii,
> >> Hironobu/?? ??; martin.wilck@ts.fujitsu.com
> >> Subject: Re: [RFC][PATCH] Reset PCIe devices to address DMA problem on
> kdump
> >> with iommu
> >>
> >> (2012/08/07 5:39), Vivek Goyal wrote:
> >>> On Mon, Aug 06, 2012 at 01:30:47PM +0900, Takao Indoh wrote:
> >>>> Hi Vivek,
> >>>>
> >>>> (2012/08/03 20:46), Vivek Goyal wrote:
> >>>>> On Fri, Aug 03, 2012 at 08:24:31PM +0900, Takao Indoh wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> This patch adds kernel parameter "reset_pcie_devices" which resets
> >> PCIe
> >>>>>> devices at boot time to address DMA problem on kdump with iommu.
> When
> >>>>>> this parameter is specified, a hot reset is triggered on each PCIe
> >> root
> >>>>>> port and downstream port to reset its downstream endpoint.
> >>>>>
> >>>>> Hi Takao,
> >>>>>
> >>>>> Why not use existing "reset_devices" parameter instead of introducing
> >>>>> a new one?
> >>>>
> >>>> "reset_devices" is used for each driver to reset their own device,
> and
> >>>> this patch resets all devices forcibly, so I thought they were different
> >>>> things.
> >>>
> >>> Yes reset_devices currently is used for driver to reset its device.
> I
> >>> thought one could very well extend its reach to reset pci express devices
> >>> at bus level.
> >>>
> >>> Having them separate is not going to be much useful from kdump
> >>> perspective. We will end up passing both reset_devices and
> >>> reset_pcie_devices to second kernel whill lead to bus level reset as
> well
> >>> as device level reset.
> >>>
> >>> Ideal situation would be that somehow detect that bus level reset has
> >> been
> >>> done and skip device level reset (assuming bus level reset obviates
> the
> >>> need of device level reset, please correct me if that's not the case).
> >>>
> >>> After pcie reset, can we store the state in a variable and drivers can
> >>> use that variable to check if PCIe level reset was done or not. If yes,
> >>> skip device level reset (Assuming driver knows that device is on a
> >>> PCIe slot).
> >>>
> >>> In that case we will not have to introduce new kernel command line,
> and
> >>> also avoid double reset?
> >>
> >> I found a problem when testing my patch on some machines.
> >>
> >> Originally there are two problems in kdump kernel when iommu is enabled;
> >> DMAR error and PCI SERR. I thought they are fixed by my patch, but I
> >> noticed that PCI SERR is still detected after applying the patch. It
> >> seems that something happens when Interrupt Remapping is initialized
> in
> >> kdump kernel.
> >
> > I'm not sure, but I guess the PCI SERR might be caused as follows.
> >
> > - The 1st kernel enables interrupt remapping. The MSI(-X) address and
> >    data registers of PCI devices are programmed in remappable format.
> >
> > - At the beginning of 2nd kernel, interrupt remapping is still active.
> >    And then it is disabled by enable_IR() function for initialization.
> >
> > - PCI device generate an interrupt. At this moment, interrupt remapping
> >    is not enabled yet. On the other hand, MSI(-X) address and data of
> this
> >    interrupt is in remappable format because those are programmed by 1st
> >    kernel. I guess this might be a cause of PCI SERR.
> >
> > I guess clearing command register or disabling MSI before interrupt
> > remapping initialization in 2nd kernel might solve the PCI SERR problem.
> 
> Thank you for your comment. That makes sense.
> 
> Though I think clearing bus master bit is enough, do you think I need to
> clear
> all command register bit not only bus master?

I thought clearing bus master bit. But we might also need to set the
disable INTx bit.

It would not be needed if we can reset devices before interrupt remapping
initialization. I guess resetting devices before interrupt remapping
initialization would need complex code change and be difficult to implement.
If so, clearing bus master bit would be one of the alternative solution
against PCI SERR issue.

Regards,
Kenji Kaneshige



> 
> Thanks,
> Takao Indoh
> 
> 
> > Regards,
> > Kenji Kaneshige
> >
> >
> >>
> >> Therefore resetting devices has to be done before enable_IR() is
> >> called. I have three ideas for it.
> >>
> >>    (i) Resetting devices in 1st kernel(panic kernel)
> >>    We can reset devices before jumping into 2nd kernel. Of course it
> may
> >>    be dangerous to scan pci device tree and call PCI functions in panic'd
> >>    kernel. Beforehand we need to collect device information so that only
> >>    minimal code could run on panic.
> >>
> >>    (ii) Resetting devices in purgatory
> >>    It seems to be be appropriate place to do this, but I'm not sure
> >>    where I can save/restore PCI config when resetting devices in
> >>    purgatory.
> >>
> >>    (iii) Resetting devices in 2nd kernel(kdump kernel)
> >>    Important point is to do reset before enable_IR() is called as I wrote
> >>    above. I think I should add new function to do reset into
> >>    arch/x86/pci/early.c and call it in setup_arch like
> >>    early_dump_pci_devices() or early_quirks().
> >>
> >> Any comments?
> >>
> >> Thanks,
> >> Takao Indoh
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-pci"
> in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal Sept. 11, 2012, 2:43 p.m. UTC | #12
On Tue, Sep 11, 2012 at 07:32:35PM +0900, Takao Indoh wrote:

[..]
> I'll post new patch which clears bus master bit and resets devices in
> second kernel.
> 
> As to the boot parameter to enable this function, you suggested using
> reset_devices. I found that on a certain platform resetting devices
> caused PCIe error due to a hardware bug. Therefore I think we need
> new parameter apart from reset_devices to disable this function on
> such a machine.

Can you explain a bit more how the error happens. I still don't think
that because of a bug in a platform somewhere we should be introducing
a separate command line parameter and not reuse the exisiting one. Also
you have not explained what's the bug and how a new parameter will 
avoid the bug.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 12, 2012, 9 a.m. UTC | #13
(2012/09/11 23:43), Vivek Goyal wrote:
> On Tue, Sep 11, 2012 at 07:32:35PM +0900, Takao Indoh wrote:
>
> [..]
>> I'll post new patch which clears bus master bit and resets devices in
>> second kernel.
>>
>> As to the boot parameter to enable this function, you suggested using
>> reset_devices. I found that on a certain platform resetting devices
>> caused PCIe error due to a hardware bug. Therefore I think we need
>> new parameter apart from reset_devices to disable this function on
>> such a machine.
>
> Can you explain a bit more how the error happens. I still don't think
> that because of a bug in a platform somewhere we should be introducing
> a separate command line parameter and not reuse the exisiting one. Also
> you have not explained what's the bug and how a new parameter will
> avoid the bug.

The bug I mentioned is that ACS Violation occurs at PCIe switch when
reading PCI configuration after device reset. I got information that
this violation is caused by PCIe switch bug. The machine becomes fatal
status by this error.

The reason why I try to introduce new parameter is that I want to avoid
regression by this patch. Let's say this patch was included in kernel
and its reset function was enabled by reset_devices as you said. AFAIK
reset_devices is always needed for kdump, so it means that devices are
always reset at kdump boot time. It causes a regression that system
always becomes abnormal status when we run kdump on the machine which has
a bug I mentioned.

To avoid this regression, I want to separate reset_devices from this
reset function. Or how about this?
- if user specify reset_devices, devices are reset by this patch, as you
   said.
- To avoid a regression I said, add new parameter like "pci=noreset".
   If this parameter is specified, the reset function I add is disabled
   and we can avoid regression.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal Sept. 14, 2012, 3:48 p.m. UTC | #14
On Wed, Sep 12, 2012 at 06:00:55PM +0900, Takao Indoh wrote:
> (2012/09/11 23:43), Vivek Goyal wrote:
> >On Tue, Sep 11, 2012 at 07:32:35PM +0900, Takao Indoh wrote:
> >
> >[..]
> >>I'll post new patch which clears bus master bit and resets devices in
> >>second kernel.
> >>
> >>As to the boot parameter to enable this function, you suggested using
> >>reset_devices. I found that on a certain platform resetting devices
> >>caused PCIe error due to a hardware bug. Therefore I think we need
> >>new parameter apart from reset_devices to disable this function on
> >>such a machine.
> >
> >Can you explain a bit more how the error happens. I still don't think
> >that because of a bug in a platform somewhere we should be introducing
> >a separate command line parameter and not reuse the exisiting one. Also
> >you have not explained what's the bug and how a new parameter will
> >avoid the bug.
> 
> The bug I mentioned is that ACS Violation occurs at PCIe switch when
> reading PCI configuration after device reset. I got information that
> this violation is caused by PCIe switch bug. The machine becomes fatal
> status by this error.
> 
> The reason why I try to introduce new parameter is that I want to avoid
> regression by this patch. Let's say this patch was included in kernel
> and its reset function was enabled by reset_devices as you said. AFAIK
> reset_devices is always needed for kdump, so it means that devices are
> always reset at kdump boot time. It causes a regression that system
> always becomes abnormal status when we run kdump on the machine which has
> a bug I mentioned.
> 
> To avoid this regression, I want to separate reset_devices from this
> reset function. Or how about this?
> - if user specify reset_devices, devices are reset by this patch, as you
>   said.
> - To avoid a regression I said, add new parameter like "pci=noreset".
>   If this parameter is specified, the reset function I add is disabled
>   and we can avoid regression.

Can we identify that particular switch in code and not reset it in code.
Introducing new paramenters to avoid bugs really feels odd.

Also, what was the conclusion to avoid double reset. I am assuming that
we don't want to do bus level reset as well as driver level reset based
on reset_devices. 

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konrad Rzeszutek Wilk Sept. 14, 2012, 8:03 p.m. UTC | #15
> As to the boot parameter to enable this function, you suggested using
> reset_devices. I found that on a certain platform resetting devices
> caused PCIe error due to a hardware bug. Therefore I think we need
> new parameter apart from reset_devices to disable this function on
> such a machine.

Wouldn't a DMI quirk be better for this? That way the second kernel
internally would know not to do this.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 19, 2012, 1:52 a.m. UTC | #16
(2012/09/15 5:03), Konrad Rzeszutek Wilk wrote:
>> As to the boot parameter to enable this function, you suggested using
>> reset_devices. I found that on a certain platform resetting devices
>> caused PCIe error due to a hardware bug. Therefore I think we need
>> new parameter apart from reset_devices to disable this function on
>> such a machine.
>
> Wouldn't a DMI quirk be better for this? That way the second kernel
> internally would know not to do this.

That sounds good. I'll try it, thanks.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile Sept. 21, 2012, 5:57 p.m. UTC | #17
On 09/14/2012 04:03 PM, Konrad Rzeszutek Wilk wrote:
>> As to the boot parameter to enable this function, you suggested using
>> reset_devices. I found that on a certain platform resetting devices
>> caused PCIe error due to a hardware bug. Therefore I think we need
>> new parameter apart from reset_devices to disable this function on
>> such a machine.
>
> Wouldn't a DMI quirk be better for this? That way the second kernel
> internally would know not to do this.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
a DMI quirk for a PCIe switch?  pci-quirk maybe?  there is a hook
in per-device pci reset code path, which could be setup for such
a case.

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 24, 2012, 11:12 a.m. UTC | #18
(2012/09/22 2:57), Don Dutile wrote:
> On 09/14/2012 04:03 PM, Konrad Rzeszutek Wilk wrote:
>>> As to the boot parameter to enable this function, you suggested using
>>> reset_devices. I found that on a certain platform resetting devices
>>> caused PCIe error due to a hardware bug. Therefore I think we need
>>> new parameter apart from reset_devices to disable this function on
>>> such a machine.
>>
>> Wouldn't a DMI quirk be better for this? That way the second kernel
>> internally would know not to do this.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> a DMI quirk for a PCIe switch?  pci-quirk maybe?  there is a hook
> in per-device pci reset code path, which could be setup for such
> a case.

Yep, I think early-quirk is better.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takao Indoh Sept. 24, 2012, 11:22 a.m. UTC | #19
(2012/09/15 0:48), Vivek Goyal wrote:
> On Wed, Sep 12, 2012 at 06:00:55PM +0900, Takao Indoh wrote:
>> (2012/09/11 23:43), Vivek Goyal wrote:
>>> On Tue, Sep 11, 2012 at 07:32:35PM +0900, Takao Indoh wrote:
>>>
>>> [..]
>>>> I'll post new patch which clears bus master bit and resets devices in
>>>> second kernel.
>>>>
>>>> As to the boot parameter to enable this function, you suggested using
>>>> reset_devices. I found that on a certain platform resetting devices
>>>> caused PCIe error due to a hardware bug. Therefore I think we need
>>>> new parameter apart from reset_devices to disable this function on
>>>> such a machine.
>>>
>>> Can you explain a bit more how the error happens. I still don't think
>>> that because of a bug in a platform somewhere we should be introducing
>>> a separate command line parameter and not reuse the exisiting one. Also
>>> you have not explained what's the bug and how a new parameter will
>>> avoid the bug.
>>
>> The bug I mentioned is that ACS Violation occurs at PCIe switch when
>> reading PCI configuration after device reset. I got information that
>> this violation is caused by PCIe switch bug. The machine becomes fatal
>> status by this error.
>>
>> The reason why I try to introduce new parameter is that I want to avoid
>> regression by this patch. Let's say this patch was included in kernel
>> and its reset function was enabled by reset_devices as you said. AFAIK
>> reset_devices is always needed for kdump, so it means that devices are
>> always reset at kdump boot time. It causes a regression that system
>> always becomes abnormal status when we run kdump on the machine which has
>> a bug I mentioned.
>>
>> To avoid this regression, I want to separate reset_devices from this
>> reset function. Or how about this?
>> - if user specify reset_devices, devices are reset by this patch, as you
>>    said.
>> - To avoid a regression I said, add new parameter like "pci=noreset".
>>    If this parameter is specified, the reset function I add is disabled
>>    and we can avoid regression.
>
> Can we identify that particular switch in code and not reset it in code.
> Introducing new paramenters to avoid bugs really feels odd.

Maybe we can do it using PCI quirk or DMI quirk as Konrad and Don said.
But I'm still thinking whether I can add boot parameter or something to
disable reset function so that we can use it as workaround untill we add
quirk when we find such a buggy hardware.

> Also, what was the conclusion to avoid double reset. I am assuming that
> we don't want to do bus level reset as well as driver level reset based
> on reset_devices.

I don't have a good idea to do it yet. Maybe we need to write bus:slot:func
information to somewhere when we reset device.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e714a02..e694e9f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2489,6 +2489,10 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 	reset_devices	[KNL] Force drivers to reset the underlying device
 			during initialization.
 
+	reset_pcie_devices
+			[PCIE] Reset PCIe endpoint at boot time by sending a
+			hot reset to root port and downstream port
+
 	resume=		[SWSUSP]
 			Specify the partition device for software suspend
 			Format:
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 5155317..7f7fc02 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -32,6 +32,65 @@ 
 #include "pci.h"
 
 /*
+ * Reset PCIe endpoint by sending a hot reset to root port and downstream port
+ */
+unsigned int reset_pcie_devices;
+EXPORT_SYMBOL(reset_pcie_devices);
+static int __init set_reset_pcie_devices(char *str)
+{
+	reset_pcie_devices = 1;
+	return 1;
+}
+__setup("reset_pcie_devices", set_reset_pcie_devices);
+
+static void __devinit quirk_pcie_device_reset(struct pci_dev *dev)
+{
+	struct pci_bus *subordinate;
+	struct pci_dev *child;
+	u16 ctrl;
+
+	if (!reset_pcie_devices || !pci_is_pcie(dev) || !dev->subordinate ||
+	    ((dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT) &&
+	     (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)))
+		return;
+
+	subordinate = dev->subordinate;
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		if ((child->pcie_type == PCI_EXP_TYPE_UPSTREAM) ||
+		    (child->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) ||
+		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+		/* Don't reset switch, bridge, VGA device */
+		return;
+	}
+
+	dev_info(&dev->dev, "Reset Secondary bus\n");
+
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		dev_info(&child->dev, "save state\n");
+		pci_save_state(child);
+	}
+
+	/* Assert Secondary Bus Reset */
+	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+	msleep(2);
+
+	/* De-assert Secondary Bus Reset */
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+	msleep(200);
+
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		dev_info(&child->dev, "restore state\n");
+		pci_restore_state(child);
+	}
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_pcie_device_reset);
+
+/*
  * Decoding should be disabled for a PCI device during BAR sizing to avoid
  * conflict. But doing so may cause problems on host bridge and perhaps other
  * key system devices. For devices that need to have mmio decoding always-on,