[v2] PCI: Reset PCIe devices to stop ongoing DMA

Message ID	1368509365-2260-1-git-send-email-indou.takao@jp.fujitsu.com (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> From: Takao Indoh <indou.takao@jp.fujitsu.com> To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, bhelgaas@google.com Cc: iommu@lists.linux-foundation.org, kexec@lists.infradead.org, ishii.hironobu@jp.fujitsu.com, ddutile@redhat.com, bill.sumner@hp.com, alex.williamson@redhat.com Subject: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA Date: Tue, 14 May 2013 14:29:25 +0900 Message-Id: <1368509365-2260-1-git-send-email-indou.takao@jp.fujitsu.com> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Takao Indoh May 14, 2013, 5:29 a.m. UTC

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
on each PCIe root port and downstream port to reset its downstream
endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
   dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
root port/downstream port whose child is PCIe endpoint, and then reset
link between them. If the endpoint is VGA device, it is skipped because
the monitor blacks out if VGA controller is reset.

Changelog:
v2:
- Read pci config before de-assert secondary bus reset to flush previous
  write
- Change function/variable name
- Make a list of devices to be reset

v1:
https://patchwork.kernel.org/patch/2482291/

Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    2 +
 drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+), 0 deletions(-)

Eric W. Biederman May 14, 2013, 10:04 p.m. UTC | #1

Takao Indoh <indou.takao@jp.fujitsu.com> writes:

> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>    and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>    dma-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
>
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.

At a quick skim this patch looks reasonable.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
>   write
> - Change function/variable name
> - Make a list of devices to be reset
>
> v1:
> https://patchwork.kernel.org/patch/2482291/
>
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |    2 +
>  drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>  2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  				any pair of devices, possibly at the cost of
>  				reduced performance.  This also guarantees
>  				that hot-added devices will work.
> +		pcie_reset_endpoint_devices	Reset PCIe endpoint on boot by
> +				hot reset
>  		cbiosize=nn[KMG]	The fixed amount of bus space which is
>  				reserved for the CardBus bridge's IO window.
>  				The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>  
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	if (!pci_is_pcie(dev) || !dev->subordinate ||
> +	    list_empty(&dev->subordinate->devices) ||
> +	    ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	     (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> +		return 0;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> +		    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +			/* Don't reset switch, bridge, VGA device */
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "save state\n");
> +		pci_save_state(child);
> +	}
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "restore state\n");
> +		pci_restore_state(child);
> +	}
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> +	u16 ctrl;
> +
> +	dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	/* Read config again to flush previous write */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> +	struct list_head list;
> +	struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> +	struct pci_dev *dev = NULL;
> +	struct pci_dev_entry *pdev_entry, *tmp;
> +	LIST_HEAD(pdev_list);
> +
> +	if (!pcie_reset_endpoint_devices)
> +		return 0;
> +
> +	for_each_pci_dev(dev)
> +		if (need_reset(dev)) {
> +			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> +			pdev_entry->dev = dev;
> +			list_add(&pdev_entry->list, &pdev_list);
> +		}
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		save_downstream_configs(pdev_entry->dev);
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		do_downstream_device_reset(pdev_entry->dev);
> +
> +	msleep(1000);
> +
> +	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> +		restore_downstream_configs(pdev_entry->dev);
> +		kfree(pdev_entry);
> +	}
> +
> +	return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
>  static int __init pci_setup(char *str)
>  {
>  	while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>  				pcie_bus_config = PCIE_BUS_PEER2PEER;
>  			} else if (!strncmp(str, "pcie_scan_all", 13)) {
>  				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> +			} else if (!strncmp(str, "pcie_reset_endpoint_devices",
> +						27)) {
> +				pcie_reset_endpoint_devices = 1;
>  			} else {
>  				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>  						str);
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh May 21, 2013, 11:46 p.m. UTC | #2

Hi Bjorn,

Any comments, ack/nack?

Thanks,
Takao Indoh

(2013/05/15 7:04), Eric W. Biederman wrote:
> Takao Indoh <indou.takao@jp.fujitsu.com> writes:
> 
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>     and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>     dma-remapping errors.
>>
>> Solution:
>> All DMA transactions have to be stopped before iommu is initialized. By
>> this patch devices are reset and in-flight DMA is stopped before
>> pci_iommu_init.
>>
>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
>> root port/downstream port whose child is PCIe endpoint, and then reset
>> link between them. If the endpoint is VGA device, it is skipped because
>> the monitor blacks out if VGA controller is reset.
> 
> At a quick skim this patch looks reasonable.
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
>> Changelog:
>> v2:
>> - Read pci config before de-assert secondary bus reset to flush previous
>>    write
>> - Change function/variable name
>> - Make a list of devices to be reset
>>
>> v1:
>> https://patchwork.kernel.org/patch/2482291/
>>
>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>> ---
>>   Documentation/kernel-parameters.txt |    2 +
>>   drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 115 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index c3bfacb..8c9e8e4 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>   				any pair of devices, possibly at the cost of
>>   				reduced performance.  This also guarantees
>>   				that hot-added devices will work.
>> +		pcie_reset_endpoint_devices	Reset PCIe endpoint on boot by
>> +				hot reset
>>   		cbiosize=nn[KMG]	The fixed amount of bus space which is
>>   				reserved for the CardBus bridge's IO window.
>>   				The default value is 256 bytes.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a899d8b..70c1205 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>   }
>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>   
>> +/*
>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
>> + * endpoint except VGA device.
>> + */
>> +static int __init need_reset(struct pci_dev *dev)
>> +{
>> +	struct pci_bus *subordinate;
>> +	struct pci_dev *child;
>> +
>> +	if (!pci_is_pcie(dev) || !dev->subordinate ||
>> +	    list_empty(&dev->subordinate->devices) ||
>> +	    ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +	     (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>> +		return 0;
>> +
>> +	subordinate = dev->subordinate;
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>> +		    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> +			/* Don't reset switch, bridge, VGA device */
>> +			return 0;
>> +	}
>> +
>> +	return 1;
>> +}
>> +
>> +static void __init save_downstream_configs(struct pci_dev *dev)
>> +{
>> +	struct pci_bus *subordinate;
>> +	struct pci_dev *child;
>> +
>> +	subordinate = dev->subordinate;
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		dev_info(&child->dev, "save state\n");
>> +		pci_save_state(child);
>> +	}
>> +}
>> +
>> +static void __init restore_downstream_configs(struct pci_dev *dev)
>> +{
>> +	struct pci_bus *subordinate;
>> +	struct pci_dev *child;
>> +
>> +	subordinate = dev->subordinate;
>> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +		dev_info(&child->dev, "restore state\n");
>> +		pci_restore_state(child);
>> +	}
>> +}
>> +
>> +static void __init do_downstream_device_reset(struct pci_dev *dev)
>> +{
>> +	u16 ctrl;
>> +
>> +	dev_info(&dev->dev, "Reset Secondary bus\n");
>> +
>> +	/* Assert Secondary Bus Reset */
>> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	/* Read config again to flush previous write */
>> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> +	msleep(2);
>> +
>> +	/* De-assert Secondary Bus Reset */
>> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +}
>> +
>> +struct pci_dev_entry {
>> +	struct list_head list;
>> +	struct pci_dev *dev;
>> +};
>> +static int __initdata pcie_reset_endpoint_devices;
>> +static int __init reset_pcie_endpoints(void)
>> +{
>> +	struct pci_dev *dev = NULL;
>> +	struct pci_dev_entry *pdev_entry, *tmp;
>> +	LIST_HEAD(pdev_list);
>> +
>> +	if (!pcie_reset_endpoint_devices)
>> +		return 0;
>> +
>> +	for_each_pci_dev(dev)
>> +		if (need_reset(dev)) {
>> +			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>> +			pdev_entry->dev = dev;
>> +			list_add(&pdev_entry->list, &pdev_list);
>> +		}
>> +
>> +	list_for_each_entry(pdev_entry, &pdev_list, list)
>> +		save_downstream_configs(pdev_entry->dev);
>> +
>> +	list_for_each_entry(pdev_entry, &pdev_list, list)
>> +		do_downstream_device_reset(pdev_entry->dev);
>> +
>> +	msleep(1000);
>> +
>> +	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>> +		restore_downstream_configs(pdev_entry->dev);
>> +		kfree(pdev_entry);
>> +	}
>> +
>> +	return 0;
>> +}
>> +fs_initcall_sync(reset_pcie_endpoints);
>> +
>>   static int __init pci_setup(char *str)
>>   {
>>   	while (str) {
>> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>>   				pcie_bus_config = PCIE_BUS_PEER2PEER;
>>   			} else if (!strncmp(str, "pcie_scan_all", 13)) {
>>   				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>> +			} else if (!strncmp(str, "pcie_reset_endpoint_devices",
>> +						27)) {
>> +				pcie_reset_endpoint_devices = 1;
>>   			} else {
>>   				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>   						str);
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 6, 2013, 7:25 a.m. UTC | #3

Ping

Bjorn, could you please tell your impression on this idea?

This fix is very important for kdump. For exmaple, kdump does not work
if PCI passthrough is used on KVM guest. Please see patch description
for details.

If you don't agree this patch, please tell me what changes I should
make.

Thanks,
Takao Indoh

(2013/05/14 14:29), Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
> 
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>     and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>     dma-remapping errors.
> 
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
> 
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.
> 
> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
>    write
> - Change function/variable name
> - Make a list of devices to be reset
> 
> v1:
> https://patchwork.kernel.org/patch/2482291/
> 
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> ---
>   Documentation/kernel-parameters.txt |    2 +
>   drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>   2 files changed, 115 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   				any pair of devices, possibly at the cost of
>   				reduced performance.  This also guarantees
>   				that hot-added devices will work.
> +		pcie_reset_endpoint_devices	Reset PCIe endpoint on boot by
> +				hot reset
>   		cbiosize=nn[KMG]	The fixed amount of bus space which is
>   				reserved for the CardBus bridge's IO window.
>   				The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>   }
>   EXPORT_SYMBOL(pci_fixup_cardbus);
>   
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	if (!pci_is_pcie(dev) || !dev->subordinate ||
> +	    list_empty(&dev->subordinate->devices) ||
> +	    ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	     (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> +		return 0;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> +		    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +			/* Don't reset switch, bridge, VGA device */
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "save state\n");
> +		pci_save_state(child);
> +	}
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "restore state\n");
> +		pci_restore_state(child);
> +	}
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> +	u16 ctrl;
> +
> +	dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	/* Read config again to flush previous write */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> +	struct list_head list;
> +	struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> +	struct pci_dev *dev = NULL;
> +	struct pci_dev_entry *pdev_entry, *tmp;
> +	LIST_HEAD(pdev_list);
> +
> +	if (!pcie_reset_endpoint_devices)
> +		return 0;
> +
> +	for_each_pci_dev(dev)
> +		if (need_reset(dev)) {
> +			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> +			pdev_entry->dev = dev;
> +			list_add(&pdev_entry->list, &pdev_list);
> +		}
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		save_downstream_configs(pdev_entry->dev);
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		do_downstream_device_reset(pdev_entry->dev);
> +
> +	msleep(1000);
> +
> +	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> +		restore_downstream_configs(pdev_entry->dev);
> +		kfree(pdev_entry);
> +	}
> +
> +	return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
>   static int __init pci_setup(char *str)
>   {
>   	while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>   				pcie_bus_config = PCIE_BUS_PEER2PEER;
>   			} else if (!strncmp(str, "pcie_scan_all", 13)) {
>   				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> +			} else if (!strncmp(str, "pcie_reset_endpoint_devices",
> +						27)) {
> +				pcie_reset_endpoint_devices = 1;
>   			} else {
>   				printk(KERN_ERR "PCI: Unknown option `%s'\n",
>   						str);
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas June 7, 2013, 4:14 a.m. UTC | #4

Sorry it's taken me so long to look at this.  I've been putting this
off because the patch doesn't seem "obviously correct," and I don't
feel like I really understand the problem.  I'm naive about both
IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
questions or assumptions below.

On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>    and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>    dma-remapping errors.

If I understand correctly, the problem only happens on systems with an
IOMMU that's enabled in either the system or kdump kernel (or both).
For systems without an IOMMU (or if it is disabled in both the system
and kdump kernels),  any ongoing DMA should use addresses that target
system-kernel memory and should not affect the kdump kernel.

One thing I'm not sure about is that you are only resetting PCIe
devices, but I don't think the problem is actually specific to PCIe,
is it?  I think the same issue could occur on any system with an
IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
there are systems with IOMMUs for plain old PCI devices, e.g.,
PA-RISC.

I tried to make a list of the interesting scenarios and the events
that are relevant to this problem:

    Case 1: IOMMU off in system, off in kdump kernel
      system kernel leaves IOMMU off
        DMA targets system-kernel memory
      kexec to kdump kernel (IOMMU off, devices untouched)
        DMA targets system-kernel memory (harmless)
      kdump kernel re-inits device
        DMA targets kdump-kernel memory

    Case 2: IOMMU off in system kernel, on in kdump kernel
      system kernel leaves IOMMU off
        DMA targets system-kernel memory
      kexec to kdump kernel (IOMMU off, devices untouched)
        DMA targets system-kernel memory (harmless)
      kdump kernel enables IOMMU with no valid mappings
        DMA causes IOMMU errors (annoying but harmless)
      kdump kernel re-inits device
        DMA targets IOMMU-mapped kdump-kernel memory

    Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
      system kernel enables IOMMU
        DMA targets IOMMU-mapped system-kernel memory
      kexec to kdump kernel (IOMMU on, devices untouched)
        DMA targets IOMMU-mapped system-kernel memory
      kdump kernel doesn't know about IOMMU or doesn't touch it
        DMA targets IOMMU-mapped system-kernel memory
      kdump kernel re-inits device
        kernel assumes no IOMMU, so all new DMA mappings are invalid
because DMAs actually do go through the IOMMU
        (** corruption or other non-recoverable error likely **)

    Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
      system kernel enables IOMMU
        DMA targets IOMMU-mapped system-kernel memory
      kexec to kdump kernel (IOMMU on, devices untouched)
        DMA targets IOMMU-mapped system-kernel memory
      kdump kernel disables IOMMU
        DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
        (** corruption or other non-recoverable error likely **)
      kdump kernel re-inits device
        DMA targets kdump-kernel memory

    Case 4: IOMMU on in system kernel, on in kdump kernel
      system kernel enables IOMMU
        DMA targets IOMMU-mapped system-kernel memory
      kexec to kdump kernel (IOMMU on, devices untouched)
        DMA targets IOMMU-mapped system-kernel memory
      kdump kernel enables IOMMU with no valid mappings
        DMA causes IOMMU errors (annoying but harmless)
      kdump kernel re-inits device
        DMA targets IOMMU-mapped kdump-kernel memory

The problem cases I see are 3a and 3b, but that's not the problem
you're describing.  Obviously I'm missing something.

It sounds like you're seeing problems in case 2 or case 4, where the
IOMMU is enabled in the kdump kernel.  Maybe my assumption about the
IOMMU being enabled with no valid mappings is wrong?  Or maybe those
IOMMU errors are not actually harmless?

Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
with case 3a.  Therefore, it seems like the kdump kernel *must*
contain IOMMU support unless it knows for certain that the system
kernel wasn't using the IOMMU.

Do you have any bugzilla references or problem report URLs you could
include here?

Obviously I'm very confused here, so please help educate me :)

Also, you mentioned PCI passthrough with a KVM guest as being a
problem.  Can you explain more about what makes that a problem?  I
don't know enough about passthrough to know why a device being used by
a KVM guest would cause more problems than a device being used
directly by the host.

Bjorn

> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
>
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.
>
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |    2 +
>  drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>  2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>                                 any pair of devices, possibly at the cost of
>                                 reduced performance.  This also guarantees
>                                 that hot-added devices will work.
> +               pcie_reset_endpoint_devices     Reset PCIe endpoint on boot by
> +                               hot reset
>                 cbiosize=nn[KMG]        The fixed amount of bus space which is
>                                 reserved for the CardBus bridge's IO window.
>                                 The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> +       struct pci_bus *subordinate;
> +       struct pci_dev *child;
> +
> +       if (!pci_is_pcie(dev) || !dev->subordinate ||
> +           list_empty(&dev->subordinate->devices) ||
> +           ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> +               return 0;
> +
> +       subordinate = dev->subordinate;
> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
> +               if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> +                   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +                   ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +                       /* Don't reset switch, bridge, VGA device */
> +                       return 0;
> +       }
> +
> +       return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> +       struct pci_bus *subordinate;
> +       struct pci_dev *child;
> +
> +       subordinate = dev->subordinate;
> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
> +               dev_info(&child->dev, "save state\n");
> +               pci_save_state(child);
> +       }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> +       struct pci_bus *subordinate;
> +       struct pci_dev *child;
> +
> +       subordinate = dev->subordinate;
> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
> +               dev_info(&child->dev, "restore state\n");
> +               pci_restore_state(child);
> +       }
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> +       u16 ctrl;
> +
> +       dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +       /* Assert Secondary Bus Reset */
> +       pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +       ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +       pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +       /* Read config again to flush previous write */
> +       pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> +       msleep(2);
> +
> +       /* De-assert Secondary Bus Reset */
> +       ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +       pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> +       struct list_head list;
> +       struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> +       struct pci_dev *dev = NULL;
> +       struct pci_dev_entry *pdev_entry, *tmp;
> +       LIST_HEAD(pdev_list);
> +
> +       if (!pcie_reset_endpoint_devices)
> +               return 0;
> +
> +       for_each_pci_dev(dev)
> +               if (need_reset(dev)) {
> +                       pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> +                       pdev_entry->dev = dev;
> +                       list_add(&pdev_entry->list, &pdev_list);
> +               }
> +
> +       list_for_each_entry(pdev_entry, &pdev_list, list)
> +               save_downstream_configs(pdev_entry->dev);
> +
> +       list_for_each_entry(pdev_entry, &pdev_list, list)
> +               do_downstream_device_reset(pdev_entry->dev);
> +
> +       msleep(1000);
> +
> +       list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> +               restore_downstream_configs(pdev_entry->dev);
> +               kfree(pdev_entry);
> +       }
> +
> +       return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
>  static int __init pci_setup(char *str)
>  {
>         while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>                                 pcie_bus_config = PCIE_BUS_PEER2PEER;
>                         } else if (!strncmp(str, "pcie_scan_all", 13)) {
>                                 pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> +                       } else if (!strncmp(str, "pcie_reset_endpoint_devices",
> +                                               27)) {
> +                               pcie_reset_endpoint_devices = 1;
>                         } else {
>                                 printk(KERN_ERR "PCI: Unknown option `%s'\n",
>                                                 str);
> --
> 1.7.1
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 7, 2013, 8:46 a.m. UTC | #5

Thank you for your comments!

(2013/06/07 13:14), Bjorn Helgaas wrote:
> Sorry it's taken me so long to look at this.  I've been putting this
> off because the patch doesn't seem "obviously correct," and I don't
> feel like I really understand the problem.  I'm naive about both
> IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
> questions or assumptions below.
> 
> On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>     and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>     dma-remapping errors.
> 
> If I understand correctly, the problem only happens on systems with an
> IOMMU that's enabled in either the system or kdump kernel (or both).
> For systems without an IOMMU (or if it is disabled in both the system
> and kdump kernels),  any ongoing DMA should use addresses that target
> system-kernel memory and should not affect the kdump kernel.

Yes, that's correct.

> One thing I'm not sure about is that you are only resetting PCIe
> devices, but I don't think the problem is actually specific to PCIe,
> is it?  I think the same issue could occur on any system with an
> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
> there are systems with IOMMUs for plain old PCI devices, e.g.,
> PA-RISC.

Right, this is not specific to PCIe. The reasons why the target is only
PCIe is just to make algorithm to reset simple. It is possible to reset
legacy PCI devices in my patch, but code becomes somewhat complicated. I
thought recently most systems used PCIe and there was little demand for
resetting legacy PCI. Therefore I decided not to reset legacy PCI
devices, but I'll do if there are requests :-)

> 
> I tried to make a list of the interesting scenarios and the events
> that are relevant to this problem:
> 
>      Case 1: IOMMU off in system, off in kdump kernel
>        system kernel leaves IOMMU off
>          DMA targets system-kernel memory
>        kexec to kdump kernel (IOMMU off, devices untouched)
>          DMA targets system-kernel memory (harmless)
>        kdump kernel re-inits device
>          DMA targets kdump-kernel memory
> 
>      Case 2: IOMMU off in system kernel, on in kdump kernel
>        system kernel leaves IOMMU off
>          DMA targets system-kernel memory
>        kexec to kdump kernel (IOMMU off, devices untouched)
>          DMA targets system-kernel memory (harmless)
>        kdump kernel enables IOMMU with no valid mappings
>          DMA causes IOMMU errors (annoying but harmless)
>        kdump kernel re-inits device
>          DMA targets IOMMU-mapped kdump-kernel memory
> 
>      Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>        system kernel enables IOMMU
>          DMA targets IOMMU-mapped system-kernel memory
>        kexec to kdump kernel (IOMMU on, devices untouched)
>          DMA targets IOMMU-mapped system-kernel memory
>        kdump kernel doesn't know about IOMMU or doesn't touch it
>          DMA targets IOMMU-mapped system-kernel memory
>        kdump kernel re-inits device
>          kernel assumes no IOMMU, so all new DMA mappings are invalid
> because DMAs actually do go through the IOMMU
>          (** corruption or other non-recoverable error likely **)
> 
>      Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>        system kernel enables IOMMU
>          DMA targets IOMMU-mapped system-kernel memory
>        kexec to kdump kernel (IOMMU on, devices untouched)
>          DMA targets IOMMU-mapped system-kernel memory
>        kdump kernel disables IOMMU
>          DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>          (** corruption or other non-recoverable error likely **)
>        kdump kernel re-inits device
>          DMA targets kdump-kernel memory
> 
>      Case 4: IOMMU on in system kernel, on in kdump kernel
>        system kernel enables IOMMU
>          DMA targets IOMMU-mapped system-kernel memory
>        kexec to kdump kernel (IOMMU on, devices untouched)
>          DMA targets IOMMU-mapped system-kernel memory
>        kdump kernel enables IOMMU with no valid mappings
>          DMA causes IOMMU errors (annoying but harmless)

This is not harmless. Errors like PCI SERR are detected here, and it
makes driver or system unstable, and kdump fails. I also got report that
system hangs up due to this.


>        kdump kernel re-inits device
>          DMA targets IOMMU-mapped kdump-kernel memory
> 
> The problem cases I see are 3a and 3b, but that's not the problem
> you're describing.  Obviously I'm missing something.
> 
> It sounds like you're seeing problems in case 2 or case 4, where the
> IOMMU is enabled in the kdump kernel.  Maybe my assumption about the
> IOMMU being enabled with no valid mappings is wrong?  Or maybe those
> IOMMU errors are not actually harmless?

As I wrote above, IOMMU errors are not harmless. What I know is:

Case 1:  Harmless
Case 2:  Not tested
Case 3a: Not tested
Case 3b: Cause problem, fixed by my patch
Case 4:  Cause problem, fixed by my patch

I have never tested case 2 and 3a, but I think it also causes problem.

 
> Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
> with case 3a.  Therefore, it seems like the kdump kernel *must*
> contain IOMMU support unless it knows for certain that the system
> kernel wasn't using the IOMMU.

Yes, kdump kernel has to be compiled with support for IOMMU.


> 
> Do you have any bugzilla references or problem report URLs you could
> include here?

I know three Red Hat bugzilla, but I think these are private article and
you cannot see. I'll add you Cc list in bz so that you can see.

BZ#743790: Kdump fails with intel_iommu=on
https://bugzilla.redhat.com/show_bug.cgi?id=743790

BZ#743495: Hardware error is detected with intel_iommu=on
https://bugzilla.redhat.com/show_bug.cgi?id=743495

BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
https://bugzilla.redhat.com/show_bug.cgi?id=833299

> 
> Obviously I'm very confused here, so please help educate me :)
> 
> Also, you mentioned PCI passthrough with a KVM guest as being a
> problem.  Can you explain more about what makes that a problem?  I
> don't know enough about passthrough to know why a device being used by
> a KVM guest would cause more problems than a device being used
> directly by the host.

Sorry, my explanation was misleading. I mentioned PCI passthrough as
usage example of IOMMU. To assign devices to KVM guest directly using
PCI passthrough, we have to enable IOMMU for host kernel. But if you
enable IOMMU and panic happens in the host, kdump starts in the host but
it fails due to the problem I mentioned in patch description.
Did I answer your question?

Thanks,
Takao Indoh


> 
> Bjorn
> 
>> Solution:
>> All DMA transactions have to be stopped before iommu is initialized. By
>> this patch devices are reset and in-flight DMA is stopped before
>> pci_iommu_init.
>>
>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
>> root port/downstream port whose child is PCIe endpoint, and then reset
>> link between them. If the endpoint is VGA device, it is skipped because
>> the monitor blacks out if VGA controller is reset.
>>
>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>> ---
>>   Documentation/kernel-parameters.txt |    2 +
>>   drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 115 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index c3bfacb..8c9e8e4 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>                                  any pair of devices, possibly at the cost of
>>                                  reduced performance.  This also guarantees
>>                                  that hot-added devices will work.
>> +               pcie_reset_endpoint_devices     Reset PCIe endpoint on boot by
>> +                               hot reset
>>                  cbiosize=nn[KMG]        The fixed amount of bus space which is
>>                                  reserved for the CardBus bridge's IO window.
>>                                  The default value is 256 bytes.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a899d8b..70c1205 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>   }
>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>
>> +/*
>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
>> + * endpoint except VGA device.
>> + */
>> +static int __init need_reset(struct pci_dev *dev)
>> +{
>> +       struct pci_bus *subordinate;
>> +       struct pci_dev *child;
>> +
>> +       if (!pci_is_pcie(dev) || !dev->subordinate ||
>> +           list_empty(&dev->subordinate->devices) ||
>> +           ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>> +               return 0;
>> +
>> +       subordinate = dev->subordinate;
>> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +               if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>> +                   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>> +                   ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> +                       /* Don't reset switch, bridge, VGA device */
>> +                       return 0;
>> +       }
>> +
>> +       return 1;
>> +}
>> +
>> +static void __init save_downstream_configs(struct pci_dev *dev)
>> +{
>> +       struct pci_bus *subordinate;
>> +       struct pci_dev *child;
>> +
>> +       subordinate = dev->subordinate;
>> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +               dev_info(&child->dev, "save state\n");
>> +               pci_save_state(child);
>> +       }
>> +}
>> +
>> +static void __init restore_downstream_configs(struct pci_dev *dev)
>> +{
>> +       struct pci_bus *subordinate;
>> +       struct pci_dev *child;
>> +
>> +       subordinate = dev->subordinate;
>> +       list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +               dev_info(&child->dev, "restore state\n");
>> +               pci_restore_state(child);
>> +       }
>> +}
>> +
>> +static void __init do_downstream_device_reset(struct pci_dev *dev)
>> +{
>> +       u16 ctrl;
>> +
>> +       dev_info(&dev->dev, "Reset Secondary bus\n");
>> +
>> +       /* Assert Secondary Bus Reset */
>> +       pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +       ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +       pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +       /* Read config again to flush previous write */
>> +       pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> +       msleep(2);
>> +
>> +       /* De-assert Secondary Bus Reset */
>> +       ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +       pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +}
>> +
>> +struct pci_dev_entry {
>> +       struct list_head list;
>> +       struct pci_dev *dev;
>> +};
>> +static int __initdata pcie_reset_endpoint_devices;
>> +static int __init reset_pcie_endpoints(void)
>> +{
>> +       struct pci_dev *dev = NULL;
>> +       struct pci_dev_entry *pdev_entry, *tmp;
>> +       LIST_HEAD(pdev_list);
>> +
>> +       if (!pcie_reset_endpoint_devices)
>> +               return 0;
>> +
>> +       for_each_pci_dev(dev)
>> +               if (need_reset(dev)) {
>> +                       pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>> +                       pdev_entry->dev = dev;
>> +                       list_add(&pdev_entry->list, &pdev_list);
>> +               }
>> +
>> +       list_for_each_entry(pdev_entry, &pdev_list, list)
>> +               save_downstream_configs(pdev_entry->dev);
>> +
>> +       list_for_each_entry(pdev_entry, &pdev_list, list)
>> +               do_downstream_device_reset(pdev_entry->dev);
>> +
>> +       msleep(1000);
>> +
>> +       list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>> +               restore_downstream_configs(pdev_entry->dev);
>> +               kfree(pdev_entry);
>> +       }
>> +
>> +       return 0;
>> +}
>> +fs_initcall_sync(reset_pcie_endpoints);
>> +
>>   static int __init pci_setup(char *str)
>>   {
>>          while (str) {
>> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>>                                  pcie_bus_config = PCIE_BUS_PEER2PEER;
>>                          } else if (!strncmp(str, "pcie_scan_all", 13)) {
>>                                  pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>> +                       } else if (!strncmp(str, "pcie_reset_endpoint_devices",
>> +                                               27)) {
>> +                               pcie_reset_endpoint_devices = 1;
>>                          } else {
>>                                  printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>                                                  str);
>> --
>> 1.7.1
>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas June 11, 2013, 2:20 a.m. UTC | #6

On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> (2013/06/07 13:14), Bjorn Helgaas wrote:

>> One thing I'm not sure about is that you are only resetting PCIe
>> devices, but I don't think the problem is actually specific to PCIe,
>> is it?  I think the same issue could occur on any system with an
>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>> PA-RISC.
>
> Right, this is not specific to PCIe. The reasons why the target is only
> PCIe is just to make algorithm to reset simple. It is possible to reset
> legacy PCI devices in my patch, but code becomes somewhat complicated. I
> thought recently most systems used PCIe and there was little demand for
> resetting legacy PCI. Therefore I decided not to reset legacy PCI
> devices, but I'll do if there are requests :-)

I'm not sure you need to reset legacy devices (or non-PCI devices)
yet, but the current hook isn't anchored anywhere -- it's just an
fs_initcall() that doesn't give the reader any clue about the
connection between the reset and the problem it's solving.

If we do something like this patch, I think it needs to be done at the
point where we enable or disable the IOMMU.  That way, it's connected
to the important event, and there's a clue about how to make
corresponding fixes for other IOMMUs.

We already have a "reset_devices" boot option.  This is for the same
purpose, as far as I can tell, and I'm not sure there's value in
having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
fact, there's nothing specific even to PCI here.  The Intel VT-d docs
seem carefully written so they could apply to either PCIe or non-PCI
devices.

>> I tried to make a list of the interesting scenarios and the events
>> that are relevant to this problem:
>>
>>      Case 1: IOMMU off in system, off in kdump kernel
>>        system kernel leaves IOMMU off
>>          DMA targets system-kernel memory
>>        kexec to kdump kernel (IOMMU off, devices untouched)
>>          DMA targets system-kernel memory (harmless)
>>        kdump kernel re-inits device
>>          DMA targets kdump-kernel memory
>>
>>      Case 2: IOMMU off in system kernel, on in kdump kernel
>>        system kernel leaves IOMMU off
>>          DMA targets system-kernel memory
>>        kexec to kdump kernel (IOMMU off, devices untouched)
>>          DMA targets system-kernel memory (harmless)
>>        kdump kernel enables IOMMU with no valid mappings
>>          DMA causes IOMMU errors (annoying but harmless)
>>        kdump kernel re-inits device
>>          DMA targets IOMMU-mapped kdump-kernel memory
>>
>>      Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>        system kernel enables IOMMU
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kexec to kdump kernel (IOMMU on, devices untouched)
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kdump kernel doesn't know about IOMMU or doesn't touch it
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kdump kernel re-inits device
>>          kernel assumes no IOMMU, so all new DMA mappings are invalid
>> because DMAs actually do go through the IOMMU
>>          (** corruption or other non-recoverable error likely **)
>>
>>      Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>        system kernel enables IOMMU
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kexec to kdump kernel (IOMMU on, devices untouched)
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kdump kernel disables IOMMU
>>          DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>          (** corruption or other non-recoverable error likely **)
>>        kdump kernel re-inits device
>>          DMA targets kdump-kernel memory
>>
>>      Case 4: IOMMU on in system kernel, on in kdump kernel
>>        system kernel enables IOMMU
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kexec to kdump kernel (IOMMU on, devices untouched)
>>          DMA targets IOMMU-mapped system-kernel memory
>>        kdump kernel enables IOMMU with no valid mappings
>>          DMA causes IOMMU errors (annoying but harmless)
>>        kdump kernel re-inits device
>>          DMA targets IOMMU-mapped kdump-kernel memory
>
> This is not harmless. Errors like PCI SERR are detected here, and it
> makes driver or system unstable, and kdump fails. I also got report that
> system hangs up due to this.

OK, let's take this slowly.  Does an IOMMU error in the system kernel
also cause SERR or make the system unstable?  Is that the expected
behavior on IOMMU errors, or is there something special about the
kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
reported with printk and don't seem to cause an SERR.  Maybe the SERR
is system-specific behavior?

https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
report of IOMMU errors related to a driver bug where we just get
printks, not SERR.

https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
when the kdump kernel reboots (after successfully saving a crashdump).
 But it is using "iommu=pt", which I don't believe makes sense.  The
scenario is basically case 4 above, but instead of the kdump kernel
starting with no valid IOMMU mappings, it identity-maps bus addresses
to physical memory addresses.  That's completely bogus because it's
certainly not what the system kernel did, so it's entirely likely to
make the system unstable or hang.  This is not an argument for doing a
reset; it's an argument for doing something smarter than "iommu=pt" in
the kdump kernel.

We might still want to reset PCIe devices, but I want to make sure
that we're not papering over other issues when we do.  Therefore, I'd
like to understand why IOMMU errors seem harmless in some cases but
not in others.

> Case 1:  Harmless
> Case 2:  Not tested
> Case 3a: Not tested
> Case 3b: Cause problem, fixed by my patch
> Case 4:  Cause problem, fixed by my patch
>
> I have never tested case 2 and 3a, but I think it also causes problem.

I do not believe we need to support case 3b (IOMMU on in system kernel
but disabled in kdump kernel).  There is no way to make that reliable
unless every single device that may perform DMA is reset, and since
you don't reset legacy PCI or VGA devices, you're not even proposing
to do that.

I think we need to support case 1 (for systems with no IOMMU at all)
and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
the kdump kernel can detect whether the IOMMU is enabled, that should
be enough -- it could figure out automatically whether we're in case 1
or 4.

>> Do you have any bugzilla references or problem report URLs you could
>> include here?
>
> I know three Red Hat bugzilla, but I think these are private article and
> you cannot see. I'll add you Cc list in bz so that you can see.
>
> BZ#743790: Kdump fails with intel_iommu=on
> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>
> BZ#743495: Hardware error is detected with intel_iommu=on
> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>
> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
> https://bugzilla.redhat.com/show_bug.cgi?id=833299

Thanks for adding me to the CC lists.  I looked all three and I'm not
sure there's anything sensitive in them.  It'd be nice if they could
be made public if there's not.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 11, 2013, 6:08 a.m. UTC | #7

(2013/06/11 11:20), Bjorn Helgaas wrote:
> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2013/06/07 13:14), Bjorn Helgaas wrote:
> 
>>> One thing I'm not sure about is that you are only resetting PCIe
>>> devices, but I don't think the problem is actually specific to PCIe,
>>> is it?  I think the same issue could occur on any system with an
>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>> PA-RISC.
>>
>> Right, this is not specific to PCIe. The reasons why the target is only
>> PCIe is just to make algorithm to reset simple. It is possible to reset
>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>> thought recently most systems used PCIe and there was little demand for
>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>> devices, but I'll do if there are requests :-)
> 
> I'm not sure you need to reset legacy devices (or non-PCI devices)
> yet, but the current hook isn't anchored anywhere -- it's just an
> fs_initcall() that doesn't give the reader any clue about the
> connection between the reset and the problem it's solving.
> 
> If we do something like this patch, I think it needs to be done at the
> point where we enable or disable the IOMMU.  That way, it's connected
> to the important event, and there's a clue about how to make
> corresponding fixes for other IOMMUs.

Ok. pci_iommu_init() is appropriate place to add this hook?

> We already have a "reset_devices" boot option.  This is for the same
> purpose, as far as I can tell, and I'm not sure there's value in
> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
> seem carefully written so they could apply to either PCIe or non-PCI
> devices.

Yeah, I can integrate this option into reset_devices. The reason why
I separate them is to avoid regression.

I have tested my patch on many machines and basically it worked, but I
found two machines where this reset patch caused problem. The first one
was caused by bug of raid card firmware. After updating firmware, this
patch worked. The second one was due to bug of PCIe switch chip. I
reported this bug to the vendor but it is not fixed yet.

Anyway, this patch may cause problems on such a buggy machine, so I
introduced new boot parameter so that user can enable or disable this
function aside from reset_devices.

Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
reset_devices instead of adding new one, and using quirk or something to
avoid such a buggy devices.

So, basically I agree with using reset_devices, but I want to prepare
workaround in case this reset causes something wrong.


> 
>>> I tried to make a list of the interesting scenarios and the events
>>> that are relevant to this problem:
>>>
>>>       Case 1: IOMMU off in system, off in kdump kernel
>>>         system kernel leaves IOMMU off
>>>           DMA targets system-kernel memory
>>>         kexec to kdump kernel (IOMMU off, devices untouched)
>>>           DMA targets system-kernel memory (harmless)
>>>         kdump kernel re-inits device
>>>           DMA targets kdump-kernel memory
>>>
>>>       Case 2: IOMMU off in system kernel, on in kdump kernel
>>>         system kernel leaves IOMMU off
>>>           DMA targets system-kernel memory
>>>         kexec to kdump kernel (IOMMU off, devices untouched)
>>>           DMA targets system-kernel memory (harmless)
>>>         kdump kernel enables IOMMU with no valid mappings
>>>           DMA causes IOMMU errors (annoying but harmless)
>>>         kdump kernel re-inits device
>>>           DMA targets IOMMU-mapped kdump-kernel memory
>>>
>>>       Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>         system kernel enables IOMMU
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kdump kernel doesn't know about IOMMU or doesn't touch it
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kdump kernel re-inits device
>>>           kernel assumes no IOMMU, so all new DMA mappings are invalid
>>> because DMAs actually do go through the IOMMU
>>>           (** corruption or other non-recoverable error likely **)
>>>
>>>       Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>         system kernel enables IOMMU
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kdump kernel disables IOMMU
>>>           DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>           (** corruption or other non-recoverable error likely **)
>>>         kdump kernel re-inits device
>>>           DMA targets kdump-kernel memory
>>>
>>>       Case 4: IOMMU on in system kernel, on in kdump kernel
>>>         system kernel enables IOMMU
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>           DMA targets IOMMU-mapped system-kernel memory
>>>         kdump kernel enables IOMMU with no valid mappings
>>>           DMA causes IOMMU errors (annoying but harmless)
>>>         kdump kernel re-inits device
>>>           DMA targets IOMMU-mapped kdump-kernel memory
>>
>> This is not harmless. Errors like PCI SERR are detected here, and it
>> makes driver or system unstable, and kdump fails. I also got report that
>> system hangs up due to this.
> 
> OK, let's take this slowly.  Does an IOMMU error in the system kernel
> also cause SERR or make the system unstable?  Is that the expected
> behavior on IOMMU errors, or is there something special about the
> kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
> reported with printk and don't seem to cause an SERR.  Maybe the SERR
> is system-specific behavior?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
> report of IOMMU errors related to a driver bug where we just get
> printks, not SERR.

Yes, it depends on platform or devices. At least PCI SERR is detected on
Fujitsu PRIMERGY BX920S2 and TX300S6.

Intel VT-d documents[1] says:

    3.5.1 Hardware Handling of Faulting DMA Requests
    DMA requests that result in remapping faults must be blocked by
    hardware. The exact method of DMA blocking is
    implementation-specific.  For example:

    - Faulting DMA write requests may be handled in much the same way as
      hardware handles write requests to non-existent memory. For
      example, the DMA request is discarded in a manner convenient for
      implementations (such as by dropping the cycle, completing the
      write request to memory with all byte enables masked off,
      re-directed to a dummy memory location, etc.).

    - Faulting DMA read requests may be handled in much the same way as
      hardware handles read requests to non-existent memory. For
      example, the request may be redirected to a dummy memory location,
      returned as all 0’s or 1’s in a manner convenient to the
      implementation, or the request may be completed with an explicit
      error indication (preferred). For faulting DMA read requests from
      PCI Express devices, hardware must indicate "Unsupported Request"
      (UR) in the completion status field of the PCI Express read
      completion.

So, after IOMMU error, its behavior is implementation-specific.

[1]
Intel Virtualization Technology for Directed I/O Architecture
Specification Rev1.3

> 
> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
> when the kdump kernel reboots (after successfully saving a crashdump).
>   But it is using "iommu=pt", which I don't believe makes sense.  The
> scenario is basically case 4 above, but instead of the kdump kernel
> starting with no valid IOMMU mappings, it identity-maps bus addresses
> to physical memory addresses.  That's completely bogus because it's
> certainly not what the system kernel did, so it's entirely likely to
> make the system unstable or hang.  This is not an argument for doing a
> reset; it's an argument for doing something smarter than "iommu=pt" in
> the kdump kernel.
> 
> We might still want to reset PCIe devices, but I want to make sure
> that we're not papering over other issues when we do.  Therefore, I'd
> like to understand why IOMMU errors seem harmless in some cases but
> not in others.

As I wrote above, IOMMU behavior on error is up to platform/devcies. I
think it also depends on the value of Uncorrectable Error Mask Register
in AER registers of each device.

>> Case 1:  Harmless
>> Case 2:  Not tested
>> Case 3a: Not tested
>> Case 3b: Cause problem, fixed by my patch
>> Case 4:  Cause problem, fixed by my patch
>>
>> I have never tested case 2 and 3a, but I think it also causes problem.
> 
> I do not believe we need to support case 3b (IOMMU on in system kernel
> but disabled in kdump kernel).  There is no way to make that reliable
> unless every single device that may perform DMA is reset, and since
> you don't reset legacy PCI or VGA devices, you're not even proposing
> to do that.
> 
> I think we need to support case 1 (for systems with no IOMMU at all)
> and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
> the kdump kernel can detect whether the IOMMU is enabled, that should
> be enough -- it could figure out automatically whether we're in case 1
> or 4.

Ok, I completely agree.


>>> Do you have any bugzilla references or problem report URLs you could
>>> include here?
>>
>> I know three Red Hat bugzilla, but I think these are private article and
>> you cannot see. I'll add you Cc list in bz so that you can see.
>>
>> BZ#743790: Kdump fails with intel_iommu=on
>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>
>> BZ#743495: Hardware error is detected with intel_iommu=on
>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>
>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
> 
> Thanks for adding me to the CC lists.  I looked all three and I'm not
> sure there's anything sensitive in them.  It'd be nice if they could
> be made public if there's not.

I really appreciate your comments! I'll confirm I can make it public.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sumner, William June 11, 2013, 11:19 p.m. UTC | #8

>(2013/06/11 11:20), Bjorn Helgaas wrote:
>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>
>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>> is it?  I think the same issue could occur on any system with an
>>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>> PA-RISC.
>>>
>>> Right, this is not specific to PCIe. The reasons why the target is only
>>> PCIe is just to make algorithm to reset simple. It is possible to reset
>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>>> thought recently most systems used PCIe and there was little demand for
>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>> devices, but I'll do if there are requests :-)
>>
>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU.  That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
>Ok. pci_iommu_init() is appropriate place to add this hook?
>
>> We already have a "reset_devices" boot option.  This is for the same
>> purpose, as far as I can tell, and I'm not sure there's value in
>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
>> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
>> seem carefully written so they could apply to either PCIe or non-PCI
>> devices.
>
>Yeah, I can integrate this option into reset_devices. The reason why
>I separate them is to avoid regression.
>
>I have tested my patch on many machines and basically it worked, but I
>found two machines where this reset patch caused problem. The first one
>was caused by bug of raid card firmware. After updating firmware, this
>patch worked. The second one was due to bug of PCIe switch chip. I
>reported this bug to the vendor but it is not fixed yet.
>
>Anyway, this patch may cause problems on such a buggy machine, so I
>introduced new boot parameter so that user can enable or disable this
>function aside from reset_devices.
>
>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>reset_devices instead of adding new one, and using quirk or something to
>avoid such a buggy devices.

With respect to "and using quirk or something to avoid such buggy devices",
I believe that it will be necessary to provide a mechanism for devices that
need special handling to do the reset -- perhaps something like a list
of tuples: (device_type, function_to_call) with a default function_to_call
when the device_type is not found in the list.  These functions would
need to be physically separate from the device driver because if the device
is present it needs to be reset even if the crash kernel chooses not to load
the driver for that device.

>
>So, basically I agree with using reset_devices, but I want to prepare
>workaround in case this reset causes something wrong.
>
I like the ability to specify the original "reset_devices" separately from
invoking this new mechanism.  With so many different uses for Linux in
so many different environments and with so many different device drivers
it seems reasonable to keep the ability to tell the device drivers to
reset their devices -- instead of pulling the reset line on all devices.

I also like the ability to invoke the new reset feature separately from
telling the device drivers to do it.

>
>>
>>>> I tried to make a list of the interesting scenarios and the events
>>>> that are relevant to this problem:
>>>>
>>>>       Case 1: IOMMU off in system, off in kdump kernel
>>>>         system kernel leaves IOMMU off
>>>>           DMA targets system-kernel memory
>>>>         kexec to kdump kernel (IOMMU off, devices untouched)
>>>>           DMA targets system-kernel memory (harmless)
>>>>         kdump kernel re-inits device
>>>>           DMA targets kdump-kernel memory
>>>>
>>>>       Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>         system kernel leaves IOMMU off
>>>>           DMA targets system-kernel memory
>>>>         kexec to kdump kernel (IOMMU off, devices untouched)
>>>>           DMA targets system-kernel memory (harmless)
>>>>         kdump kernel enables IOMMU with no valid mappings
>>>>           DMA causes IOMMU errors (annoying but harmless)
>>>>         kdump kernel re-inits device
>>>>           DMA targets IOMMU-mapped kdump-kernel memory
>>>>
>>>>       Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>         system kernel enables IOMMU
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kdump kernel re-inits device
>>>>           kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>> because DMAs actually do go through the IOMMU
>>>>           (** corruption or other non-recoverable error likely **)
>>>>
>>>>
>>>>       Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>         system kernel enables IOMMU
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kdump kernel disables IOMMU
>>>>           DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>           (** corruption or other non-recoverable error likely **)
>>>>         kdump kernel re-inits device
>>>>           DMA targets kdump-kernel memory
>>>>
>>>>       Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>         system kernel enables IOMMU
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kexec to kdump kernel (IOMMU on, devices untouched)
>>>>           DMA targets IOMMU-mapped system-kernel memory
>>>>         kdump kernel enables IOMMU with no valid mappings
>>>>           DMA causes IOMMU errors (annoying but harmless)
>>>>         kdump kernel re-inits device
>>>>           DMA targets IOMMU-mapped kdump-kernel memory
>>>
>>> This is not harmless. Errors like PCI SERR are detected here, and it
>>> makes driver or system unstable, and kdump fails. I also got report that
>>> system hangs up due to this.
>>
>> OK, let's take this slowly.  Does an IOMMU error in the system kernel
>> also cause SERR or make the system unstable?  Is that the expected
>> behavior on IOMMU errors, or is there something special about the
>> kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>> reported with printk and don't seem to cause an SERR.  Maybe the SERR
>> is system-specific behavior?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>> report of IOMMU errors related to a driver bug where we just get
>> printks, not SERR.
>
>Yes, it depends on platform or devices. At least PCI SERR is detected on
>Fujitsu PRIMERGY BX920S2 and TX300S6.
>
>Intel VT-d documents[1] says:
>
>    3.5.1 Hardware Handling of Faulting DMA Requests
>    DMA requests that result in remapping faults must be blocked by
>    hardware. The exact method of DMA blocking is
>    implementation-specific.  For example:
>
>    - Faulting DMA write requests may be handled in much the same way as
>      hardware handles write requests to non-existent memory. For
>      example, the DMA request is discarded in a manner convenient for
>      implementations (such as by dropping the cycle, completing the
>      write request to memory with all byte enables masked off,
>      re-directed to a dummy memory location, etc.).
>
>    - Faulting DMA read requests may be handled in much the same way as
>      hardware handles read requests to non-existent memory. For
>      example, the request may be redirected to a dummy memory location,
>      returned as all 0<92>s or 1<92>s in a manner convenient to the
>      implementation, or the request may be completed with an explicit
>      error indication (preferred). For faulting DMA read requests from
>      PCI Express devices, hardware must indicate "Unsupported Request"
>      (UR) in the completion status field of the PCI Express read
>      completion.
>
>So, after IOMMU error, its behavior is implementation-specific.
>
>[1]
>Intel Virtualization Technology for Directed I/O Architecture
>Specification Rev1.3
>
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>> when the kdump kernel reboots (after successfully saving a crashdump).
>>   But it is using "iommu=pt", which I don't believe makes sense.  The
>> scenario is basically case 4 above, but instead of the kdump kernel
>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>> to physical memory addresses.  That's completely bogus because it's
>> certainly not what the system kernel did, so it's entirely likely to
>> make the system unstable or hang.  This is not an argument for doing a
>> reset; it's an argument for doing something smarter than "iommu=pt" in
>> the kdump kernel.

>> We might still want to reset PCIe devices, but I want to make sure
>> that we're not papering over other issues when we do.  Therefore, I'd
>> like to understand why IOMMU errors seem harmless in some cases but
>> not in others.
>
>As I wrote above, IOMMU behavior on error is up to platform/devcies. I
>think it also depends on the value of Uncorrectable Error Mask Register
>in AER registers of each device.
>
>>> Case 1:  Harmless
>>> Case 2:  Not tested
>>> Case 3a: Not tested
>>> Case 3b: Cause problem, fixed by my patch
>>> Case 4:  Cause problem, fixed by my patch
>>>
>>> I have never tested case 2 and 3a, but I think it also causes problem.
>>
>> I do not believe we need to support case 3b (IOMMU on in system kernel
>> but disabled in kdump kernel).  There is no way to make that reliable
>> unless every single device that may perform DMA is reset, and since
>> you don't reset legacy PCI or VGA devices, you're not even proposing
>> to do that.
>>
>> I think we need to support case 1 (for systems with no IOMMU at all)
>> and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
>> the kdump kernel can detect whether the IOMMU is enabled, that should
>> be enough -- it could figure out automatically whether we're in case 1
>> or 4.
>
>Ok, I completely agree.
>
>
>>>> Do you have any bugzilla references or problem report URLs you could
>>>> include here?
>>>
>>> I know three Red Hat bugzilla, but I think these are private article and
>>> you cannot see. I'll add you Cc list in bz so that you can see.
>>>
>>> BZ#743790: Kdump fails with intel_iommu=on
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>
>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>
>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>
>> Thanks for adding me to the CC lists.  I looked all three and I'm not
>> sure there's anything sensitive in them.  It'd be nice if they could
>> be made public if there's not.
>
>I really appreciate your comments! I'll confirm I can make it public.

I would greatly appreciate being able to see the bugzillas relating to
this patch.
>
>Thanks,
>Takao Indoh

Thinking out of the box:
Much of the discussion about dealing with the ongoing DMA leftover
from the system kernel has assumed that the crash kernel will reset
the IOMMU -- which causes various problems if done while there is any DMA
still active -- which leads to the idea of stopping all of the DMA.

Suppose the crash kernel does not reset the hardware IOMMU, but simply
detects that it is active, resets only the devices that are necessary
for the crash kernel to operate, and re-programs only the translations
for those devices.  All other translations remain the same (and remain valid)
so all leftover DMA continues into its buffer in the system kernel area
where it is harmless.  New translations needed by the kdump kernel are
added to the existing tables.

I have not yet tried this, so I am not ready to propose it as anything more
than a discussion topic at this time.

It might work this way: (A small modification to case 3a above)

IOMMU on in system kernel, kdump kernel accepts active IOMMU
   system kernel enables IOMMU
      DMA targets IOMMU-mapped system-kernel memory
   kexec to kdump kernel (IOMMU on, devices untouched)
      DMA targets IOMMU-mapped system-kernel memory
   kdump kernel detects active IOMMU and doesn't touch it
      DMA targets IOMMU-mapped system-kernel memory
   kdump kernel does not re-initialize IOMMU hardware
   kdump kernel initializes IOMMU in-memory management structures
   kdump kernel calls device drivers' standard initialization functions
      Drivers initialize their own devices -- DMA from that device stops
      When drivers request new DMA mappings, the kdump IOMMU driver:
      1. Updates its in-memory mgt structures for that device & range
      2. Updates IOMMU translate tables for that device & range
         . Translations for all other devices & ranges are unchanged
      3. Flushes IOMMU TLB to force IOMMU hardware update

   Notes:
      A. This seems very similar to case 1

      B. Only touch devices with drivers loaded into kdump kernel.
         . No need to touch devices that kdump is not going to use.

      C. This assumes that all DMA device drivers used by kdump will
         initialize the device before requesting DMA mappings.
         . Seems reasonable, but need to confirm how many do (or don't)
         . Only device drivers that might be included in the kdump
           kernel need to observe this initialization ordering.

      D. Could copy system kernel's translate tables into kdump kernel
         and adjust pointers if this feels more trustworthy than using
         the original structures where they are in the system kernel.

      E. Handle IRQ remappings in a similar manner.

Thanks,
Bill Sumner


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas June 12, 2013, 12:53 a.m. UTC | #9

[Your quoting is messed up below]
>> This is Takao's text
> This is Bill's text

On Tue, Jun 11, 2013 at 5:19 PM, Sumner, William <bill.sumner@hp.com> wrote:
>>(2013/06/11 11:20), Bjorn Helgaas wrote:
>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:

>>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>reset_devices instead of adding new one, and using quirk or something to
>>avoid such a buggy devices.
>
> With respect to "and using quirk or something to avoid such buggy devices",
> I believe that it will be necessary to provide a mechanism for devices that
> need special handling to do the reset ...

You mean something like pci_dev_specific_reset()?

>>So, basically I agree with using reset_devices, but I want to prepare
>>workaround in case this reset causes something wrong.
>>
> I like the ability to specify the original "reset_devices" separately from
> invoking this new mechanism.  With so many different uses for Linux in
> so many different environments and with so many different device drivers
> it seems reasonable to keep the ability to tell the device drivers to
> reset their devices -- instead of pulling the reset line on all devices.

The problem with adding new options for this and that is it confuses
users and fragments testing.  Users already randomly try options and
publish combinations that happen to work as "solutions."  Then we
don't get problem reports and don't get a chance to fix things
properly.  The ideal OS would have zero options and be able to figure
everything out by itself.  So that's my bias about giving users
flexibility by adding new options -- I don't like it, and I think we
have to accept some lack of flexibility to keep the system complexity
tractable  :)

> ...
> Thinking out of the box:
> Much of the discussion about dealing with the ongoing DMA leftover
> from the system kernel has assumed that the crash kernel will reset
> the IOMMU -- which causes various problems if done while there is any DMA
> still active -- which leads to the idea of stopping all of the DMA.
>
> Suppose the crash kernel does not reset the hardware IOMMU, but simply
> detects that it is active, resets only the devices that are necessary
> for the crash kernel to operate, and re-programs only the translations
> for those devices.  All other translations remain the same (and remain valid)
> so all leftover DMA continues into its buffer in the system kernel area
> where it is harmless.  New translations needed by the kdump kernel are
> added to the existing tables.
>
> I have not yet tried this, so I am not ready to propose it as anything more
> than a discussion topic at this time.
>
> It might work this way: (A small modification to case 3a above)
>
> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>    system kernel enables IOMMU
>       DMA targets IOMMU-mapped system-kernel memory
>    kexec to kdump kernel (IOMMU on, devices untouched)
>       DMA targets IOMMU-mapped system-kernel memory
>    kdump kernel detects active IOMMU and doesn't touch it
>       DMA targets IOMMU-mapped system-kernel memory
>    kdump kernel does not re-initialize IOMMU hardware
>    kdump kernel initializes IOMMU in-memory management structures
>    kdump kernel calls device drivers' standard initialization functions
>       Drivers initialize their own devices -- DMA from that device stops
>       When drivers request new DMA mappings, the kdump IOMMU driver:
>       1. Updates its in-memory mgt structures for that device & range
>       2. Updates IOMMU translate tables for that device & range
>          . Translations for all other devices & ranges are unchanged
>       3. Flushes IOMMU TLB to force IOMMU hardware update

This is certainly an interesting idea.

It would require some additional smarts in the IOMMU driver to take
over existing page tables.  Those data structures from the system
kernel are outside the memory map known to the kdump kernel, so
there'd likely be VM issues there.  The IOMMU driver would also have
to be able to reconstruct the state for the bus address space
allocator, e.g., use the I/O page tables to rebuild a bitmap of
allocated space.

We'll end up with a bunch of dma_map() calls from the system kernel
that don't have corresponding dma_unmap()s.  I guess when a driver
initializes its device, there might have to be a new interface to
"unmap any existing mappings for this device, even though I don't know
what they are."

I think it's possible, but it sounds like quite a lot of work and I'm
not sure it's worth it for this relatively limited use case.  And it's
more IOMMU-specific than a "reset devices" strategy, so most of it
would have to be done in each IOMMU driver.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas June 12, 2013, 4:45 a.m. UTC | #10

[+cc Vivek, Haren; sorry I didn't think to add you earlier]

On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> (2013/06/11 11:20), Bjorn Helgaas wrote:

>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU.  That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
> Ok. pci_iommu_init() is appropriate place to add this hook?

I looked at various IOMMU init places today, and it's far more
complicated and varied than I had hoped.

This reset scheme depends on enumerating PCI devices before we
initialize the IOMMU used by those devices.  x86 works that way today,
but not all architectures do (see the sparc pci_fire_pbm_init(), for
example).  And I think conceptually, the IOMMU should be enumerated
and initialized *before* the devices that use it.

So I'm uncomfortable with that aspect of this scheme.

It would be at least conceivable to reset the devices in the system
kernel, before the kexec.  I know we want to do as little as possible
in the crashing kernel, but it's at least a possibility, and it might
be cleaner.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Donald Dutile June 12, 2013, 1:19 p.m. UTC | #11

On 06/11/2013 07:19 PM, Sumner, William wrote:
>
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<indou.takao@jp.fujitsu.com>  wrote:
>>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>>
>>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>>> is it?  I think the same issue could occur on any system with an
>>>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>>> PA-RISC.
>>>>
>>>> Right, this is not specific to PCIe. The reasons why the target is only
>>>> PCIe is just to make algorithm to reset simple. It is possible to reset
>>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>>>> thought recently most systems used PCIe and there was little demand for
>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>>> devices, but I'll do if there are requests :-)
>>>
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU.  That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>
>>> We already have a "reset_devices" boot option.  This is for the same
>>> purpose, as far as I can tell, and I'm not sure there's value in
>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
>>> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
>>> seem carefully written so they could apply to either PCIe or non-PCI
>>> devices.
>>
>> Yeah, I can integrate this option into reset_devices. The reason why
>> I separate them is to avoid regression.
>>
>> I have tested my patch on many machines and basically it worked, but I
>> found two machines where this reset patch caused problem. The first one
>> was caused by bug of raid card firmware. After updating firmware, this
>> patch worked. The second one was due to bug of PCIe switch chip. I
>> reported this bug to the vendor but it is not fixed yet.
>>
>> Anyway, this patch may cause problems on such a buggy machine, so I
>> introduced new boot parameter so that user can enable or disable this
>> function aside from reset_devices.
>>
>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>> reset_devices instead of adding new one, and using quirk or something to
>> avoid such a buggy devices.
>
> With respect to "and using quirk or something to avoid such buggy devices",
> I believe that it will be necessary to provide a mechanism for devices that
> need special handling to do the reset -- perhaps something like a list
> of tuples: (device_type, function_to_call) with a default function_to_call
> when the device_type is not found in the list.  These functions would
> need to be physically separate from the device driver because if the device
> is present it needs to be reset even if the crash kernel chooses not to load
> the driver for that device.
>
>>
>> So, basically I agree with using reset_devices, but I want to prepare
>> workaround in case this reset causes something wrong.
>>
> I like the ability to specify the original "reset_devices" separately from
> invoking this new mechanism.  With so many different uses for Linux in
> so many different environments and with so many different device drivers
> it seems reasonable to keep the ability to tell the device drivers to
> reset their devices -- instead of pulling the reset line on all devices.
>
> I also like the ability to invoke the new reset feature separately from
> telling the device drivers to do it.
>
>>
>>>
>>>>> I tried to make a list of the interesting scenarios and the events
>>>>> that are relevant to this problem:
>>>>>
>>>>>        Case 1: IOMMU off in system, off in kdump kernel
>>>>>          system kernel leaves IOMMU off
>>>>>            DMA targets system-kernel memory
>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>            DMA targets system-kernel memory (harmless)
>>>>>          kdump kernel re-inits device
>>>>>            DMA targets kdump-kernel memory
>>>>>
>>>>>        Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>>          system kernel leaves IOMMU off
>>>>>            DMA targets system-kernel memory
>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>            DMA targets system-kernel memory (harmless)
>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>          kdump kernel re-inits device
>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>>
>>>>>        Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>>          system kernel enables IOMMU
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kdump kernel re-inits device
>>>>>            kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>>> because DMAs actually do go through the IOMMU
>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>
>>>>>
>>>>>        Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>>          system kernel enables IOMMU
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kdump kernel disables IOMMU
>>>>>            DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>          kdump kernel re-inits device
>>>>>            DMA targets kdump-kernel memory
>>>>>
>>>>>        Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>>          system kernel enables IOMMU
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>          kdump kernel re-inits device
>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>
>>>> This is not harmless. Errors like PCI SERR are detected here, and it
>>>> makes driver or system unstable, and kdump fails. I also got report that
>>>> system hangs up due to this.
>>>
>>> OK, let's take this slowly.  Does an IOMMU error in the system kernel
>>> also cause SERR or make the system unstable?  Is that the expected
>>> behavior on IOMMU errors, or is there something special about the
>>> kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>>> reported with printk and don't seem to cause an SERR.  Maybe the SERR
>>> is system-specific behavior?
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>>> report of IOMMU errors related to a driver bug where we just get
>>> printks, not SERR.
>>
>> Yes, it depends on platform or devices. At least PCI SERR is detected on
>> Fujitsu PRIMERGY BX920S2 and TX300S6.
>>
>> Intel VT-d documents[1] says:
>>
>>     3.5.1 Hardware Handling of Faulting DMA Requests
>>     DMA requests that result in remapping faults must be blocked by
>>     hardware. The exact method of DMA blocking is
>>     implementation-specific.  For example:
>>
>>     - Faulting DMA write requests may be handled in much the same way as
>>       hardware handles write requests to non-existent memory. For
>>       example, the DMA request is discarded in a manner convenient for
>>       implementations (such as by dropping the cycle, completing the
>>       write request to memory with all byte enables masked off,
>>       re-directed to a dummy memory location, etc.).
>>
>>     - Faulting DMA read requests may be handled in much the same way as
>>       hardware handles read requests to non-existent memory. For
>>       example, the request may be redirected to a dummy memory location,
>>       returned as all 0<92>s or 1<92>s in a manner convenient to the
>>       implementation, or the request may be completed with an explicit
>>       error indication (preferred). For faulting DMA read requests from
>>       PCI Express devices, hardware must indicate "Unsupported Request"
>>       (UR) in the completion status field of the PCI Express read
>>       completion.
>>
>> So, after IOMMU error, its behavior is implementation-specific.
>>
>> [1]
>> Intel Virtualization Technology for Directed I/O Architecture
>> Specification Rev1.3
>>
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>>> when the kdump kernel reboots (after successfully saving a crashdump).
>>>    But it is using "iommu=pt", which I don't believe makes sense.  The
>>> scenario is basically case 4 above, but instead of the kdump kernel
>>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>>> to physical memory addresses.  That's completely bogus because it's
>>> certainly not what the system kernel did, so it's entirely likely to
>>> make the system unstable or hang.  This is not an argument for doing a
>>> reset; it's an argument for doing something smarter than "iommu=pt" in
>>> the kdump kernel.
>
>>> We might still want to reset PCIe devices, but I want to make sure
>>> that we're not papering over other issues when we do.  Therefore, I'd
>>> like to understand why IOMMU errors seem harmless in some cases but
>>> not in others.
>>
>> As I wrote above, IOMMU behavior on error is up to platform/devcies. I
>> think it also depends on the value of Uncorrectable Error Mask Register
>> in AER registers of each device.
>>
>>>> Case 1:  Harmless
>>>> Case 2:  Not tested
>>>> Case 3a: Not tested
>>>> Case 3b: Cause problem, fixed by my patch
>>>> Case 4:  Cause problem, fixed by my patch
>>>>
>>>> I have never tested case 2 and 3a, but I think it also causes problem.
>>>
>>> I do not believe we need to support case 3b (IOMMU on in system kernel
>>> but disabled in kdump kernel).  There is no way to make that reliable
>>> unless every single device that may perform DMA is reset, and since
>>> you don't reset legacy PCI or VGA devices, you're not even proposing
>>> to do that.
>>>
>>> I think we need to support case 1 (for systems with no IOMMU at all)
>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
>>> the kdump kernel can detect whether the IOMMU is enabled, that should
>>> be enough -- it could figure out automatically whether we're in case 1
>>> or 4.
>>
>> Ok, I completely agree.
>>
>>
>>>>> Do you have any bugzilla references or problem report URLs you could
>>>>> include here?
>>>>
>>>> I know three Red Hat bugzilla, but I think these are private article and
>>>> you cannot see. I'll add you Cc list in bz so that you can see.
>>>>
>>>> BZ#743790: Kdump fails with intel_iommu=on
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>>
>>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>>
>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>>
>>> Thanks for adding me to the CC lists.  I looked all three and I'm not
>>> sure there's anything sensitive in them.  It'd be nice if they could
>>> be made public if there's not.
>>
>> I really appreciate your comments! I'll confirm I can make it public.
>
> I would greatly appreciate being able to see the bugzillas relating to
> this patch.
>>
>> Thanks,
>> Takao Indoh
>
> Thinking out of the box:
> Much of the discussion about dealing with the ongoing DMA leftover
> from the system kernel has assumed that the crash kernel will reset
> the IOMMU -- which causes various problems if done while there is any DMA
> still active -- which leads to the idea of stopping all of the DMA.
>
> Suppose the crash kernel does not reset the hardware IOMMU, but simply
> detects that it is active, resets only the devices that are necessary
> for the crash kernel to operate, and re-programs only the translations

This suggestion brings up this one:
Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel
hook to the IOMMU fault handler.  While kdump kernel re-initing, IOMMU faults are grabbed
by the kexec'd kernel and the device is reset. IOW, directed, kdump device reset.
If no faults, then device is idle, or re-init'd and using new maps, and
all is well.  Once kdump kernel fully init'd, then just return from
kdump kernel callout, and let system do its fault handling.
It can be made mostly common (reset code in kexec mod under drivers/iommu),
with simple calls out from each IOMMU fault handler.

Of course, this assumes that systems don't turn IOMMU faults into system SERRs,
and crash the system.  IMO, as I've stated to a number of system developers,
IOMMU (mapping) faults should not crash the system -- they already isolate, and
prevent corruption, so worse case, some part of the system will stop doing I/O,
and that will either get retried, or aborted and a cleaner panic (and kdump
kernel switch) will ensue.

> for those devices.  All other translations remain the same (and remain valid)
> so all leftover DMA continues into its buffer in the system kernel area
> where it is harmless.  New translations needed by the kdump kernel are
> added to the existing tables.
>
A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped.
I would expect the kdump kernel to reset devices & acquire new dma mappings
upon reboot.   Thus, the only issue is how to 'throttle' IOMMU faults, and
not allow them to crash systems while the system is recovering/resetting itself,
but it's not one big (power) reset to cause it.

> I have not yet tried this, so I am not ready to propose it as anything more
> than a discussion topic at this time.
>
> It might work this way: (A small modification to case 3a above)
>
> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>     system kernel enables IOMMU
>        DMA targets IOMMU-mapped system-kernel memory
>     kexec to kdump kernel (IOMMU on, devices untouched)
>        DMA targets IOMMU-mapped system-kernel memory
it may not, it may be bad/bogus device I/O causing the crash...

>     kdump kernel detects active IOMMU and doesn't touch it
>        DMA targets IOMMU-mapped system-kernel memory
>     kdump kernel does not re-initialize IOMMU hardware
>     kdump kernel initializes IOMMU in-memory management structures
>     kdump kernel calls device drivers' standard initialization functions
>        Drivers initialize their own devices -- DMA from that device stops
>        When drivers request new DMA mappings, the kdump IOMMU driver:
>        1. Updates its in-memory mgt structures for that device&  range
>        2. Updates IOMMU translate tables for that device&  range
>           . Translations for all other devices&  ranges are unchanged
>        3. Flushes IOMMU TLB to force IOMMU hardware update
>
>     Notes:
>        A. This seems very similar to case 1
>
>        B. Only touch devices with drivers loaded into kdump kernel.
>           . No need to touch devices that kdump is not going to use.
>
>        C. This assumes that all DMA device drivers used by kdump will
>           initialize the device before requesting DMA mappings.
>           . Seems reasonable, but need to confirm how many do (or don't)
>           . Only device drivers that might be included in the kdump
>             kernel need to observe this initialization ordering.
>
>        D. Could copy system kernel's translate tables into kdump kernel
>           and adjust pointers if this feels more trustworthy than using
>           the original structures where they are in the system kernel.
>
>        E. Handle IRQ remappings in a similar manner.
An even more serious attack vector: flood system w/interrupts (by assigned device in a guest),
effectively creating a DoS, b/c intrs caused raised (hw) ipl, while bogus DMA does
not cause system ipl elevation(blockage of other system progress), except when it
generates IOMMU faults which become intrs.
hmmm, wondering if another solution is to flip IOMMU fault handling into a poll-mode
at kdump init time, and then flip it to intr-driven, once I/O has been completely init'd ?
Per-device fault throttling would be a nice hw feature! (wishful thinking)


>
> Thanks,
> Bill Sumner
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 13, 2013, 2:44 a.m. UTC | #12

(2013/06/12 13:45), Bjorn Helgaas wrote:
> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
> 
> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
> 
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU.  That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. pci_iommu_init() is appropriate place to add this hook?
> 
> I looked at various IOMMU init places today, and it's far more
> complicated and varied than I had hoped.
> 
> This reset scheme depends on enumerating PCI devices before we
> initialize the IOMMU used by those devices.  x86 works that way today,
> but not all architectures do (see the sparc pci_fire_pbm_init(), for

Sorry, could you tell me which part depends on architecture?

> example).  And I think conceptually, the IOMMU should be enumerated
> and initialized *before* the devices that use it.
> 
> So I'm uncomfortable with that aspect of this scheme.
> 
> It would be at least conceivable to reset the devices in the system
> kernel, before the kexec.  I know we want to do as little as possible
> in the crashing kernel, but it's at least a possibility, and it might
> be cleaner.

I bet this will be not accepted by kdump maintainer. Everything in panic
kernel is unreliable.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 13, 2013, 3:25 a.m. UTC | #13

(2013/06/12 22:19), Don Dutile wrote:
> On 06/11/2013 07:19 PM, Sumner, William wrote:
>>
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<indou.takao@jp.fujitsu.com>  wrote:
>>>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>>>
>>>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>>>> is it?  I think the same issue could occur on any system with an
>>>>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>>>> PA-RISC.
>>>>>
>>>>> Right, this is not specific to PCIe. The reasons why the target is only
>>>>> PCIe is just to make algorithm to reset simple. It is possible to reset
>>>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>>>>> thought recently most systems used PCIe and there was little demand for
>>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>>>> devices, but I'll do if there are requests :-)
>>>>
>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>> fs_initcall() that doesn't give the reader any clue about the
>>>> connection between the reset and the problem it's solving.
>>>>
>>>> If we do something like this patch, I think it needs to be done at the
>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>> to the important event, and there's a clue about how to make
>>>> corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
>>>> We already have a "reset_devices" boot option.  This is for the same
>>>> purpose, as far as I can tell, and I'm not sure there's value in
>>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
>>>> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
>>>> seem carefully written so they could apply to either PCIe or non-PCI
>>>> devices.
>>>
>>> Yeah, I can integrate this option into reset_devices. The reason why
>>> I separate them is to avoid regression.
>>>
>>> I have tested my patch on many machines and basically it worked, but I
>>> found two machines where this reset patch caused problem. The first one
>>> was caused by bug of raid card firmware. After updating firmware, this
>>> patch worked. The second one was due to bug of PCIe switch chip. I
>>> reported this bug to the vendor but it is not fixed yet.
>>>
>>> Anyway, this patch may cause problems on such a buggy machine, so I
>>> introduced new boot parameter so that user can enable or disable this
>>> function aside from reset_devices.
>>>
>>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>> reset_devices instead of adding new one, and using quirk or something to
>>> avoid such a buggy devices.
>>
>> With respect to "and using quirk or something to avoid such buggy devices",
>> I believe that it will be necessary to provide a mechanism for devices that
>> need special handling to do the reset -- perhaps something like a list
>> of tuples: (device_type, function_to_call) with a default function_to_call
>> when the device_type is not found in the list.  These functions would
>> need to be physically separate from the device driver because if the device
>> is present it needs to be reset even if the crash kernel chooses not to load
>> the driver for that device.
>>
>>>
>>> So, basically I agree with using reset_devices, but I want to prepare
>>> workaround in case this reset causes something wrong.
>>>
>> I like the ability to specify the original "reset_devices" separately from
>> invoking this new mechanism.  With so many different uses for Linux in
>> so many different environments and with so many different device drivers
>> it seems reasonable to keep the ability to tell the device drivers to
>> reset their devices -- instead of pulling the reset line on all devices.
>>
>> I also like the ability to invoke the new reset feature separately from
>> telling the device drivers to do it.
>>
>>>
>>>>
>>>>>> I tried to make a list of the interesting scenarios and the events
>>>>>> that are relevant to this problem:
>>>>>>
>>>>>>        Case 1: IOMMU off in system, off in kdump kernel
>>>>>>          system kernel leaves IOMMU off
>>>>>>            DMA targets system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>>            DMA targets system-kernel memory (harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets kdump-kernel memory
>>>>>>
>>>>>>        Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>>>          system kernel leaves IOMMU off
>>>>>>            DMA targets system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>>            DMA targets system-kernel memory (harmless)
>>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>>>
>>>>>>        Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel re-inits device
>>>>>>            kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>>>> because DMAs actually do go through the IOMMU
>>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>>
>>>>>>
>>>>>>        Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel disables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets kdump-kernel memory
>>>>>>
>>>>>>        Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>>
>>>>> This is not harmless. Errors like PCI SERR are detected here, and it
>>>>> makes driver or system unstable, and kdump fails. I also got report that
>>>>> system hangs up due to this.
>>>>
>>>> OK, let's take this slowly.  Does an IOMMU error in the system kernel
>>>> also cause SERR or make the system unstable?  Is that the expected
>>>> behavior on IOMMU errors, or is there something special about the
>>>> kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
>>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>>>> reported with printk and don't seem to cause an SERR.  Maybe the SERR
>>>> is system-specific behavior?
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>>>> report of IOMMU errors related to a driver bug where we just get
>>>> printks, not SERR.
>>>
>>> Yes, it depends on platform or devices. At least PCI SERR is detected on
>>> Fujitsu PRIMERGY BX920S2 and TX300S6.
>>>
>>> Intel VT-d documents[1] says:
>>>
>>>     3.5.1 Hardware Handling of Faulting DMA Requests
>>>     DMA requests that result in remapping faults must be blocked by
>>>     hardware. The exact method of DMA blocking is
>>>     implementation-specific.  For example:
>>>
>>>     - Faulting DMA write requests may be handled in much the same way as
>>>       hardware handles write requests to non-existent memory. For
>>>       example, the DMA request is discarded in a manner convenient for
>>>       implementations (such as by dropping the cycle, completing the
>>>       write request to memory with all byte enables masked off,
>>>       re-directed to a dummy memory location, etc.).
>>>
>>>     - Faulting DMA read requests may be handled in much the same way as
>>>       hardware handles read requests to non-existent memory. For
>>>       example, the request may be redirected to a dummy memory location,
>>>       returned as all 0<92>s or 1<92>s in a manner convenient to the
>>>       implementation, or the request may be completed with an explicit
>>>       error indication (preferred). For faulting DMA read requests from
>>>       PCI Express devices, hardware must indicate "Unsupported Request"
>>>       (UR) in the completion status field of the PCI Express read
>>>       completion.
>>>
>>> So, after IOMMU error, its behavior is implementation-specific.
>>>
>>> [1]
>>> Intel Virtualization Technology for Directed I/O Architecture
>>> Specification Rev1.3
>>>
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>>>> when the kdump kernel reboots (after successfully saving a crashdump).
>>>>    But it is using "iommu=pt", which I don't believe makes sense.  The
>>>> scenario is basically case 4 above, but instead of the kdump kernel
>>>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>>>> to physical memory addresses.  That's completely bogus because it's
>>>> certainly not what the system kernel did, so it's entirely likely to
>>>> make the system unstable or hang.  This is not an argument for doing a
>>>> reset; it's an argument for doing something smarter than "iommu=pt" in
>>>> the kdump kernel.
>>
>>>> We might still want to reset PCIe devices, but I want to make sure
>>>> that we're not papering over other issues when we do.  Therefore, I'd
>>>> like to understand why IOMMU errors seem harmless in some cases but
>>>> not in others.
>>>
>>> As I wrote above, IOMMU behavior on error is up to platform/devcies. I
>>> think it also depends on the value of Uncorrectable Error Mask Register
>>> in AER registers of each device.
>>>
>>>>> Case 1:  Harmless
>>>>> Case 2:  Not tested
>>>>> Case 3a: Not tested
>>>>> Case 3b: Cause problem, fixed by my patch
>>>>> Case 4:  Cause problem, fixed by my patch
>>>>>
>>>>> I have never tested case 2 and 3a, but I think it also causes problem.
>>>>
>>>> I do not believe we need to support case 3b (IOMMU on in system kernel
>>>> but disabled in kdump kernel).  There is no way to make that reliable
>>>> unless every single device that may perform DMA is reset, and since
>>>> you don't reset legacy PCI or VGA devices, you're not even proposing
>>>> to do that.
>>>>
>>>> I think we need to support case 1 (for systems with no IOMMU at all)
>>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
>>>> the kdump kernel can detect whether the IOMMU is enabled, that should
>>>> be enough -- it could figure out automatically whether we're in case 1
>>>> or 4.
>>>
>>> Ok, I completely agree.
>>>
>>>
>>>>>> Do you have any bugzilla references or problem report URLs you could
>>>>>> include here?
>>>>>
>>>>> I know three Red Hat bugzilla, but I think these are private article and
>>>>> you cannot see. I'll add you Cc list in bz so that you can see.
>>>>>
>>>>> BZ#743790: Kdump fails with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>>>
>>>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>>>
>>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>>>
>>>> Thanks for adding me to the CC lists.  I looked all three and I'm not
>>>> sure there's anything sensitive in them.  It'd be nice if they could
>>>> be made public if there's not.
>>>
>>> I really appreciate your comments! I'll confirm I can make it public.
>>
>> I would greatly appreciate being able to see the bugzillas relating to
>> this patch.
>>>
>>> Thanks,
>>> Takao Indoh
>>
>> Thinking out of the box:
>> Much of the discussion about dealing with the ongoing DMA leftover
>> from the system kernel has assumed that the crash kernel will reset
>> the IOMMU -- which causes various problems if done while there is any DMA
>> still active -- which leads to the idea of stopping all of the DMA.
>>
>> Suppose the crash kernel does not reset the hardware IOMMU, but simply
>> detects that it is active, resets only the devices that are necessary
>> for the crash kernel to operate, and re-programs only the translations
> 
> This suggestion brings up this one:
> Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel
> hook to the IOMMU fault handler.  While kdump kernel re-initing, IOMMU faults are grabbed
> by the kexec'd kernel and the device is reset. IOW, directed, kdump device reset.
> If no faults, then device is idle, or re-init'd and using new maps, and
> all is well.  Once kdump kernel fully init'd, then just return from
> kdump kernel callout, and let system do its fault handling.
> It can be made mostly common (reset code in kexec mod under drivers/iommu),
> with simple calls out from each IOMMU fault handler.

What I tried before is:
1) Prepare work queue handler
2) In IOMMU fault handler, wake up the work queue handler
3) In the work queue handler, reset devices against the error
   source.

As a result, this method did not work. The device are reset during its
initialization. Please take a look at the message below.  This is a
message when megaraid_sas driver is loaded in second kernel.

megasas: 00.00.05.40-rh2 Thu. Aug. 4 17:00:00 PDT 2011
megaraid_sas 0000:01:00.0: resetting MSI-X
megasas: 0x1000:0x0073:0x1734:0x1177: bus 1:slot 0:func 0
megaraid_sas 0000:01:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
megaraid_sas 0000:01:00.0: setting latency timer to 64
megasas: Waiting for FW to come to ready state
DRHD: handling fault status reg 702
DMAR:[DMA Write] Request device [01:00.0] fault addr ffff9000
DMAR:[fault reason 05] PTE Write access is not set
megasas: FW now in Ready state
  alloc irq_desc for 51 on node -1
  alloc kstat_irqs on node -1
megaraid_sas 0000:01:00.0: irq 51 for MSI/MSI-X
megasas_init_mfi: fw_support_ieee=67108864


Note that DMAR error is detected during driver initialization. So when I
added reset method which I mentioned above, the device is reset here,
and after that, megasas could not find its disks and kdump failed.

Thanks,
Takao Indoh


> 
> Of course, this assumes that systems don't turn IOMMU faults into system SERRs,
> and crash the system.  IMO, as I've stated to a number of system developers,
> IOMMU (mapping) faults should not crash the system -- they already isolate, and
> prevent corruption, so worse case, some part of the system will stop doing I/O,
> and that will either get retried, or aborted and a cleaner panic (and kdump
> kernel switch) will ensue.
> 
>> for those devices.  All other translations remain the same (and remain valid)
>> so all leftover DMA continues into its buffer in the system kernel area
>> where it is harmless.  New translations needed by the kdump kernel are
>> added to the existing tables.
>>
> A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped.
> I would expect the kdump kernel to reset devices & acquire new dma mappings
> upon reboot.   Thus, the only issue is how to 'throttle' IOMMU faults, and
> not allow them to crash systems while the system is recovering/resetting itself,
> but it's not one big (power) reset to cause it.
> 
>> I have not yet tried this, so I am not ready to propose it as anything more
>> than a discussion topic at this time.
>>
>> It might work this way: (A small modification to case 3a above)
>>
>> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>>     system kernel enables IOMMU
>>        DMA targets IOMMU-mapped system-kernel memory
>>     kexec to kdump kernel (IOMMU on, devices untouched)
>>        DMA targets IOMMU-mapped system-kernel memory
> it may not, it may be bad/bogus device I/O causing the crash...
> 
>>     kdump kernel detects active IOMMU and doesn't touch it
>>        DMA targets IOMMU-mapped system-kernel memory
>>     kdump kernel does not re-initialize IOMMU hardware
>>     kdump kernel initializes IOMMU in-memory management structures
>>     kdump kernel calls device drivers' standard initialization functions
>>        Drivers initialize their own devices -- DMA from that device stops
>>        When drivers request new DMA mappings, the kdump IOMMU driver:
>>        1. Updates its in-memory mgt structures for that device&  range
>>        2. Updates IOMMU translate tables for that device&  range
>>           . Translations for all other devices&  ranges are unchanged
>>        3. Flushes IOMMU TLB to force IOMMU hardware update
>>
>>     Notes:
>>        A. This seems very similar to case 1
>>
>>        B. Only touch devices with drivers loaded into kdump kernel.
>>           . No need to touch devices that kdump is not going to use.
>>
>>        C. This assumes that all DMA device drivers used by kdump will
>>           initialize the device before requesting DMA mappings.
>>           . Seems reasonable, but need to confirm how many do (or don't)
>>           . Only device drivers that might be included in the kdump
>>             kernel need to observe this initialization ordering.
>>
>>        D. Could copy system kernel's translate tables into kdump kernel
>>           and adjust pointers if this feels more trustworthy than using
>>           the original structures where they are in the system kernel.
>>
>>        E. Handle IRQ remappings in a similar manner.
> An even more serious attack vector: flood system w/interrupts (by assigned device in a guest),
> effectively creating a DoS, b/c intrs caused raised (hw) ipl, while bogus DMA does
> not cause system ipl elevation(blockage of other system progress), except when it
> generates IOMMU faults which become intrs.
> hmmm, wondering if another solution is to flip IOMMU fault handling into a poll-mode
> at kdump init time, and then flip it to intr-driven, once I/O has been completely init'd ?
> Per-device fault throttling would be a nice hw feature! (wishful thinking)
> 
> 
>>
>> Thanks,
>> Bill Sumner
>>
>>
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas June 13, 2013, 3:41 a.m. UTC | #14

On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> (2013/06/12 13:45), Bjorn Helgaas wrote:
>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>
>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>> <indou.takao@jp.fujitsu.com> wrote:
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>
>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>> fs_initcall() that doesn't give the reader any clue about the
>>>> connection between the reset and the problem it's solving.
>>>>
>>>> If we do something like this patch, I think it needs to be done at the
>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>> to the important event, and there's a clue about how to make
>>>> corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>
>> I looked at various IOMMU init places today, and it's far more
>> complicated and varied than I had hoped.
>>
>> This reset scheme depends on enumerating PCI devices before we
>> initialize the IOMMU used by those devices.  x86 works that way today,
>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>
> Sorry, could you tell me which part depends on architecture?

Your patch works if PCIe devices are reset before the kdump kernel
enables the IOMMU.  On x86, this is possible because PCI enumeration
happens before the IOMMU initialization.  On sparc, the IOMMU is
initialized before PCI devices are enumerated, so there would still be
a window where ongoing DMA could cause an IOMMU error.

Of course, it might be possible to reorganize the sparc code to to the
IOMMU init *after* it enumerates PCI devices.  But I think that change
would be hard to justify.

And I think even on x86, it would be better if we did the IOMMU init
before PCI enumeration -- the PCI devices depend on the IOMMU, so
logically the IOMMU should be initialized first so the PCI devices can
be associated with it as they are enumerated.

>> example).  And I think conceptually, the IOMMU should be enumerated
>> and initialized *before* the devices that use it.
>>
>> So I'm uncomfortable with that aspect of this scheme.
>>
>> It would be at least conceivable to reset the devices in the system
>> kernel, before the kexec.  I know we want to do as little as possible
>> in the crashing kernel, but it's at least a possibility, and it might
>> be cleaner.
>
> I bet this will be not accepted by kdump maintainer. Everything in panic
> kernel is unreliable.

kdump is inherently unreliable.  The kdump kernel doesn't start from
an arbitrary machine state.  We don't expect it to tolerate all CPUs
running, for example.  Maybe it should be expected to tolerate PCI
devices running, either.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh June 14, 2013, 2:11 a.m. UTC | #15

(2013/06/13 12:41), Bjorn Helgaas wrote:
> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>
>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>> <indou.takao@jp.fujitsu.com> wrote:
>>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>
>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>>> fs_initcall() that doesn't give the reader any clue about the
>>>>> connection between the reset and the problem it's solving.
>>>>>
>>>>> If we do something like this patch, I think it needs to be done at the
>>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>>> to the important event, and there's a clue about how to make
>>>>> corresponding fixes for other IOMMUs.
>>>>
>>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
>>> I looked at various IOMMU init places today, and it's far more
>>> complicated and varied than I had hoped.
>>>
>>> This reset scheme depends on enumerating PCI devices before we
>>> initialize the IOMMU used by those devices.  x86 works that way today,
>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>
>> Sorry, could you tell me which part depends on architecture?
> 
> Your patch works if PCIe devices are reset before the kdump kernel
> enables the IOMMU.  On x86, this is possible because PCI enumeration
> happens before the IOMMU initialization.  On sparc, the IOMMU is
> initialized before PCI devices are enumerated, so there would still be
> a window where ongoing DMA could cause an IOMMU error.

Ok, understood, thanks.

Hmmm, it seems to be difficult to find out method which is common to
all architectures. So, what I can do for now is introducing reset scheme
which is only for x86.

1) Change this patch so that it work only on x86 platform. For example
   call this reset code from x86_init.iommu.iommu_init() instead of
   fs_initcall.

Or another idea is:

2) Enumerate PCI devices in IOMMU layer. That is:
   PCI layer
     Just provide interface to reset given strcut pci_dev. Maybe
     pci_reset_function() looks good for this purpose.
   IOMMU layer
     Determine which devices should be reset. On kernel boot, check if
     IOMMU is already active or not, and if active, check IOMMU page
     table and reset devices whose entry exists there.

> Of course, it might be possible to reorganize the sparc code to to the
> IOMMU init *after* it enumerates PCI devices.  But I think that change
> would be hard to justify.
> 
> And I think even on x86, it would be better if we did the IOMMU init
> before PCI enumeration -- the PCI devices depend on the IOMMU, so
> logically the IOMMU should be initialized first so the PCI devices can
> be associated with it as they are enumerated.

So third idea is:

3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
   need to implement new code to enumerate PCI devices and reset them
   for this purpose.

Idea 2 is not difficult to implement, but one problem is that this
method may be dangerous. We need to scan IOMMU page table which is used
in previous kernel, but it may be broken. Idea 3 seems to be difficult
to implement...


> 
>>> example).  And I think conceptually, the IOMMU should be enumerated
>>> and initialized *before* the devices that use it.
>>>
>>> So I'm uncomfortable with that aspect of this scheme.
>>>
>>> It would be at least conceivable to reset the devices in the system
>>> kernel, before the kexec.  I know we want to do as little as possible
>>> in the crashing kernel, but it's at least a possibility, and it might
>>> be cleaner.
>>
>> I bet this will be not accepted by kdump maintainer. Everything in panic
>> kernel is unreliable.
> 
> kdump is inherently unreliable.  The kdump kernel doesn't start from
> an arbitrary machine state.  We don't expect it to tolerate all CPUs
> running, for example.  Maybe it should be expected to tolerate PCI
> devices running, either.

What I wanted to say is that any resources of first kernel are
unreliable. Under panic situation, struct pci_dev tree may be broken, or
pci_lock may be already hold by someone, etc. So, if we do this in first
kernel, maybe kdump needs its own code to enumerate PCI devices and
reset them.

Vivek?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh July 24, 2013, 6:29 a.m. UTC | #16

Sorry for letting this discussion slide, I was busy on other works:-(
Anyway, the summary of previous discussion is:
- My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
  boot. This expects PCI enumeration is done before IOMMU
  initialization as follows.
    (1) PCI enumeration
    (2) fs_initcall ---> device reset
    (3) IOMMU initialization
- This works on x86, but does not work on other architecture because
  IOMMU is initialized before PCI enumeration on some architectures. So,
  device reset should be done where IOMMU is initialized instead of
  initcall.
- Or, as another idea, we can reset devices in first kernel(panic kernel)

Resetting devices in panic kernel is against kdump policy and seems not to
be good idea. So I think adding reset code into iommu initialization is
better. I'll post patches for that.

Another discussion point is how to handle buggy devices. Resetting buggy
devices makes system more unstable. One of ideas is using boot parameter
so that user can choose to reset devices or not. An existed parameter
"reset_devices" is not helpful for this purpose because it is always
necessary for kdump and user cannot get rid of it. Introducing new boot
parameter seems to be unpopular for maintainers. Any ideas?

Thanks,
Takao Indoh

(2013/06/14 11:11), Takao Indoh wrote:
> (2013/06/13 12:41), Bjorn Helgaas wrote:
>> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>>
>>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>>> <indou.takao@jp.fujitsu.com> wrote:
>>>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>>
>>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>>>> fs_initcall() that doesn't give the reader any clue about the
>>>>>> connection between the reset and the problem it's solving.
>>>>>>
>>>>>> If we do something like this patch, I think it needs to be done at the
>>>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>>>> to the important event, and there's a clue about how to make
>>>>>> corresponding fixes for other IOMMUs.
>>>>>
>>>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>>
>>>> I looked at various IOMMU init places today, and it's far more
>>>> complicated and varied than I had hoped.
>>>>
>>>> This reset scheme depends on enumerating PCI devices before we
>>>> initialize the IOMMU used by those devices.  x86 works that way today,
>>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>>
>>> Sorry, could you tell me which part depends on architecture?
>>
>> Your patch works if PCIe devices are reset before the kdump kernel
>> enables the IOMMU.  On x86, this is possible because PCI enumeration
>> happens before the IOMMU initialization.  On sparc, the IOMMU is
>> initialized before PCI devices are enumerated, so there would still be
>> a window where ongoing DMA could cause an IOMMU error.
> 
> Ok, understood, thanks.
> 
> Hmmm, it seems to be difficult to find out method which is common to
> all architectures. So, what I can do for now is introducing reset scheme
> which is only for x86.
> 
> 1) Change this patch so that it work only on x86 platform. For example
>     call this reset code from x86_init.iommu.iommu_init() instead of
>     fs_initcall.
> 
> Or another idea is:
> 
> 2) Enumerate PCI devices in IOMMU layer. That is:
>     PCI layer
>       Just provide interface to reset given strcut pci_dev. Maybe
>       pci_reset_function() looks good for this purpose.
>     IOMMU layer
>       Determine which devices should be reset. On kernel boot, check if
>       IOMMU is already active or not, and if active, check IOMMU page
>       table and reset devices whose entry exists there.
> 
>> Of course, it might be possible to reorganize the sparc code to to the
>> IOMMU init *after* it enumerates PCI devices.  But I think that change
>> would be hard to justify.
>>
>> And I think even on x86, it would be better if we did the IOMMU init
>> before PCI enumeration -- the PCI devices depend on the IOMMU, so
>> logically the IOMMU should be initialized first so the PCI devices can
>> be associated with it as they are enumerated.
> 
> So third idea is:
> 
> 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
>     need to implement new code to enumerate PCI devices and reset them
>     for this purpose.
> 
> Idea 2 is not difficult to implement, but one problem is that this
> method may be dangerous. We need to scan IOMMU page table which is used
> in previous kernel, but it may be broken. Idea 3 seems to be difficult
> to implement...
> 
> 
>>
>>>> example).  And I think conceptually, the IOMMU should be enumerated
>>>> and initialized *before* the devices that use it.
>>>>
>>>> So I'm uncomfortable with that aspect of this scheme.
>>>>
>>>> It would be at least conceivable to reset the devices in the system
>>>> kernel, before the kexec.  I know we want to do as little as possible
>>>> in the crashing kernel, but it's at least a possibility, and it might
>>>> be cleaner.
>>>
>>> I bet this will be not accepted by kdump maintainer. Everything in panic
>>> kernel is unreliable.
>>
>> kdump is inherently unreliable.  The kdump kernel doesn't start from
>> an arbitrary machine state.  We don't expect it to tolerate all CPUs
>> running, for example.  Maybe it should be expected to tolerate PCI
>> devices running, either.
> 
> What I wanted to say is that any resources of first kernel are
> unreliable. Under panic situation, struct pci_dev tree may be broken, or
> pci_lock may be already hold by someone, etc. So, if we do this in first
> kernel, maybe kdump needs its own code to enumerate PCI devices and
> reset them.
> 
> Vivek?
> 
> Thanks,
> Takao Indoh
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vivek Goyal July 25, 2013, 2:24 p.m. UTC | #17

On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
>     (1) PCI enumeration
>     (2) fs_initcall ---> device reset
>     (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
> 
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

I don't understand all the details but I agree that idea of trying to
reset IOMMU in crashed kernel might not fly.

> 
> Another discussion point is how to handle buggy devices. Resetting buggy
> devices makes system more unstable. One of ideas is using boot parameter
> so that user can choose to reset devices or not.

So who would decide which device is buggy and don't reset it. Give
some details here.

Can't we simply blacklist associated module, so that it never loads
and then it never tries to reset the devices?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas July 25, 2013, 5 p.m. UTC | #18

On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
>     (1) PCI enumeration
>     (2) fs_initcall ---> device reset
>     (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

Of course nobody *wants* to do anything in the panic kernel.  But
simply saying "it's against kdump policy and seems not to be a good
idea" is not a technical argument.  There are things that are
impractical to do in the kdump kernel, so they have to be done in the
panic kernel even though we know the kernel is unreliable and the
attempt may fail.

My point about IOMMU and PCI initialization order doesn't go away just
because it doesn't fit "kdump policy."  Having system initialization
occur in a logical order is far more important than making kdump work.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh July 29, 2013, 12:20 a.m. UTC | #19

(2013/07/25 23:24), Vivek Goyal wrote:
> On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>    boot. This expects PCI enumeration is done before IOMMU
>>    initialization as follows.
>>      (1) PCI enumeration
>>      (2) fs_initcall ---> device reset
>>      (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>>    IOMMU is initialized before PCI enumeration on some architectures. So,
>>    device reset should be done where IOMMU is initialized instead of
>>    initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
> 
> I don't understand all the details but I agree that idea of trying to
> reset IOMMU in crashed kernel might not fly.
> 
>>
>> Another discussion point is how to handle buggy devices. Resetting buggy
>> devices makes system more unstable. One of ideas is using boot parameter
>> so that user can choose to reset devices or not.
> 
> So who would decide which device is buggy and don't reset it. Give
> some details here.

I found the case that kdump does not work after resetting devices and
it works when removing reset patch. The cause of problem is a bug of
PCIe switch chip. If there is boot parameter not to reset devices,
user can use it as workaround.

I think in this case we should add PCI quirk to avoid this buggy
hardware, but we need to wait errata from vendor and it basically takes
long time.

> 
> Can't we simply blacklist associated module, so that it never loads
> and then it never tries to reset the devices?
> 

So you mean that device reset should be done on its driver loading?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh July 29, 2013, 12:37 a.m. UTC | #20

(2013/07/26 2:00), Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>    boot. This expects PCI enumeration is done before IOMMU
>>    initialization as follows.
>>      (1) PCI enumeration
>>      (2) fs_initcall ---> device reset
>>      (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>>    IOMMU is initialized before PCI enumeration on some architectures. So,
>>    device reset should be done where IOMMU is initialized instead of
>>    initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
> 
> Of course nobody *wants* to do anything in the panic kernel.  But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument.  There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

Accessing kernel data in panic kernel causes panic again, so
- Don't touch kernel data in panic situation
- Jump to kdump kernel as quickly as possible, and do things in safe
  kernel
These are basic "kdump policy". Of course if there are any works which
we cannot do in kdump kernel and can do only in panic kernel, for
example saving registers or stopping cpus, we should do them in panic
kernel.

Resetting devices in panic kernel is worth considering if we can safely
find pci_dev and reset it, but I have no idea how to do that because
for example struct pci_dev may be borken.

> 
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy."  Having system initialization
> occur in a logical order is far more important than making kdump work.

My next plan is as follows. I think this is matched to logical order
on boot.

drivers/pci/pci.c
- Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)

drivers/iommu/intel-iommu.c
- On initialization, if IOMMU is already enabled, call this bus reset
  function before disabling and re-enabling IOMMU.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas July 29, 2013, 2:17 p.m. UTC | #21

On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> (2013/07/26 2:00), Bjorn Helgaas wrote:
>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>> <indou.takao@jp.fujitsu.com> wrote:
>>> Sorry for letting this discussion slide, I was busy on other works:-(
>>> Anyway, the summary of previous discussion is:
>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>>    boot. This expects PCI enumeration is done before IOMMU
>>>    initialization as follows.
>>>      (1) PCI enumeration
>>>      (2) fs_initcall ---> device reset
>>>      (3) IOMMU initialization
>>> - This works on x86, but does not work on other architecture because
>>>    IOMMU is initialized before PCI enumeration on some architectures. So,
>>>    device reset should be done where IOMMU is initialized instead of
>>>    initcall.
>>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>>
>>> Resetting devices in panic kernel is against kdump policy and seems not to
>>> be good idea. So I think adding reset code into iommu initialization is
>>> better. I'll post patches for that.
>>
>> Of course nobody *wants* to do anything in the panic kernel.  But
>> simply saying "it's against kdump policy and seems not to be a good
>> idea" is not a technical argument.  There are things that are
>> impractical to do in the kdump kernel, so they have to be done in the
>> panic kernel even though we know the kernel is unreliable and the
>> attempt may fail.
>
> Accessing kernel data in panic kernel causes panic again, so
> - Don't touch kernel data in panic situation
> - Jump to kdump kernel as quickly as possible, and do things in safe
>   kernel
> These are basic "kdump policy". Of course if there are any works which
> we cannot do in kdump kernel and can do only in panic kernel, for
> example saving registers or stopping cpus, we should do them in panic
> kernel.
>
> Resetting devices in panic kernel is worth considering if we can safely
> find pci_dev and reset it, but I have no idea how to do that because
> for example struct pci_dev may be borken.

Nobody can guarantee that the panic kernel can do *anything* safely
because any arbitrary kernel data or text may be corrupted.  But if
you consider any specific data structure, e.g., CPU or PCI device
lists, it's not very likely that it will be corrupted.

>> My point about IOMMU and PCI initialization order doesn't go away just
>> because it doesn't fit "kdump policy."  Having system initialization
>> occur in a logical order is far more important than making kdump work.
>
> My next plan is as follows. I think this is matched to logical order
> on boot.
>
> drivers/pci/pci.c
> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>
> drivers/iommu/intel-iommu.c
> - On initialization, if IOMMU is already enabled, call this bus reset
>   function before disabling and re-enabling IOMMU.

I raised this issue because of arches like sparc that enumerate the
IOMMU before the PCI devices that use it.  In that situation, I think
you're proposing this:

  panic kernel
    enable IOMMU
    panic
  kdump kernel
    initialize IOMMU (already enabled)
      pci_reset_bus
      disable IOMMU
      enable IOMMU
    enumerate PCI devices

But the problem is that when you call pci_reset_bus(), you haven't
enumerated the PCI devices, so you don't know what to reset.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh July 30, 2013, 6:09 a.m. UTC | #22

(2013/07/29 23:17), Bjorn Helgaas wrote:
> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2013/07/26 2:00), Bjorn Helgaas wrote:
>>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>>> <indou.takao@jp.fujitsu.com> wrote:
>>>> Sorry for letting this discussion slide, I was busy on other works:-(
>>>> Anyway, the summary of previous discussion is:
>>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>>>     boot. This expects PCI enumeration is done before IOMMU
>>>>     initialization as follows.
>>>>       (1) PCI enumeration
>>>>       (2) fs_initcall ---> device reset
>>>>       (3) IOMMU initialization
>>>> - This works on x86, but does not work on other architecture because
>>>>     IOMMU is initialized before PCI enumeration on some architectures. So,
>>>>     device reset should be done where IOMMU is initialized instead of
>>>>     initcall.
>>>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>>>
>>>> Resetting devices in panic kernel is against kdump policy and seems not to
>>>> be good idea. So I think adding reset code into iommu initialization is
>>>> better. I'll post patches for that.
>>>
>>> Of course nobody *wants* to do anything in the panic kernel.  But
>>> simply saying "it's against kdump policy and seems not to be a good
>>> idea" is not a technical argument.  There are things that are
>>> impractical to do in the kdump kernel, so they have to be done in the
>>> panic kernel even though we know the kernel is unreliable and the
>>> attempt may fail.
>>
>> Accessing kernel data in panic kernel causes panic again, so
>> - Don't touch kernel data in panic situation
>> - Jump to kdump kernel as quickly as possible, and do things in safe
>>    kernel
>> These are basic "kdump policy". Of course if there are any works which
>> we cannot do in kdump kernel and can do only in panic kernel, for
>> example saving registers or stopping cpus, we should do them in panic
>> kernel.
>>
>> Resetting devices in panic kernel is worth considering if we can safely
>> find pci_dev and reset it, but I have no idea how to do that because
>> for example struct pci_dev may be borken.
> 
> Nobody can guarantee that the panic kernel can do *anything* safely
> because any arbitrary kernel data or text may be corrupted.  But if
> you consider any specific data structure, e.g., CPU or PCI device
> lists, it's not very likely that it will be corrupted.

To reset device we need to scan pci device tree using for_each_pci_dev.
Something like bust_spinlocks() to clear pci_lock forcibly is needed.
Vivek, adding these into kdump is acceptable for you? Or any other
ideas? I think iterating over a list like for_each_pci_dev is dangerous.

> 
>>> My point about IOMMU and PCI initialization order doesn't go away just
>>> because it doesn't fit "kdump policy."  Having system initialization
>>> occur in a logical order is far more important than making kdump work.
>>
>> My next plan is as follows. I think this is matched to logical order
>> on boot.
>>
>> drivers/pci/pci.c
>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>
>> drivers/iommu/intel-iommu.c
>> - On initialization, if IOMMU is already enabled, call this bus reset
>>    function before disabling and re-enabling IOMMU.
> 
> I raised this issue because of arches like sparc that enumerate the
> IOMMU before the PCI devices that use it.  In that situation, I think
> you're proposing this:
> 
>    panic kernel
>      enable IOMMU
>      panic
>    kdump kernel
>      initialize IOMMU (already enabled)
>        pci_reset_bus
>        disable IOMMU
>        enable IOMMU
>      enumerate PCI devices
> 
> But the problem is that when you call pci_reset_bus(), you haven't
> enumerated the PCI devices, so you don't know what to reset.

Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
initialization is based on the assumption that enumeration of PCI devices 
is already done. We can find target device from IOMMU page table instead
of scanning all devices in pci tree.

Therefore, this idea is only for intel-iommu. Other architectures need
to implement their own reset code.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas July 30, 2013, 3:59 p.m. UTC | #23

On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> (2013/07/29 23:17), Bjorn Helgaas wrote:
>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>> (2013/07/26 2:00), Bjorn Helgaas wrote:

>>>> My point about IOMMU and PCI initialization order doesn't go away just
>>>> because it doesn't fit "kdump policy."  Having system initialization
>>>> occur in a logical order is far more important than making kdump work.
>>>
>>> My next plan is as follows. I think this is matched to logical order
>>> on boot.
>>>
>>> drivers/pci/pci.c
>>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>>
>>> drivers/iommu/intel-iommu.c
>>> - On initialization, if IOMMU is already enabled, call this bus reset
>>>    function before disabling and re-enabling IOMMU.
>>
>> I raised this issue because of arches like sparc that enumerate the
>> IOMMU before the PCI devices that use it.  In that situation, I think
>> you're proposing this:
>>
>>    panic kernel
>>      enable IOMMU
>>      panic
>>    kdump kernel
>>      initialize IOMMU (already enabled)
>>        pci_reset_bus
>>        disable IOMMU
>>        enable IOMMU
>>      enumerate PCI devices
>>
>> But the problem is that when you call pci_reset_bus(), you haven't
>> enumerated the PCI devices, so you don't know what to reset.
>
> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> initialization is based on the assumption that enumeration of PCI devices
> is already done. We can find target device from IOMMU page table instead
> of scanning all devices in pci tree.
>
> Therefore, this idea is only for intel-iommu. Other architectures need
> to implement their own reset code.

That's my point.  I'm opposed to adding code to PCI when it only
benefits x86 and we know other arches will need a fundamentally
different design.  I would rather have a design that can work for all
arches.

If your implementation is totally implemented under arch/x86 (or in
intel-iommu.c, I guess), I can't object as much.  However, I think
that eventually even x86 should enumerate the IOMMUs via ACPI before
we enumerate PCI devices.

It's pretty clear that's how BIOS designers expect the OS to work.
For example, sec 8.7.3 of the Intel Virtualization Technology for
Directed I/O spec, rev 1.3, shows the expectation that remapping
hardware (IOMMU) is initialized before discovering the I/O hierarchy
below a hot-added host bridge.  Obviously you're not talking about a
hot-add scenario, but I think the same sequence should apply at
boot-time as well.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vivek Goyal July 31, 2013, 4:09 p.m. UTC | #24

On Thu, Jul 25, 2013 at 11:00:46AM -0600, Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
> > Sorry for letting this discussion slide, I was busy on other works:-(
> > Anyway, the summary of previous discussion is:
> > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
> >   boot. This expects PCI enumeration is done before IOMMU
> >   initialization as follows.
> >     (1) PCI enumeration
> >     (2) fs_initcall ---> device reset
> >     (3) IOMMU initialization
> > - This works on x86, but does not work on other architecture because
> >   IOMMU is initialized before PCI enumeration on some architectures. So,
> >   device reset should be done where IOMMU is initialized instead of
> >   initcall.
> > - Or, as another idea, we can reset devices in first kernel(panic kernel)
> >
> > Resetting devices in panic kernel is against kdump policy and seems not to
> > be good idea. So I think adding reset code into iommu initialization is
> > better. I'll post patches for that.
> 
> Of course nobody *wants* to do anything in the panic kernel.  But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument.  There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

I think resetting all devices in crashed kernel is really a lot of
code. If there is a small piece of code, it can still be considered.

I don't know much about IOMMU or PCI or PCIE. But I am taking one step
back and discuss again the idea of not resetting the IOMMU in second
kernel.

I think resetting the bus is a good idea but just resetting PCIE
will solve only part of the problem and we will same issues with
devices on other buses.

So what sounds more appealing if we could fix this particular
problem at IOMMU level first (and continue to develp patches for
resetting various buses).

In the past also these ideas have been proposed that continue to
use translation table from first kernel. Retain those mappings and
don't reset IOMMU. Reserve some space for kdump mappings in first
kernel and use that reserved mapping space in second kernel. It
never got implemented though.

Bjorn, so what's the fundamental problem with this idea?

Also, what's wrong with DMAR error. If some device tried to do DMA,
and DMA was blocked because IOMMU got reset and mappings are no more
there, why does it lead to failure. Shouldn't we just reate limit 
error messages in such case and if device is needed, anyway driver
will reset it.

Other problem mentioned in this thread is PCI SERR. What is it? Is
it some kind of error device reports if it can't do DMA successfully.
Can these errors be simply ignored kdump kernel? This problem sounds
similar to a device keeping interrupt asserted in second kernel and
kernel simply disables the interrupt line if nobody claims the
interrupt.

IOW, it feels to me that we should handle the issue (DMAR error) at
IOMMU level first (instead of trying to make sure that by the time
we get to initialize IOMMU(), all devices in system have been quiesced
and nobody is doing DMA).

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vivek Goyal July 31, 2013, 7:56 p.m. UTC | #25

On Tue, Jul 30, 2013 at 09:59:16AM -0600, Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
> > (2013/07/29 23:17), Bjorn Helgaas wrote:
> >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> >>> (2013/07/26 2:00), Bjorn Helgaas wrote:
> 
> >>>> My point about IOMMU and PCI initialization order doesn't go away just
> >>>> because it doesn't fit "kdump policy."  Having system initialization
> >>>> occur in a logical order is far more important than making kdump work.
> >>>
> >>> My next plan is as follows. I think this is matched to logical order
> >>> on boot.
> >>>
> >>> drivers/pci/pci.c
> >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
> >>>
> >>> drivers/iommu/intel-iommu.c
> >>> - On initialization, if IOMMU is already enabled, call this bus reset
> >>>    function before disabling and re-enabling IOMMU.
> >>
> >> I raised this issue because of arches like sparc that enumerate the
> >> IOMMU before the PCI devices that use it.  In that situation, I think
> >> you're proposing this:
> >>
> >>    panic kernel
> >>      enable IOMMU
> >>      panic
> >>    kdump kernel
> >>      initialize IOMMU (already enabled)
> >>        pci_reset_bus
> >>        disable IOMMU
> >>        enable IOMMU
> >>      enumerate PCI devices
> >>
> >> But the problem is that when you call pci_reset_bus(), you haven't
> >> enumerated the PCI devices, so you don't know what to reset.
> >
> > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> > initialization is based on the assumption that enumeration of PCI devices
> > is already done. We can find target device from IOMMU page table instead
> > of scanning all devices in pci tree.
> >
> > Therefore, this idea is only for intel-iommu. Other architectures need
> > to implement their own reset code.
> 
> That's my point.  I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design.  I would rather have a design that can work for all
> arches.

I agree. It makes sense to work on a design which works for all
arches.

And that's when I feel that handling this problem at IOMMU level
makes more sense, if we can.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v2] PCI: Reset PCIe devices to stop ongoing DMA

Commit Message

Comments

Patch