Message ID   | 1368509365-2260-1-git-send-email-indou.takao@jp.fujitsu.com (mailing list archive)
State        | New, archived
Delegated to | Bjorn Helgaas
Takao Indoh <indou.takao@jp.fujitsu.com> writes:

> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>    and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>    dma-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
>
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.

At a quick skim this patch looks reasonable.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
>   write
> - Change function/variable name
> - Make a list of devices to be reset
>
> v1:
> https://patchwork.kernel.org/patch/2482291/
>
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |    2 +
>  drivers/pci/pci.c                   |  113 +++++++++++++++++++++++++++++++++++
>  2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> 				any pair of devices, possibly at the cost of
> 				reduced performance.  This also guarantees
> 				that hot-added devices will work.
> +		pcie_reset_endpoint_devices	Reset PCIe endpoint on boot by
> +				hot reset
> 		cbiosize=nn[KMG]	The fixed amount of bus space which is
> 				reserved for the CardBus bridge's IO window.
> 				The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	if (!pci_is_pcie(dev) || !dev->subordinate ||
> +	    list_empty(&dev->subordinate->devices) ||
> +	    ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	     (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> +		return 0;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> +		    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +			/* Don't reset switch, bridge, VGA device */
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "save state\n");
> +		pci_save_state(child);
> +	}
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		dev_info(&child->dev, "restore state\n");
> +		pci_restore_state(child);
> +	}
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> +	u16 ctrl;
> +
> +	dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	/* Read config again to flush previous write */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> +	struct list_head list;
> +	struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> +	struct pci_dev *dev = NULL;
> +	struct pci_dev_entry *pdev_entry, *tmp;
> +	LIST_HEAD(pdev_list);
> +
> +	if (!pcie_reset_endpoint_devices)
> +		return 0;
> +
> +	for_each_pci_dev(dev)
> +		if (need_reset(dev)) {
> +			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> +			pdev_entry->dev = dev;
> +			list_add(&pdev_entry->list, &pdev_list);
> +		}
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		save_downstream_configs(pdev_entry->dev);
> +
> +	list_for_each_entry(pdev_entry, &pdev_list, list)
> +		do_downstream_device_reset(pdev_entry->dev);
> +
> +	msleep(1000);
> +
> +	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> +		restore_downstream_configs(pdev_entry->dev);
> +		kfree(pdev_entry);
> +	}
> +
> +	return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
>  static int __init pci_setup(char *str)
>  {
>  	while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>  			pcie_bus_config = PCIE_BUS_PEER2PEER;
>  		} else if (!strncmp(str, "pcie_scan_all", 13)) {
>  			pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> +		} else if (!strncmp(str, "pcie_reset_endpoint_devices",
> +				    27)) {
> +			pcie_reset_endpoint_devices = 1;
>  		} else {
>  			printk(KERN_ERR "PCI: Unknown option `%s'\n",
>  			       str);

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hi Bjorn,

Any comments, ack/nack?

Thanks,
Takao Indoh

(2013/05/15 7:04), Eric W. Biederman wrote:
> Takao Indoh <indou.takao@jp.fujitsu.com> writes:
>
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint. [...]
>
> At a quick skim this patch looks reasonable.
>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Ping Bjorn, could you please share your impression of this idea?

This fix is very important for kdump. For example, kdump does not work
if PCI passthrough is used on a KVM guest. Please see the patch
description for details. If you don't agree with this patch, please
tell me what changes I should make.

Thanks,
Takao Indoh

(2013/05/14 14:29), Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint. [...]
Sorry it's taken me so long to look at this. I've been putting this
off because the patch doesn't seem "obviously correct," and I don't
feel like I really understand the problem. I'm naive about both
IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
questions or assumptions below.

On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>    and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>    dma-remapping errors.

If I understand correctly, the problem only happens on systems with an
IOMMU that's enabled in either the system or kdump kernel (or both).
For systems without an IOMMU (or if it is disabled in both the system
and kdump kernels), any ongoing DMA should use addresses that target
system-kernel memory and should not affect the kdump kernel.

One thing I'm not sure about is that you are only resetting PCIe
devices, but I don't think the problem is actually specific to PCIe,
is it?  I think the same issue could occur on any system with an
IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
there are systems with IOMMUs for plain old PCI devices, e.g.,
PA-RISC.

I tried to make a list of the interesting scenarios and the events
that are relevant to this problem:

Case 1: IOMMU off in system kernel, off in kdump kernel
  system kernel leaves IOMMU off
    DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
    DMA targets system-kernel memory (harmless)
  kdump kernel re-inits device
    DMA targets kdump-kernel memory

Case 2: IOMMU off in system kernel, on in kdump kernel
  system kernel leaves IOMMU off
    DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
    DMA targets system-kernel memory (harmless)
  kdump kernel enables IOMMU with no valid mappings
    DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
    DMA targets IOMMU-mapped kdump-kernel memory

Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
  system kernel enables IOMMU
    DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
    DMA targets IOMMU-mapped system-kernel memory
  kdump kernel doesn't know about IOMMU or doesn't touch it
    DMA targets IOMMU-mapped system-kernel memory
  kdump kernel re-inits device
    kernel assumes no IOMMU, so all new DMA mappings are invalid
    because DMAs actually do go through the IOMMU
    (** corruption or other non-recoverable error likely **)

Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
  system kernel enables IOMMU
    DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
    DMA targets IOMMU-mapped system-kernel memory
  kdump kernel disables IOMMU
    DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
    (** corruption or other non-recoverable error likely **)
  kdump kernel re-inits device
    DMA targets kdump-kernel memory

Case 4: IOMMU on in system kernel, on in kdump kernel
  system kernel enables IOMMU
    DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
    DMA targets IOMMU-mapped system-kernel memory
  kdump kernel enables IOMMU with no valid mappings
    DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
    DMA targets IOMMU-mapped kdump-kernel memory

The problem cases I see are 3a and 3b, but that's not the problem
you're describing.  Obviously I'm missing something.

It sounds like you're seeing problems in case 2 or case 4, where the
IOMMU is enabled in the kdump kernel.  Maybe my assumption about the
IOMMU being enabled with no valid mappings is wrong?  Or maybe those
IOMMU errors are not actually harmless?

Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
with case 3a.  Therefore, it seems like the kdump kernel *must*
contain IOMMU support unless it knows for certain that the system
kernel wasn't using the IOMMU.

Do you have any bugzilla references or problem report URLs you could
include here?

Obviously I'm very confused here, so please help educate me :)

Also, you mentioned PCI passthrough with a KVM guest as being a
problem.  Can you explain more about what makes that a problem?  I
don't know enough about passthrough to know why a device being used by
a KVM guest would cause more problems than a device being used
directly by the host.

Bjorn

> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init. [...]
Thank you for your comments! (2013/06/07 13:14), Bjorn Helgaas wrote: > Sorry it's taken me so long to look at this. I've been putting this > off because the patch doesn't seem "obviously correct," and I don't > feel like I really understand the problem. I'm naive about both > IOMMUs and kexec/kdump, so please pardon (and help me with) any silly > questions or assumptions below. > > On Mon, May 13, 2013 at 11:29 PM, Takao Indoh > <indou.takao@jp.fujitsu.com> wrote: >> This patch resets PCIe devices on boot to stop ongoing DMA. When >> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered >> on each PCIe root port and downstream port to reset its downstream >> endpoint. >> >> Problem: >> This patch solves the problem that kdump can fail when intel_iommu=on is >> specified. When intel_iommu=on is specified, many dma-remapping errors >> occur in second kernel and it causes problems like driver error or PCI >> SERR, at last kdump fails. This problem is caused as follows. >> 1) Devices are working on first kernel. >> 2) Switch to second kernel(kdump kernel). The devices are still working >> and its DMA continues during this switch. >> 3) iommu is initialized during second kernel boot and ongoing DMA causes >> dma-remapping errors. > > If I understand correctly, the problem only happens on systems with an > IOMMU that's enabled in either the system or kdump kernel (or both). > For systems without an IOMMU (or if it is disabled in both the system > and kdump kernels), any ongoing DMA should use addresses that target > system-kernel memory and should not affect the kdump kernel. Yes, that's correct. > One thing I'm not sure about is that you are only resetting PCIe > devices, but I don't think the problem is actually specific to PCIe, > is it? I think the same issue could occur on any system with an > IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but > there are systems with IOMMUs for plain old PCI devices, e.g., > PA-RISC. 
Right, this is not specific to PCIe. The reason the target is only PCIe is simply to keep the reset algorithm simple. It is possible to reset legacy PCI devices with my patch, but the code becomes somewhat complicated. I thought that most recent systems used PCIe and there was little demand for resetting legacy PCI, so I decided not to reset legacy PCI devices, but I'll do it if there are requests :-) > > I tried to make a list of the interesting scenarios and the events > that are relevant to this problem: > > Case 1: IOMMU off in system, off in kdump kernel > system kernel leaves IOMMU off > DMA targets system-kernel memory > kexec to kdump kernel (IOMMU off, devices untouched) > DMA targets system-kernel memory (harmless) > kdump kernel re-inits device > DMA targets kdump-kernel memory > > Case 2: IOMMU off in system kernel, on in kdump kernel > system kernel leaves IOMMU off > DMA targets system-kernel memory > kexec to kdump kernel (IOMMU off, devices untouched) > DMA targets system-kernel memory (harmless) > kdump kernel enables IOMMU with no valid mappings > DMA causes IOMMU errors (annoying but harmless) > kdump kernel re-inits device > DMA targets IOMMU-mapped kdump-kernel memory > > Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU > system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory > kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory > kdump kernel doesn't know about IOMMU or doesn't touch it > DMA targets IOMMU-mapped system-kernel memory > kdump kernel re-inits device > kernel assumes no IOMMU, so all new DMA mappings are invalid > because DMAs actually do go through the IOMMU > (** corruption or other non-recoverable error likely **) > > Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU > system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory > kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped
system-kernel memory > kdump kernel disables IOMMU > DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled > (** corruption or other non-recoverable error likely **) > kdump kernel re-inits device > DMA targets kdump-kernel memory > > Case 4: IOMMU on in system kernel, on in kdump kernel > system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory > kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory > kdump kernel enables IOMMU with no valid mappings > DMA causes IOMMU errors (annoying but harmless) This is not harmless. Errors like PCI SERR are detected here; they make the driver or the system unstable, and in the end kdump fails. I have also received a report of a system hanging up because of this. > kdump kernel re-inits device > DMA targets IOMMU-mapped kdump-kernel memory > > The problem cases I see are 3a and 3b, but that's not the problem > you're describing. Obviously I'm missing something. > > It sounds like you're seeing problems in case 2 or case 4, where the > IOMMU is enabled in the kdump kernel. Maybe my assumption about the > IOMMU being enabled with no valid mappings is wrong? Or maybe those > IOMMU errors are not actually harmless? As I wrote above, IOMMU errors are not harmless. What I know is: Case 1: Harmless Case 2: Not tested Case 3a: Not tested Case 3b: Causes a problem, fixed by my patch Case 4: Causes a problem, fixed by my patch I have never tested cases 2 and 3a, but I think they cause problems as well. > Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not > with case 3a. Therefore, it seems like the kdump kernel *must* > contain IOMMU support unless it knows for certain that the system > kernel wasn't using the IOMMU. Yes, the kdump kernel has to be compiled with IOMMU support. > > Do you have any bugzilla references or problem report URLs you could > include here? I know of three Red Hat bugzillas, but I think those are private articles that you cannot see.
I'll add you to the Cc list in the BZs so that you can see them. BZ#743790: Kdump fails with intel_iommu=on https://bugzilla.redhat.com/show_bug.cgi?id=743790 BZ#743495: Hardware error is detected with intel_iommu=on https://bugzilla.redhat.com/show_bug.cgi?id=743495 BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt https://bugzilla.redhat.com/show_bug.cgi?id=833299 > > Obviously I'm very confused here, so please help educate me :) > > Also, you mentioned PCI passthrough with a KVM guest as being a > problem. Can you explain more about what makes that a problem? I > don't know enough about passthrough to know why a device being used by > a KVM guest would cause more problems than a device being used > directly by the host. Sorry, my explanation was misleading. I mentioned PCI passthrough as a usage example of the IOMMU. To assign devices directly to a KVM guest using PCI passthrough, we have to enable the IOMMU in the host kernel. But if the IOMMU is enabled and a panic happens in the host, kdump starts in the host but fails due to the problem I mentioned in the patch description. Did I answer your question? Thanks, Takao Indoh > > Bjorn > >> Solution: >> All DMA transactions have to be stopped before iommu is initialized. By >> this patch devices are reset and in-flight DMA is stopped before >> pci_iommu_init. >> >> To invoke hot reset on an endpoint, its upstream link need to be reset. >> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds >> root port/downstream port whose child is PCIe endpoint, and then reset >> link between them. If the endpoint is VGA device, it is skipped because >> the monitor blacks out if VGA controller is reset.
>> >> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com> >> --- >> Documentation/kernel-parameters.txt | 2 + >> drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++ >> 2 files changed, 115 insertions(+), 0 deletions(-) >> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >> index c3bfacb..8c9e8e4 100644 >> --- a/Documentation/kernel-parameters.txt >> +++ b/Documentation/kernel-parameters.txt >> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. >> any pair of devices, possibly at the cost of >> reduced performance. This also guarantees >> that hot-added devices will work. >> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by >> + hot reset >> cbiosize=nn[KMG] The fixed amount of bus space which is >> reserved for the CardBus bridge's IO window. >> The default value is 256 bytes. >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index a899d8b..70c1205 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >> } >> EXPORT_SYMBOL(pci_fixup_cardbus); >> >> +/* >> + * Return true if dev is PCIe root port or downstream port whose child is PCIe >> + * endpoint except VGA device. 
>> + */ >> +static int __init need_reset(struct pci_dev *dev) >> +{ >> + struct pci_bus *subordinate; >> + struct pci_dev *child; >> + >> + if (!pci_is_pcie(dev) || !dev->subordinate || >> + list_empty(&dev->subordinate->devices) || >> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >> + return 0; >> + >> + subordinate = dev->subordinate; >> + list_for_each_entry(child, &subordinate->devices, bus_list) { >> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) >> + /* Don't reset switch, bridge, VGA device */ >> + return 0; >> + } >> + >> + return 1; >> +} >> + >> +static void __init save_downstream_configs(struct pci_dev *dev) >> +{ >> + struct pci_bus *subordinate; >> + struct pci_dev *child; >> + >> + subordinate = dev->subordinate; >> + list_for_each_entry(child, &subordinate->devices, bus_list) { >> + dev_info(&child->dev, "save state\n"); >> + pci_save_state(child); >> + } >> +} >> + >> +static void __init restore_downstream_configs(struct pci_dev *dev) >> +{ >> + struct pci_bus *subordinate; >> + struct pci_dev *child; >> + >> + subordinate = dev->subordinate; >> + list_for_each_entry(child, &subordinate->devices, bus_list) { >> + dev_info(&child->dev, "restore state\n"); >> + pci_restore_state(child); >> + } >> +} >> + >> +static void __init do_downstream_device_reset(struct pci_dev *dev) >> +{ >> + u16 ctrl; >> + >> + dev_info(&dev->dev, "Reset Secondary bus\n"); >> + >> + /* Assert Secondary Bus Reset */ >> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl); >> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >> + >> + /* Read config again to flush previous write */ >> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl); >> + >> + msleep(2); >> + >> + /* De-assert Secondary Bus Reset */ >> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET; >> + 
pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >> +} >> + >> +struct pci_dev_entry { >> + struct list_head list; >> + struct pci_dev *dev; >> +}; >> +static int __initdata pcie_reset_endpoint_devices; >> +static int __init reset_pcie_endpoints(void) >> +{ >> + struct pci_dev *dev = NULL; >> + struct pci_dev_entry *pdev_entry, *tmp; >> + LIST_HEAD(pdev_list); >> + >> + if (!pcie_reset_endpoint_devices) >> + return 0; >> + >> + for_each_pci_dev(dev) >> + if (need_reset(dev)) { >> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL); >> + pdev_entry->dev = dev; >> + list_add(&pdev_entry->list, &pdev_list); >> + } >> + >> + list_for_each_entry(pdev_entry, &pdev_list, list) >> + save_downstream_configs(pdev_entry->dev); >> + >> + list_for_each_entry(pdev_entry, &pdev_list, list) >> + do_downstream_device_reset(pdev_entry->dev); >> + >> + msleep(1000); >> + >> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) { >> + restore_downstream_configs(pdev_entry->dev); >> + kfree(pdev_entry); >> + } >> + >> + return 0; >> +} >> +fs_initcall_sync(reset_pcie_endpoints); >> + >> static int __init pci_setup(char *str) >> { >> while (str) { >> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str) >> pcie_bus_config = PCIE_BUS_PEER2PEER; >> } else if (!strncmp(str, "pcie_reset_endpoint_devices", >> + 27)) { >> + pcie_reset_endpoint_devices = 1; >> } else if (!strncmp(str, "pcie_scan_all", 13)) { >> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); >> } else { >> printk(KERN_ERR "PCI: Unknown option `%s'\n", >> str); >> -- >> 1.7.1 >> >> > >
On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > (2013/06/07 13:14), Bjorn Helgaas wrote: >> One thing I'm not sure about is that you are only resetting PCIe >> devices, but I don't think the problem is actually specific to PCIe, >> is it? I think the same issue could occur on any system with an >> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >> there are systems with IOMMUs for plain old PCI devices, e.g., >> PA-RISC. > > Right, this is not specific to PCIe. The reasons why the target is only > PCIe is just to make algorithm to reset simple. It is possible to reset > legacy PCI devices in my patch, but code becomes somewhat complicated. I > thought recently most systems used PCIe and there was little demand for > resetting legacy PCI. Therefore I decided not to reset legacy PCI > devices, but I'll do if there are requests :-) I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. We already have a "reset_devices" boot option. This is for the same purpose, as far as I can tell, and I'm not sure there's value in having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In fact, there's nothing specific even to PCI here. The Intel VT-d docs seem carefully written so they could apply to either PCIe or non-PCI devices. 
>> I tried to make a list of the interesting scenarios and the events >> that are relevant to this problem: >> >> Case 1: IOMMU off in system, off in kdump kernel >> system kernel leaves IOMMU off >> DMA targets system-kernel memory >> kexec to kdump kernel (IOMMU off, devices untouched) >> DMA targets system-kernel memory (harmless) >> kdump kernel re-inits device >> DMA targets kdump-kernel memory >> >> Case 2: IOMMU off in system kernel, on in kdump kernel >> system kernel leaves IOMMU off >> DMA targets system-kernel memory >> kexec to kdump kernel (IOMMU off, devices untouched) >> DMA targets system-kernel memory (harmless) >> kdump kernel enables IOMMU with no valid mappings >> DMA causes IOMMU errors (annoying but harmless) >> kdump kernel re-inits device >> DMA targets IOMMU-mapped kdump-kernel memory >> >> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >> system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >> kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >> kdump kernel doesn't know about IOMMU or doesn't touch it >> DMA targets IOMMU-mapped system-kernel memory >> kdump kernel re-inits device >> kernel assumes no IOMMU, so all new DMA mappings are invalid >> because DMAs actually do go through the IOMMU >> (** corruption or other non-recoverable error likely **) >> >> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >> system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >> kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >> kdump kernel disables IOMMU >> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >> (** corruption or other non-recoverable error likely **) >> kdump kernel re-inits device >> DMA targets kdump-kernel memory >> >> Case 4: IOMMU on in system kernel, on in kdump kernel >> system kernel enables IOMMU >> DMA targets IOMMU-mapped 
system-kernel memory >> kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >> kdump kernel enables IOMMU with no valid mappings >> DMA causes IOMMU errors (annoying but harmless) >> kdump kernel re-inits device >> DMA targets IOMMU-mapped kdump-kernel memory > > This is not harmless. Errors like PCI SERR are detected here, and it > makes driver or system unstable, and kdump fails. I also got report that > system hangs up due to this. OK, let's take this slowly. Does an IOMMU error in the system kernel also cause SERR or make the system unstable? Is that the expected behavior on IOMMU errors, or is there something special about the kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are reported with printk and don't seem to cause an SERR. Maybe the SERR is system-specific behavior? https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public) report of IOMMU errors related to a driver bug where we just get printks, not SERR. https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang when the kdump kernel reboots (after successfully saving a crashdump). But it is using "iommu=pt", which I don't believe makes sense. The scenario is basically case 4 above, but instead of the kdump kernel starting with no valid IOMMU mappings, it identity-maps bus addresses to physical memory addresses. That's completely bogus because it's certainly not what the system kernel did, so it's entirely likely to make the system unstable or hang. This is not an argument for doing a reset; it's an argument for doing something smarter than "iommu=pt" in the kdump kernel. We might still want to reset PCIe devices, but I want to make sure that we're not papering over other issues when we do. Therefore, I'd like to understand why IOMMU errors seem harmless in some cases but not in others. 
> Case 1: Harmless > Case 2: Not tested > Case 3a: Not tested > Case 3b: Cause problem, fixed by my patch > Case 4: Cause problem, fixed by my patch > > I have never tested case 2 and 3a, but I think it also causes problem. I do not believe we need to support case 3b (IOMMU on in system kernel but disabled in kdump kernel). There is no way to make that reliable unless every single device that may perform DMA is reset, and since you don't reset legacy PCI or VGA devices, you're not even proposing to do that. I think we need to support case 1 (for systems with no IOMMU at all) and case 4 (IOMMU enabled in both system kernel and kdump kernel). If the kdump kernel can detect whether the IOMMU is enabled, that should be enough -- it could figure out automatically whether we're in case 1 or 4. >> Do you have any bugzilla references or problem report URLs you could >> include here? > I know three Red Hat bugzilla, but I think these are private article and > you cannot see. I'll add you Cc list in bz so that you can see. > > BZ#743790: Kdump fails with intel_iommu=on > https://bugzilla.redhat.com/show_bug.cgi?id=743790 > > BZ#743495: Hardware error is detected with intel_iommu=on > https://bugzilla.redhat.com/show_bug.cgi?id=743495 > > BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt > https://bugzilla.redhat.com/show_bug.cgi?id=833299 Thanks for adding me to the CC lists. I looked all three and I'm not sure there's anything sensitive in them. It'd be nice if they could be made public if there's not. Bjorn
(2013/06/11 11:20), Bjorn Helgaas wrote: > On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >> (2013/06/07 13:14), Bjorn Helgaas wrote: > >>> One thing I'm not sure about is that you are only resetting PCIe >>> devices, but I don't think the problem is actually specific to PCIe, >>> is it? I think the same issue could occur on any system with an >>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >>> there are systems with IOMMUs for plain old PCI devices, e.g., >>> PA-RISC. >> >> Right, this is not specific to PCIe. The reasons why the target is only >> PCIe is just to make algorithm to reset simple. It is possible to reset >> legacy PCI devices in my patch, but code becomes somewhat complicated. I >> thought recently most systems used PCIe and there was little demand for >> resetting legacy PCI. Therefore I decided not to reset legacy PCI >> devices, but I'll do if there are requests :-) > > I'm not sure you need to reset legacy devices (or non-PCI devices) > yet, but the current hook isn't anchored anywhere -- it's just an > fs_initcall() that doesn't give the reader any clue about the > connection between the reset and the problem it's solving. > > If we do something like this patch, I think it needs to be done at the > point where we enable or disable the IOMMU. That way, it's connected > to the important event, and there's a clue about how to make > corresponding fixes for other IOMMUs. Ok. Would pci_iommu_init() be an appropriate place to add this hook? > We already have a "reset_devices" boot option. This is for the same > purpose, as far as I can tell, and I'm not sure there's value in > having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In > fact, there's nothing specific even to PCI here. The Intel VT-d docs > seem carefully written so they could apply to either PCIe or non-PCI > devices. Yeah, I can integrate this option into reset_devices. The reason I kept them separate is to avoid regressions.
I have tested my patch on many machines and it basically worked, but I found two machines where this reset patch caused problems. The first was caused by a bug in a RAID card's firmware; after updating the firmware, this patch worked. The second was due to a bug in a PCIe switch chip. I reported this bug to the vendor, but it is not fixed yet. Anyway, this patch may cause problems on such buggy machines, so I introduced a new boot parameter so that the user can enable or disable this function independently of reset_devices. Actually, Vivek Goyal, the kdump maintainer, said the same thing. He proposed using reset_devices instead of adding a new parameter, and using a quirk or something similar to avoid such buggy devices. So, basically I agree with using reset_devices, but I want to have a workaround prepared in case this reset causes something to go wrong. > >>> I tried to make a list of the interesting scenarios and the events >>> that are relevant to this problem: >>> >>> Case 1: IOMMU off in system, off in kdump kernel >>> system kernel leaves IOMMU off >>> DMA targets system-kernel memory >>> kexec to kdump kernel (IOMMU off, devices untouched) >>> DMA targets system-kernel memory (harmless) >>> kdump kernel re-inits device >>> DMA targets kdump-kernel memory >>> >>> Case 2: IOMMU off in system kernel, on in kdump kernel >>> system kernel leaves IOMMU off >>> DMA targets system-kernel memory >>> kexec to kdump kernel (IOMMU off, devices untouched) >>> DMA targets system-kernel memory (harmless) >>> kdump kernel enables IOMMU with no valid mappings >>> DMA causes IOMMU errors (annoying but harmless) >>> kdump kernel re-inits device >>> DMA targets IOMMU-mapped kdump-kernel memory >>> >>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>> system kernel enables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory >>> kexec to kdump kernel (IOMMU on, devices untouched) >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel doesn't know about IOMMU or doesn't touch it >>> DMA targets
IOMMU-mapped system-kernel memory >>> kdump kernel re-inits device >>> kernel assumes no IOMMU, so all new DMA mappings are invalid >>> because DMAs actually do go through the IOMMU >>> (** corruption or other non-recoverable error likely **) >>> >>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >>> system kernel enables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory >>> kexec to kdump kernel (IOMMU on, devices untouched) >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel disables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >>> (** corruption or other non-recoverable error likely **) >>> kdump kernel re-inits device >>> DMA targets kdump-kernel memory >>> >>> Case 4: IOMMU on in system kernel, on in kdump kernel >>> system kernel enables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory >>> kexec to kdump kernel (IOMMU on, devices untouched) >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel enables IOMMU with no valid mappings >>> DMA causes IOMMU errors (annoying but harmless) >>> kdump kernel re-inits device >>> DMA targets IOMMU-mapped kdump-kernel memory >> >> This is not harmless. Errors like PCI SERR are detected here, and it >> makes driver or system unstable, and kdump fails. I also got report that >> system hangs up due to this. > > OK, let's take this slowly. Does an IOMMU error in the system kernel > also cause SERR or make the system unstable? Is that the expected > behavior on IOMMU errors, or is there something special about the > kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., > those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are > reported with printk and don't seem to cause an SERR. Maybe the SERR > is system-specific behavior? > > https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public) > report of IOMMU errors related to a driver bug where we just get > printks, not SERR. 
Yes, it depends on platform or devices. At least PCI SERR is detected on Fujitsu PRIMERGY BX920S2 and TX300S6. Intel VT-d documents[1] says: 3.5.1 Hardware Handling of Faulting DMA Requests DMA requests that result in remapping faults must be blocked by hardware. The exact method of DMA blocking is implementation-specific. For example: - Faulting DMA write requests may be handled in much the same way as hardware handles write requests to non-existent memory. For example, the DMA request is discarded in a manner convenient for implementations (such as by dropping the cycle, completing the write request to memory with all byte enables masked off, re-directed to a dummy memory location, etc.). - Faulting DMA read requests may be handled in much the same way as hardware handles read requests to non-existent memory. For example, the request may be redirected to a dummy memory location, returned as all 0’s or 1’s in a manner convenient to the implementation, or the request may be completed with an explicit error indication (preferred). For faulting DMA read requests from PCI Express devices, hardware must indicate "Unsupported Request" (UR) in the completion status field of the PCI Express read completion. So, after IOMMU error, its behavior is implementation-specific. [1] Intel Virtualization Technology for Directed I/O Architecture Specification Rev1.3 > > https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang > when the kdump kernel reboots (after successfully saving a crashdump). > But it is using "iommu=pt", which I don't believe makes sense. The > scenario is basically case 4 above, but instead of the kdump kernel > starting with no valid IOMMU mappings, it identity-maps bus addresses > to physical memory addresses. That's completely bogus because it's > certainly not what the system kernel did, so it's entirely likely to > make the system unstable or hang. 
This is not an argument for doing a > reset; it's an argument for doing something smarter than "iommu=pt" in > the kdump kernel. > > We might still want to reset PCIe devices, but I want to make sure > that we're not papering over other issues when we do. Therefore, I'd > like to understand why IOMMU errors seem harmless in some cases but > not in others. As I wrote above, IOMMU behavior on error is up to the platform/devices. I think it also depends on the value of the Uncorrectable Error Mask Register in each device's AER registers. >> Case 1: Harmless >> Case 2: Not tested >> Case 3a: Not tested >> Case 3b: Cause problem, fixed by my patch >> Case 4: Cause problem, fixed by my patch >> >> I have never tested case 2 and 3a, but I think it also causes problem. > > I do not believe we need to support case 3b (IOMMU on in system kernel > but disabled in kdump kernel). There is no way to make that reliable > unless every single device that may perform DMA is reset, and since > you don't reset legacy PCI or VGA devices, you're not even proposing > to do that. > > I think we need to support case 1 (for systems with no IOMMU at all) > and case 4 (IOMMU enabled in both system kernel and kdump kernel). If > the kdump kernel can detect whether the IOMMU is enabled, that should > be enough -- it could figure out automatically whether we're in case 1 > or 4. Ok, I completely agree. >>> Do you have any bugzilla references or problem report URLs you could >>> include here? >> >> I know three Red Hat bugzilla, but I think these are private article and >> you cannot see. I'll add you Cc list in bz so that you can see.
>> >> BZ#743790: Kdump fails with intel_iommu=on >> https://bugzilla.redhat.com/show_bug.cgi?id=743790 >> >> BZ#743495: Hardware error is detected with intel_iommu=on >> https://bugzilla.redhat.com/show_bug.cgi?id=743495 >> >> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt >> https://bugzilla.redhat.com/show_bug.cgi?id=833299 > > Thanks for adding me to the CC lists. I looked all three and I'm not > sure there's anything sensitive in them. It'd be nice if they could > be made public if there's not. I really appreciate your comments! I'll confirm whether I can make them public. Thanks, Takao Indoh
>(2013/06/11 11:20), Bjorn Helgaas wrote: >> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >>> (2013/06/07 13:14), Bjorn Helgaas wrote: >> >>>> One thing I'm not sure about is that you are only resetting PCIe >>>> devices, but I don't think the problem is actually specific to PCIe, >>>> is it? I think the same issue could occur on any system with an >>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >>>> there are systems with IOMMUs for plain old PCI devices, e.g., >>>> PA-RISC. >>> >>> Right, this is not specific to PCIe. The reasons why the target is only >>> PCIe is just to make algorithm to reset simple. It is possible to reset >>> legacy PCI devices in my patch, but code becomes somewhat complicated. I >>> thought recently most systems used PCIe and there was little demand for >>> resetting legacy PCI. Therefore I decided not to reset legacy PCI >>> devices, but I'll do if there are requests :-) >> >> I'm not sure you need to reset legacy devices (or non-PCI devices) >> yet, but the current hook isn't anchored anywhere -- it's just an >> fs_initcall() that doesn't give the reader any clue about the >> connection between the reset and the problem it's solving. >> >> If we do something like this patch, I think it needs to be done at the >> point where we enable or disable the IOMMU. That way, it's connected >> to the important event, and there's a clue about how to make >> corresponding fixes for other IOMMUs. > >Ok. pci_iommu_init() is appropriate place to add this hook? > >> We already have a "reset_devices" boot option. This is for the same >> purpose, as far as I can tell, and I'm not sure there's value in >> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In >> fact, there's nothing specific even to PCI here. The Intel VT-d docs >> seem carefully written so they could apply to either PCIe or non-PCI >> devices. > >Yeah, I can integrate this option into reset_devices. 
The reason why >I separate them is to avoid regression. > >I have tested my patch on many machines and basically it worked, but I >found two machines where this reset patch caused problem. The first one >was caused by bug of raid card firmware. After updating firmware, this >patch worked. The second one was due to bug of PCIe switch chip. I >reported this bug to the vendor but it is not fixed yet. > >Anyway, this patch may cause problems on such a buggy machine, so I >introduced new boot parameter so that user can enable or disable this >function aside from reset_devices. > >Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >reset_devices instead of adding new one, and using quirk or something to >avoid such a buggy devices. With respect to "and using quirk or something to avoid such buggy devices", I believe that it will be necessary to provide a mechanism for devices that need special handling to do the reset -- perhaps something like a list of tuples: (device_type, function_to_call) with a default function_to_call when the device_type is not found in the list. These functions would need to be physically separate from the device driver because if the device is present it needs to be reset even if the crash kernel chooses not to load the driver for that device. > >So, basically I agree with using reset_devices, but I want to prepare >workaround in case this reset causes something wrong. > I like the ability to specify the original "reset_devices" separately from invoking this new mechanism. With so many different uses for Linux in so many different environments and with so many different device drivers it seems reasonable to keep the ability to tell the device drivers to reset their devices -- instead of pulling the reset line on all devices. I also like the ability to invoke the new reset feature separately from telling the device drivers to do it. 
> >> >>>> I tried to make a list of the interesting scenarios and the events >>>> that are relevant to this problem: >>>> >>>> Case 1: IOMMU off in system, off in kdump kernel >>>> system kernel leaves IOMMU off >>>> DMA targets system-kernel memory >>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>> DMA targets system-kernel memory (harmless) >>>> kdump kernel re-inits device >>>> DMA targets kdump-kernel memory >>>> >>>> Case 2: IOMMU off in system kernel, on in kdump kernel >>>> system kernel leaves IOMMU off >>>> DMA targets system-kernel memory >>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>> DMA targets system-kernel memory (harmless) >>>> kdump kernel enables IOMMU with no valid mappings >>>> DMA causes IOMMU errors (annoying but harmless) >>>> kdump kernel re-inits device >>>> DMA targets IOMMU-mapped kdump-kernel memory >>>> >>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>>> system kernel enables IOMMU >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kdump kernel doesn't know about IOMMU or doesn't touch it >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kdump kernel re-inits device >>>> kernel assumes no IOMMU, so all new DMA mappings are invalid >>>> because DMAs actually do go through the IOMMU >>>> (** corruption or other non-recoverable error likely **) >>>> >>>> >>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >>>> system kernel enables IOMMU >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kdump kernel disables IOMMU >>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >>>> (** corruption or other non-recoverable error likely **) >>>> kdump kernel re-inits device >>>> DMA targets kdump-kernel memory >>>> >>>> Case 4: IOMMU 
on in system kernel, on in kdump kernel >>>> system kernel enables IOMMU >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>> DMA targets IOMMU-mapped system-kernel memory >>>> kdump kernel enables IOMMU with no valid mappings >>>> DMA causes IOMMU errors (annoying but harmless) >>>> kdump kernel re-inits device >>>> DMA targets IOMMU-mapped kdump-kernel memory >>> >>> This is not harmless. Errors like PCI SERR are detected here, and it >>> makes driver or system unstable, and kdump fails. I also got report that >>> system hangs up due to this. >> >> OK, let's take this slowly. Does an IOMMU error in the system kernel >> also cause SERR or make the system unstable? Is that the expected >> behavior on IOMMU errors, or is there something special about the >> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., >> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are >> reported with printk and don't seem to cause an SERR. Maybe the SERR >> is system-specific behavior? >> >> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public) >> report of IOMMU errors related to a driver bug where we just get >> printks, not SERR. > >Yes, it depends on platform or devices. At least PCI SERR is detected on >Fujitsu PRIMERGY BX920S2 and TX300S6. > >Intel VT-d documents[1] says: > > 3.5.1 Hardware Handling of Faulting DMA Requests > DMA requests that result in remapping faults must be blocked by > hardware. The exact method of DMA blocking is > implementation-specific. For example: > > - Faulting DMA write requests may be handled in much the same way as > hardware handles write requests to non-existent memory. For > example, the DMA request is discarded in a manner convenient for > implementations (such as by dropping the cycle, completing the > write request to memory with all byte enables masked off, > re-directed to a dummy memory location, etc.). 
> > - Faulting DMA read requests may be handled in much the same way as > hardware handles read requests to non-existent memory. For > example, the request may be redirected to a dummy memory location, > returned as all 0's or 1's in a manner convenient to the > implementation, or the request may be completed with an explicit > error indication (preferred). For faulting DMA read requests from > PCI Express devices, hardware must indicate "Unsupported Request" > (UR) in the completion status field of the PCI Express read > completion. > >So, after IOMMU error, its behavior is implementation-specific. > >[1] >Intel Virtualization Technology for Directed I/O Architecture >Specification Rev1.3 > >> >> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang >> when the kdump kernel reboots (after successfully saving a crashdump). >> But it is using "iommu=pt", which I don't believe makes sense. The >> scenario is basically case 4 above, but instead of the kdump kernel >> starting with no valid IOMMU mappings, it identity-maps bus addresses >> to physical memory addresses. That's completely bogus because it's >> certainly not what the system kernel did, so it's entirely likely to >> make the system unstable or hang. This is not an argument for doing a >> reset; it's an argument for doing something smarter than "iommu=pt" in >> the kdump kernel. >> We might still want to reset PCIe devices, but I want to make sure >> that we're not papering over other issues when we do. Therefore, I'd >> like to understand why IOMMU errors seem harmless in some cases but >> not in others. > >As I wrote above, IOMMU behavior on error is up to the platform/devices. I >think it also depends on the value of the Uncorrectable Error Mask Register >in the AER registers of each device.
> >>> Case 1: Harmless >>> Case 2: Not tested >>> Case 3a: Not tested >>> Case 3b: Cause problem, fixed by my patch >>> Case 4: Cause problem, fixed by my patch >>> >>> I have never tested case 2 and 3a, but I think it also causes problem. >> >> I do not believe we need to support case 3b (IOMMU on in system kernel >> but disabled in kdump kernel). There is no way to make that reliable >> unless every single device that may perform DMA is reset, and since >> you don't reset legacy PCI or VGA devices, you're not even proposing >> to do that. >> >> I think we need to support case 1 (for systems with no IOMMU at all) >> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If >> the kdump kernel can detect whether the IOMMU is enabled, that should >> be enough -- it could figure out automatically whether we're in case 1 >> or 4. > >Ok, I completely agree. > > >>>> Do you have any bugzilla references or problem report URLs you could >>>> include here? >>> >>> I know three Red Hat bugzilla, but I think these are private article and >>> you cannot see. I'll add you Cc list in bz so that you can see. >>> >>> BZ#743790: Kdump fails with intel_iommu=on >>> https://bugzilla.redhat.com/show_bug.cgi?id=743790 >>> >>> BZ#743495: Hardware error is detected with intel_iommu=on >>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 >>> >>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt >>> https://bugzilla.redhat.com/show_bug.cgi?id=833299 >> >> Thanks for adding me to the CC lists. I looked all three and I'm not >> sure there's anything sensitive in them. It'd be nice if they could >> be made public if there's not. > >I really appreciate your comments! I'll confirm I can make it public. I would greatly appreciate being able to see the bugzillas relating to this patch. 
> >Thanks, >Takao Indoh Thinking out of the box: Much of the discussion about dealing with the ongoing DMA leftover from the system kernel has assumed that the crash kernel will reset the IOMMU -- which causes various problems if done while there is any DMA still active -- which leads to the idea of stopping all of the DMA. Suppose the crash kernel does not reset the hardware IOMMU, but simply detects that it is active, resets only the devices that are necessary for the crash kernel to operate, and re-programs only the translations for those devices. All other translations remain the same (and remain valid) so all leftover DMA continues into its buffer in the system kernel area where it is harmless. New translations needed by the kdump kernel are added to the existing tables. I have not yet tried this, so I am not ready to propose it as anything more than a discussion topic at this time. It might work this way: (A small modification to case 3a above) IOMMU on in system kernel, kdump kernel accepts active IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel detects active IOMMU and doesn't touch it DMA targets IOMMU-mapped system-kernel memory kdump kernel does not re-initialize IOMMU hardware kdump kernel initializes IOMMU in-memory management structures kdump kernel calls device drivers' standard initialization functions Drivers initialize their own devices -- DMA from that device stops When drivers request new DMA mappings, the kdump IOMMU driver: 1. Updates its in-memory mgt structures for that device & range 2. Updates IOMMU translate tables for that device & range . Translations for all other devices & ranges are unchanged 3. Flushes IOMMU TLB to force IOMMU hardware update Notes: A. This seems very similar to case 1 B. Only touch devices with drivers loaded into kdump kernel. . 
No need to touch devices that kdump is not going to use. C. This assumes that all DMA device drivers used by kdump will initialize the device before requesting DMA mappings. . Seems reasonable, but need to confirm how many do (or don't) . Only device drivers that might be included in the kdump kernel need to observe this initialization ordering. D. Could copy system kernel's translate tables into kdump kernel and adjust pointers if this feels more trustworthy than using the original structures where they are in the system kernel. E. Handle IRQ remappings in a similar manner. Thanks, Bill Sumner -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Your quoting is messed up below] >> This is Takao's text > This is Bill's text On Tue, Jun 11, 2013 at 5:19 PM, Sumner, William <bill.sumner@hp.com> wrote: >>(2013/06/11 11:20), Bjorn Helgaas wrote: >>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >>reset_devices instead of adding new one, and using quirk or something to >>avoid such a buggy devices. > > With respect to "and using quirk or something to avoid such buggy devices", > I believe that it will be necessary to provide a mechanism for devices that > need special handling to do the reset ... You mean something like pci_dev_specific_reset()? >>So, basically I agree with using reset_devices, but I want to prepare >>workaround in case this reset causes something wrong. >> > I like the ability to specify the original "reset_devices" separately from > invoking this new mechanism. With so many different uses for Linux in > so many different environments and with so many different device drivers > it seems reasonable to keep the ability to tell the device drivers to > reset their devices -- instead of pulling the reset line on all devices. The problem with adding new options for this and that is it confuses users and fragments testing. Users already randomly try options and publish combinations that happen to work as "solutions." Then we don't get problem reports and don't get a chance to fix things properly. The ideal OS would have zero options and be able to figure everything out by itself. So that's my bias about giving users flexibility by adding new options -- I don't like it, and I think we have to accept some lack of flexibility to keep the system complexity tractable :) > ... 
> Thinking out of the box: > Much of the discussion about dealing with the ongoing DMA leftover > from the system kernel has assumed that the crash kernel will reset > the IOMMU -- which causes various problems if done while there is any DMA > still active -- which leads to the idea of stopping all of the DMA. > > Suppose the crash kernel does not reset the hardware IOMMU, but simply > detects that it is active, resets only the devices that are necessary > for the crash kernel to operate, and re-programs only the translations > for those devices. All other translations remain the same (and remain valid) > so all leftover DMA continues into its buffer in the system kernel area > where it is harmless. New translations needed by the kdump kernel are > added to the existing tables. > > I have not yet tried this, so I am not ready to propose it as anything more > than a discussion topic at this time. > > It might work this way: (A small modification to case 3a above) > > IOMMU on in system kernel, kdump kernel accepts active IOMMU > system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory > kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory > kdump kernel detects active IOMMU and doesn't touch it > DMA targets IOMMU-mapped system-kernel memory > kdump kernel does not re-initialize IOMMU hardware > kdump kernel initializes IOMMU in-memory management structures > kdump kernel calls device drivers' standard initialization functions > Drivers initialize their own devices -- DMA from that device stops > When drivers request new DMA mappings, the kdump IOMMU driver: > 1. Updates its in-memory mgt structures for that device & range > 2. Updates IOMMU translate tables for that device & range > . Translations for all other devices & ranges are unchanged > 3. Flushes IOMMU TLB to force IOMMU hardware update This is certainly an interesting idea. 
It would require some additional smarts in the IOMMU driver to take over existing page tables. Those data structures from the system kernel are outside the memory map known to the kdump kernel, so there'd likely be VM issues there. The IOMMU driver would also have to be able to reconstruct the state for the bus address space allocator, e.g., use the I/O page tables to rebuild a bitmap of allocated space. We'll end up with a bunch of dma_map() calls from the system kernel that don't have corresponding dma_unmap()s. I guess when a driver initializes its device, there might have to be a new interface to "unmap any existing mappings for this device, even though I don't know what they are." I think it's possible, but it sounds like quite a lot of work and I'm not sure it's worth it for this relatively limited use case. And it's more IOMMU-specific than a "reset devices" strategy, so most of it would have to be done in each IOMMU driver. Bjorn
[+cc Vivek, Haren; sorry I didn't think to add you earlier] On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > (2013/06/11 11:20), Bjorn Helgaas wrote: >> I'm not sure you need to reset legacy devices (or non-PCI devices) >> yet, but the current hook isn't anchored anywhere -- it's just an >> fs_initcall() that doesn't give the reader any clue about the >> connection between the reset and the problem it's solving. >> >> If we do something like this patch, I think it needs to be done at the >> point where we enable or disable the IOMMU. That way, it's connected >> to the important event, and there's a clue about how to make >> corresponding fixes for other IOMMUs. > > Ok. pci_iommu_init() is appropriate place to add this hook? I looked at various IOMMU init places today, and it's far more complicated and varied than I had hoped. This reset scheme depends on enumerating PCI devices before we initialize the IOMMU used by those devices. x86 works that way today, but not all architectures do (see the sparc pci_fire_pbm_init(), for example). And I think conceptually, the IOMMU should be enumerated and initialized *before* the devices that use it. So I'm uncomfortable with that aspect of this scheme. It would be at least conceivable to reset the devices in the system kernel, before the kexec. I know we want to do as little as possible in the crashing kernel, but it's at least a possibility, and it might be cleaner. Bjorn
On 06/11/2013 07:19 PM, Sumner, William wrote: > >> (2013/06/11 11:20), Bjorn Helgaas wrote: >>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<indou.takao@jp.fujitsu.com> wrote: >>>> (2013/06/07 13:14), Bjorn Helgaas wrote: >>> >>>>> One thing I'm not sure about is that you are only resetting PCIe >>>>> devices, but I don't think the problem is actually specific to PCIe, >>>>> is it? I think the same issue could occur on any system with an >>>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >>>>> there are systems with IOMMUs for plain old PCI devices, e.g., >>>>> PA-RISC. >>>> >>>> Right, this is not specific to PCIe. The reasons why the target is only >>>> PCIe is just to make algorithm to reset simple. It is possible to reset >>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I >>>> thought recently most systems used PCIe and there was little demand for >>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI >>>> devices, but I'll do if there are requests :-) >>> >>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>> yet, but the current hook isn't anchored anywhere -- it's just an >>> fs_initcall() that doesn't give the reader any clue about the >>> connection between the reset and the problem it's solving. >>> >>> If we do something like this patch, I think it needs to be done at the >>> point where we enable or disable the IOMMU. That way, it's connected >>> to the important event, and there's a clue about how to make >>> corresponding fixes for other IOMMUs. >> >> Ok. pci_iommu_init() is appropriate place to add this hook? >> >>> We already have a "reset_devices" boot option. This is for the same >>> purpose, as far as I can tell, and I'm not sure there's value in >>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In >>> fact, there's nothing specific even to PCI here. 
The Intel VT-d docs >>> seem carefully written so they could apply to either PCIe or non-PCI >>> devices. >> >> Yeah, I can integrate this option into reset_devices. The reason why >> I separate them is to avoid regression. >> >> I have tested my patch on many machines and basically it worked, but I >> found two machines where this reset patch caused problem. The first one >> was caused by bug of raid card firmware. After updating firmware, this >> patch worked. The second one was due to bug of PCIe switch chip. I >> reported this bug to the vendor but it is not fixed yet. >> >> Anyway, this patch may cause problems on such a buggy machine, so I >> introduced new boot parameter so that user can enable or disable this >> function aside from reset_devices. >> >> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >> reset_devices instead of adding new one, and using quirk or something to >> avoid such a buggy devices. > > With respect to "and using quirk or something to avoid such buggy devices", > I believe that it will be necessary to provide a mechanism for devices that > need special handling to do the reset -- perhaps something like a list > of tuples: (device_type, function_to_call) with a default function_to_call > when the device_type is not found in the list. These functions would > need to be physically separate from the device driver because if the device > is present it needs to be reset even if the crash kernel chooses not to load > the driver for that device. > >> >> So, basically I agree with using reset_devices, but I want to prepare >> workaround in case this reset causes something wrong. >> > I like the ability to specify the original "reset_devices" separately from > invoking this new mechanism. 
With so many different uses for Linux in > so many different environments and with so many different device drivers > it seems reasonable to keep the ability to tell the device drivers to > reset their devices -- instead of pulling the reset line on all devices. > > I also like the ability to invoke the new reset feature separately from > telling the device drivers to do it. > >> >>> >>>>> I tried to make a list of the interesting scenarios and the events >>>>> that are relevant to this problem: >>>>> >>>>> Case 1: IOMMU off in system, off in kdump kernel >>>>> system kernel leaves IOMMU off >>>>> DMA targets system-kernel memory >>>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>>> DMA targets system-kernel memory (harmless) >>>>> kdump kernel re-inits device >>>>> DMA targets kdump-kernel memory >>>>> >>>>> Case 2: IOMMU off in system kernel, on in kdump kernel >>>>> system kernel leaves IOMMU off >>>>> DMA targets system-kernel memory >>>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>>> DMA targets system-kernel memory (harmless) >>>>> kdump kernel enables IOMMU with no valid mappings >>>>> DMA causes IOMMU errors (annoying but harmless) >>>>> kdump kernel re-inits device >>>>> DMA targets IOMMU-mapped kdump-kernel memory >>>>> >>>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>>>> system kernel enables IOMMU >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kdump kernel doesn't know about IOMMU or doesn't touch it >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kdump kernel re-inits device >>>>> kernel assumes no IOMMU, so all new DMA mappings are invalid >>>>> because DMAs actually do go through the IOMMU >>>>> (** corruption or other non-recoverable error likely **) >>>>> >>>>> >>>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >>>>> system kernel enables IOMMU >>>>> 
DMA targets IOMMU-mapped system-kernel memory >>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kdump kernel disables IOMMU >>>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >>>>> (** corruption or other non-recoverable error likely **) >>>>> kdump kernel re-inits device >>>>> DMA targets kdump-kernel memory >>>>> >>>>> Case 4: IOMMU on in system kernel, on in kdump kernel >>>>> system kernel enables IOMMU >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>> DMA targets IOMMU-mapped system-kernel memory >>>>> kdump kernel enables IOMMU with no valid mappings >>>>> DMA causes IOMMU errors (annoying but harmless) >>>>> kdump kernel re-inits device >>>>> DMA targets IOMMU-mapped kdump-kernel memory >>>> >>>> This is not harmless. Errors like PCI SERR are detected here, and it >>>> makes driver or system unstable, and kdump fails. I also got report that >>>> system hangs up due to this. >>> >>> OK, let's take this slowly. Does an IOMMU error in the system kernel >>> also cause SERR or make the system unstable? Is that the expected >>> behavior on IOMMU errors, or is there something special about the >>> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., >>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are >>> reported with printk and don't seem to cause an SERR. Maybe the SERR >>> is system-specific behavior? >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public) >>> report of IOMMU errors related to a driver bug where we just get >>> printks, not SERR. >> >> Yes, it depends on platform or devices. At least PCI SERR is detected on >> Fujitsu PRIMERGY BX920S2 and TX300S6. >> >> Intel VT-d documents[1] says: >> >> 3.5.1 Hardware Handling of Faulting DMA Requests >> DMA requests that result in remapping faults must be blocked by >> hardware. 
The exact method of DMA blocking is >> implementation-specific. For example: >> >> - Faulting DMA write requests may be handled in much the same way as >> hardware handles write requests to non-existent memory. For >> example, the DMA request is discarded in a manner convenient for >> implementations (such as by dropping the cycle, completing the >> write request to memory with all byte enables masked off, >> re-directed to a dummy memory location, etc.). >> >> - Faulting DMA read requests may be handled in much the same way as >> hardware handles read requests to non-existent memory. For >> example, the request may be redirected to a dummy memory location, >> returned as all 0's or 1's in a manner convenient to the >> implementation, or the request may be completed with an explicit >> error indication (preferred). For faulting DMA read requests from >> PCI Express devices, hardware must indicate "Unsupported Request" >> (UR) in the completion status field of the PCI Express read >> completion. >> >> So, after IOMMU error, its behavior is implementation-specific. >> >> [1] >> Intel Virtualization Technology for Directed I/O Architecture >> Specification Rev1.3 >> >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang >>> when the kdump kernel reboots (after successfully saving a crashdump). >>> But it is using "iommu=pt", which I don't believe makes sense. The >>> scenario is basically case 4 above, but instead of the kdump kernel >>> starting with no valid IOMMU mappings, it identity-maps bus addresses >>> to physical memory addresses. That's completely bogus because it's >>> certainly not what the system kernel did, so it's entirely likely to >>> make the system unstable or hang. This is not an argument for doing a >>> reset; it's an argument for doing something smarter than "iommu=pt" in >>> the kdump kernel.
> >>> We might still want to reset PCIe devices, but I want to make sure >>> that we're not papering over other issues when we do. Therefore, I'd >>> like to understand why IOMMU errors seem harmless in some cases but >>> not in others. >> >> As I wrote above, IOMMU behavior on error is up to platform/devcies. I >> think it also depends on the value of Uncorrectable Error Mask Register >> in AER registers of each device. >> >>>> Case 1: Harmless >>>> Case 2: Not tested >>>> Case 3a: Not tested >>>> Case 3b: Cause problem, fixed by my patch >>>> Case 4: Cause problem, fixed by my patch >>>> >>>> I have never tested case 2 and 3a, but I think it also causes problem. >>> >>> I do not believe we need to support case 3b (IOMMU on in system kernel >>> but disabled in kdump kernel). There is no way to make that reliable >>> unless every single device that may perform DMA is reset, and since >>> you don't reset legacy PCI or VGA devices, you're not even proposing >>> to do that. >>> >>> I think we need to support case 1 (for systems with no IOMMU at all) >>> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If >>> the kdump kernel can detect whether the IOMMU is enabled, that should >>> be enough -- it could figure out automatically whether we're in case 1 >>> or 4. >> >> Ok, I completely agree. >> >> >>>>> Do you have any bugzilla references or problem report URLs you could >>>>> include here? >>>> >>>> I know three Red Hat bugzilla, but I think these are private article and >>>> you cannot see. I'll add you Cc list in bz so that you can see. >>>> >>>> BZ#743790: Kdump fails with intel_iommu=on >>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790 >>>> >>>> BZ#743495: Hardware error is detected with intel_iommu=on >>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 >>>> >>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt >>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299 >>> >>> Thanks for adding me to the CC lists. 
I looked all three and I'm not >>> sure there's anything sensitive in them. It'd be nice if they could >>> be made public if there's not. >> >> I really appreciate your comments! I'll confirm I can make it public. > > I would greatly appreciate being able to see the bugzillas relating to > this patch. >> >> Thanks, >> Takao Indoh > > Thinking out of the box: > Much of the discussion about dealing with the ongoing DMA leftover > from the system kernel has assumed that the crash kernel will reset > the IOMMU -- which causes various problems if done while there is any DMA > still active -- which leads to the idea of stopping all of the DMA. > > Suppose the crash kernel does not reset the hardware IOMMU, but simply > detects that it is active, resets only the devices that are necessary > for the crash kernel to operate, and re-programs only the translations This suggestion brings up this one: Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel hook to the IOMMU fault handler. While kdump kernel re-initing, IOMMU faults are grabbed by the kexec'd kernel and the device is reset. IOW, directed, kdump device reset. If no faults, then device is idle, or re-init'd and using new maps, and all is well. Once kdump kernel fully init'd, then just return from kdump kernel callout, and let system do its fault handling. It can be made mostly common (reset code in kexec mod under drivers/iommu), with simple calls out from each IOMMU fault handler. Of course, this assumes that systems don't turn IOMMU faults into system SERRs, and crash the system. IMO, as I've stated to a number of system developers, IOMMU (mapping) faults should not crash the system -- they already isolate, and prevent corruption, so worse case, some part of the system will stop doing I/O, and that will either get retried, or aborted and a cleaner panic (and kdump kernel switch) will ensue. > for those devices. 
All other translations remain the same (and remain valid) > so all leftover DMA continues into its buffer in the system kernel area > where it is harmless. New translations needed by the kdump kernel are > added to the existing tables. > A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped. I would expect the kdump kernel to reset devices & acquire new dma mappings upon reboot. Thus, the only issue is how to 'throttle' IOMMU faults, and not allow them to crash systems while the system is recovering/resetting itself, but it's not one big (power) reset to cause it. > I have not yet tried this, so I am not ready to propose it as anything more > than a discussion topic at this time. > > It might work this way: (A small modification to case 3a above) > > IOMMU on in system kernel, kdump kernel accepts active IOMMU > system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory > kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory it may not, it may be bad/bogus device I/O causing the crash... > kdump kernel detects active IOMMU and doesn't touch it > DMA targets IOMMU-mapped system-kernel memory > kdump kernel does not re-initialize IOMMU hardware > kdump kernel initializes IOMMU in-memory management structures > kdump kernel calls device drivers' standard initialization functions > Drivers initialize their own devices -- DMA from that device stops > When drivers request new DMA mappings, the kdump IOMMU driver: > 1. Updates its in-memory mgt structures for that device& range > 2. Updates IOMMU translate tables for that device& range > . Translations for all other devices& ranges are unchanged > 3. Flushes IOMMU TLB to force IOMMU hardware update > > Notes: > A. This seems very similar to case 1 > > B. Only touch devices with drivers loaded into kdump kernel. > . No need to touch devices that kdump is not going to use. > > C. 
This assumes that all DMA device drivers used by kdump will > initialize the device before requesting DMA mappings. > . Seems reasonable, but need to confirm how many do (or don't) > . Only device drivers that might be included in the kdump > kernel need to observe this initialization ordering. > > D. Could copy system kernel's translate tables into kdump kernel > and adjust pointers if this feels more trustworthy than using > the original structures where they are in the system kernel. > > E. Handle IRQ remappings in a similar manner. An even more serious attack vector: flood system w/interrupts (by assigned device in a guest), effectively creating a DoS, b/c intrs caused raised (hw) ipl, while bogus DMA does not cause system ipl elevation(blockage of other system progress), except when it generates IOMMU faults which become intrs. hmmm, wondering if another solution is to flip IOMMU fault handling into a poll-mode at kdump init time, and then flip it to intr-driven, once I/O has been completely init'd ? Per-device fault throttling would be a nice hw feature! (wishful thinking) > > Thanks, > Bill Sumner > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
(2013/06/12 13:45), Bjorn Helgaas wrote: > [+cc Vivek, Haren; sorry I didn't think to add you earlier] > > On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh > <indou.takao@jp.fujitsu.com> wrote: >> (2013/06/11 11:20), Bjorn Helgaas wrote: > >>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>> yet, but the current hook isn't anchored anywhere -- it's just an >>> fs_initcall() that doesn't give the reader any clue about the >>> connection between the reset and the problem it's solving. >>> >>> If we do something like this patch, I think it needs to be done at the >>> point where we enable or disable the IOMMU. That way, it's connected >>> to the important event, and there's a clue about how to make >>> corresponding fixes for other IOMMUs. >> >> Ok. pci_iommu_init() is appropriate place to add this hook? > > I looked at various IOMMU init places today, and it's far more > complicated and varied than I had hoped. > > This reset scheme depends on enumerating PCI devices before we > initialize the IOMMU used by those devices. x86 works that way today, > but not all architectures do (see the sparc pci_fire_pbm_init(), for Sorry, could you tell me which part depends on architecture? > example). And I think conceptually, the IOMMU should be enumerated > and initialized *before* the devices that use it. > > So I'm uncomfortable with that aspect of this scheme. > > It would be at least conceivable to reset the devices in the system > kernel, before the kexec. I know we want to do as little as possible > in the crashing kernel, but it's at least a possibility, and it might > be cleaner. I bet this will not be accepted by kdump maintainer. Everything in panic kernel is unreliable. Thanks, Takao Indoh
(2013/06/12 22:19), Don Dutile wrote: > On 06/11/2013 07:19 PM, Sumner, William wrote: >> >>> (2013/06/11 11:20), Bjorn Helgaas wrote: >>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<indou.takao@jp.fujitsu.com> wrote: >>>>> (2013/06/07 13:14), Bjorn Helgaas wrote: >>>> >>>>>> One thing I'm not sure about is that you are only resetting PCIe >>>>>> devices, but I don't think the problem is actually specific to PCIe, >>>>>> is it? I think the same issue could occur on any system with an >>>>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >>>>>> there are systems with IOMMUs for plain old PCI devices, e.g., >>>>>> PA-RISC. >>>>> >>>>> Right, this is not specific to PCIe. The reason why the target is only >>>>> PCIe is just to make the reset algorithm simple. It is possible to reset >>>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I >>>>> thought recently most systems used PCIe and there was little demand for >>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI >>>>> devices, but I'll do if there are requests :-) >>>> >>>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>>> yet, but the current hook isn't anchored anywhere -- it's just an >>>> fs_initcall() that doesn't give the reader any clue about the >>>> connection between the reset and the problem it's solving. >>>> >>>> If we do something like this patch, I think it needs to be done at the >>>> point where we enable or disable the IOMMU. That way, it's connected >>>> to the important event, and there's a clue about how to make >>>> corresponding fixes for other IOMMUs. >>> >>> Ok. pci_iommu_init() is appropriate place to add this hook? >>> >>>> We already have a "reset_devices" boot option. This is for the same >>>> purpose, as far as I can tell, and I'm not sure there's value in >>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In >>>> fact, there's nothing specific even to PCI here.
The Intel VT-d docs >>>> seem carefully written so they could apply to either PCIe or non-PCI >>>> devices. >>> >>> Yeah, I can integrate this option into reset_devices. The reason why >>> I separate them is to avoid regression. >>> >>> I have tested my patch on many machines and basically it worked, but I >>> found two machines where this reset patch caused problem. The first one >>> was caused by bug of raid card firmware. After updating firmware, this >>> patch worked. The second one was due to bug of PCIe switch chip. I >>> reported this bug to the vendor but it is not fixed yet. >>> >>> Anyway, this patch may cause problems on such a buggy machine, so I >>> introduced new boot parameter so that user can enable or disable this >>> function aside from reset_devices. >>> >>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >>> reset_devices instead of adding new one, and using quirk or something to >>> avoid such buggy devices. >> >> With respect to "and using quirk or something to avoid such buggy devices", >> I believe that it will be necessary to provide a mechanism for devices that >> need special handling to do the reset -- perhaps something like a list >> of tuples: (device_type, function_to_call) with a default function_to_call >> when the device_type is not found in the list. These functions would >> need to be physically separate from the device driver because if the device >> is present it needs to be reset even if the crash kernel chooses not to load >> the driver for that device. >> >>> >>> So, basically I agree with using reset_devices, but I want to prepare >>> workaround in case this reset causes something wrong. >>> >> I like the ability to specify the original "reset_devices" separately from >> invoking this new mechanism.
With so many different uses for Linux in >> so many different environments and with so many different device drivers >> it seems reasonable to keep the ability to tell the device drivers to >> reset their devices -- instead of pulling the reset line on all devices. >> >> I also like the ability to invoke the new reset feature separately from >> telling the device drivers to do it. >> >>> >>>> >>>>>> I tried to make a list of the interesting scenarios and the events >>>>>> that are relevant to this problem: >>>>>> >>>>>> Case 1: IOMMU off in system, off in kdump kernel >>>>>> system kernel leaves IOMMU off >>>>>> DMA targets system-kernel memory >>>>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>>>> DMA targets system-kernel memory (harmless) >>>>>> kdump kernel re-inits device >>>>>> DMA targets kdump-kernel memory >>>>>> >>>>>> Case 2: IOMMU off in system kernel, on in kdump kernel >>>>>> system kernel leaves IOMMU off >>>>>> DMA targets system-kernel memory >>>>>> kexec to kdump kernel (IOMMU off, devices untouched) >>>>>> DMA targets system-kernel memory (harmless) >>>>>> kdump kernel enables IOMMU with no valid mappings >>>>>> DMA causes IOMMU errors (annoying but harmless) >>>>>> kdump kernel re-inits device >>>>>> DMA targets IOMMU-mapped kdump-kernel memory >>>>>> >>>>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>>>>> system kernel enables IOMMU >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kdump kernel doesn't know about IOMMU or doesn't touch it >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kdump kernel re-inits device >>>>>> kernel assumes no IOMMU, so all new DMA mappings are invalid >>>>>> because DMAs actually do go through the IOMMU >>>>>> (** corruption or other non-recoverable error likely **) >>>>>> >>>>>> >>>>>> Case 3b: IOMMU on in system kernel, kdump kernel disables 
IOMMU >>>>>> system kernel enables IOMMU >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kdump kernel disables IOMMU >>>>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >>>>>> (** corruption or other non-recoverable error likely **) >>>>>> kdump kernel re-inits device >>>>>> DMA targets kdump-kernel memory >>>>>> >>>>>> Case 4: IOMMU on in system kernel, on in kdump kernel >>>>>> system kernel enables IOMMU >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kexec to kdump kernel (IOMMU on, devices untouched) >>>>>> DMA targets IOMMU-mapped system-kernel memory >>>>>> kdump kernel enables IOMMU with no valid mappings >>>>>> DMA causes IOMMU errors (annoying but harmless) >>>>>> kdump kernel re-inits device >>>>>> DMA targets IOMMU-mapped kdump-kernel memory >>>>> >>>>> This is not harmless. Errors like PCI SERR are detected here, and it >>>>> makes driver or system unstable, and kdump fails. I also got report that >>>>> system hangs up due to this. >>>> >>>> OK, let's take this slowly. Does an IOMMU error in the system kernel >>>> also cause SERR or make the system unstable? Is that the expected >>>> behavior on IOMMU errors, or is there something special about the >>>> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., >>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are >>>> reported with printk and don't seem to cause an SERR. Maybe the SERR >>>> is system-specific behavior? >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public) >>>> report of IOMMU errors related to a driver bug where we just get >>>> printks, not SERR. >>> >>> Yes, it depends on platform or devices. At least PCI SERR is detected on >>> Fujitsu PRIMERGY BX920S2 and TX300S6. 
>>> >>> Intel VT-d documents[1] says: >>> >>> 3.5.1 Hardware Handling of Faulting DMA Requests >>> DMA requests that result in remapping faults must be blocked by >>> hardware. The exact method of DMA blocking is >>> implementation-specific. For example: >>> >>> - Faulting DMA write requests may be handled in much the same way as >>> hardware handles write requests to non-existent memory. For >>> example, the DMA request is discarded in a manner convenient for >>> implementations (such as by dropping the cycle, completing the >>> write request to memory with all byte enables masked off, >>> re-directed to a dummy memory location, etc.). >>> >>> - Faulting DMA read requests may be handled in much the same way as >>> hardware handles read requests to non-existent memory. For >>> example, the request may be redirected to a dummy memory location, >>> returned as all 0's or 1's in a manner convenient to the >>> implementation, or the request may be completed with an explicit >>> error indication (preferred). For faulting DMA read requests from >>> PCI Express devices, hardware must indicate "Unsupported Request" >>> (UR) in the completion status field of the PCI Express read >>> completion. >>> >>> So, after IOMMU error, its behavior is implementation-specific. >>> >>> [1] >>> Intel Virtualization Technology for Directed I/O Architecture >>> Specification Rev1.3 >>> >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang >>>> when the kdump kernel reboots (after successfully saving a crashdump). >>>> But it is using "iommu=pt", which I don't believe makes sense. The >>>> scenario is basically case 4 above, but instead of the kdump kernel >>>> starting with no valid IOMMU mappings, it identity-maps bus addresses >>>> to physical memory addresses. That's completely bogus because it's >>>> certainly not what the system kernel did, so it's entirely likely to >>>> make the system unstable or hang.
This is not an argument for doing a >>>> reset; it's an argument for doing something smarter than "iommu=pt" in >>>> the kdump kernel. >> >>>> We might still want to reset PCIe devices, but I want to make sure >>>> that we're not papering over other issues when we do. Therefore, I'd >>>> like to understand why IOMMU errors seem harmless in some cases but >>>> not in others. >>> >>> As I wrote above, IOMMU behavior on error is up to platform/devices. I >>> think it also depends on the value of Uncorrectable Error Mask Register >>> in AER registers of each device. >>> >>>>> Case 1: Harmless >>>>> Case 2: Not tested >>>>> Case 3a: Not tested >>>>> Case 3b: Cause problem, fixed by my patch >>>>> Case 4: Cause problem, fixed by my patch >>>>> >>>>> I have never tested case 2 and 3a, but I think it also causes problem. >>>> >>>> I do not believe we need to support case 3b (IOMMU on in system kernel >>>> but disabled in kdump kernel). There is no way to make that reliable >>>> unless every single device that may perform DMA is reset, and since >>>> you don't reset legacy PCI or VGA devices, you're not even proposing >>>> to do that. >>>> >>>> I think we need to support case 1 (for systems with no IOMMU at all) >>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If >>>> the kdump kernel can detect whether the IOMMU is enabled, that should >>>> be enough -- it could figure out automatically whether we're in case 1 >>>> or 4. >>> >>> Ok, I completely agree. >>> >>> >>>>>> Do you have any bugzilla references or problem report URLs you could >>>>>> include here? >>>>> >>>>> I know three Red Hat bugzilla, but I think these are private article and >>>>> you cannot see. I'll add you to the Cc list in bz so that you can see.
>>>>> >>>>> BZ#743790: Kdump fails with intel_iommu=on >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790 >>>>> >>>>> BZ#743495: Hardware error is detected with intel_iommu=on >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 >>>>> >>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299 >>>> >>>> Thanks for adding me to the CC lists. I looked at all three and I'm not >>>> sure there's anything sensitive in them. It'd be nice if they could >>>> be made public if there's not. >>> >>> I really appreciate your comments! I'll confirm I can make it public. >> >> I would greatly appreciate being able to see the bugzillas relating to >> this patch. >>> >>> Thanks, >>> Takao Indoh >> >> Thinking out of the box: >> Much of the discussion about dealing with the ongoing DMA leftover >> from the system kernel has assumed that the crash kernel will reset >> the IOMMU -- which causes various problems if done while there is any DMA >> still active -- which leads to the idea of stopping all of the DMA. >> >> Suppose the crash kernel does not reset the hardware IOMMU, but simply >> detects that it is active, resets only the devices that are necessary >> for the crash kernel to operate, and re-programs only the translations > > This suggestion brings up this one: > Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel > hook to the IOMMU fault handler. While kdump kernel re-initing, IOMMU faults are grabbed > by the kexec'd kernel and the device is reset. IOW, directed, kdump device reset. > If no faults, then device is idle, or re-init'd and using new maps, and > all is well. Once kdump kernel fully init'd, then just return from > kdump kernel callout, and let system do its fault handling. > It can be made mostly common (reset code in kexec mod under drivers/iommu), > with simple calls out from each IOMMU fault handler.
What I tried before is: 1) Prepare work queue handler 2) In IOMMU fault handler, wake up the work queue handler 3) In the work queue handler, reset the device that is the error source. As a result, this method did not work. The device is reset in the middle of its initialization. Please take a look at the message below. This is a message when megaraid_sas driver is loaded in second kernel. megasas: 00.00.05.40-rh2 Thu. Aug. 4 17:00:00 PDT 2011 megaraid_sas 0000:01:00.0: resetting MSI-X megasas: 0x1000:0x0073:0x1734:0x1177: bus 1:slot 0:func 0 megaraid_sas 0000:01:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28 megaraid_sas 0000:01:00.0: setting latency timer to 64 megasas: Waiting for FW to come to ready state DRHD: handling fault status reg 702 DMAR:[DMA Write] Request device [01:00.0] fault addr ffff9000 DMAR:[fault reason 05] PTE Write access is not set megasas: FW now in Ready state alloc irq_desc for 51 on node -1 alloc kstat_irqs on node -1 megaraid_sas 0000:01:00.0: irq 51 for MSI/MSI-X megasas_init_mfi: fw_support_ieee=67108864 Note that the DMAR error is detected during driver initialization. So when I added the reset method mentioned above, the device was reset here, and after that, megasas could not find its disks and kdump failed. Thanks, Takao Indoh > > Of course, this assumes that systems don't turn IOMMU faults into system SERRs, > and crash the system. IMO, as I've stated to a number of system developers, > IOMMU (mapping) faults should not crash the system -- they already isolate, and > prevent corruption, so worst case, some part of the system will stop doing I/O, > and that will either get retried, or aborted and a cleaner panic (and kdump > kernel switch) will ensue. > >> for those devices. All other translations remain the same (and remain valid) >> so all leftover DMA continues into its buffer in the system kernel area >> where it is harmless. New translations needed by the kdump kernel are >> added to the existing tables.
>> > A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped. > I would expect the kdump kernel to reset devices & acquire new dma mappings > upon reboot. Thus, the only issue is how to 'throttle' IOMMU faults, and > not allow them to crash systems while the system is recovering/resetting itself, > but it's not one big (power) reset to cause it. > >> I have not yet tried this, so I am not ready to propose it as anything more >> than a discussion topic at this time. >> >> It might work this way: (A small modification to case 3a above) >> >> IOMMU on in system kernel, kdump kernel accepts active IOMMU >> system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >> kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory > it may not, it may be bad/bogus device I/O causing the crash... > >> kdump kernel detects active IOMMU and doesn't touch it >> DMA targets IOMMU-mapped system-kernel memory >> kdump kernel does not re-initialize IOMMU hardware >> kdump kernel initializes IOMMU in-memory management structures >> kdump kernel calls device drivers' standard initialization functions >> Drivers initialize their own devices -- DMA from that device stops >> When drivers request new DMA mappings, the kdump IOMMU driver: >> 1. Updates its in-memory mgt structures for that device & range >> 2. Updates IOMMU translate tables for that device & range >> . Translations for all other devices & ranges are unchanged >> 3. Flushes IOMMU TLB to force IOMMU hardware update >> >> Notes: >> A. This seems very similar to case 1 >> >> B. Only touch devices with drivers loaded into kdump kernel. >> . No need to touch devices that kdump is not going to use. >> >> C. This assumes that all DMA device drivers used by kdump will >> initialize the device before requesting DMA mappings. >> . Seems reasonable, but need to confirm how many do (or don't) >> .
Only device drivers that might be included in the kdump >> kernel need to observe this initialization ordering. >> >> D. Could copy system kernel's translate tables into kdump kernel >> and adjust pointers if this feels more trustworthy than using >> the original structures where they are in the system kernel. >> >> E. Handle IRQ remappings in a similar manner. > An even more serious attack vector: flood system w/interrupts (by assigned device in a guest), > effectively creating a DoS, b/c intrs caused raised (hw) ipl, while bogus DMA does > not cause system ipl elevation(blockage of other system progress), except when it > generates IOMMU faults which become intrs. > hmmm, wondering if another solution is to flip IOMMU fault handling into a poll-mode > at kdump init time, and then flip it to intr-driven, once I/O has been completely init'd ? > Per-device fault throttling would be a nice hw feature! (wishful thinking) > > >> >> Thanks, >> Bill Sumner
On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > (2013/06/12 13:45), Bjorn Helgaas wrote: >> [+cc Vivek, Haren; sorry I didn't think to add you earlier] >> >> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh >> <indou.takao@jp.fujitsu.com> wrote: >>> (2013/06/11 11:20), Bjorn Helgaas wrote: >> >>>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>>> yet, but the current hook isn't anchored anywhere -- it's just an >>>> fs_initcall() that doesn't give the reader any clue about the >>>> connection between the reset and the problem it's solving. >>>> >>>> If we do something like this patch, I think it needs to be done at the >>>> point where we enable or disable the IOMMU. That way, it's connected >>>> to the important event, and there's a clue about how to make >>>> corresponding fixes for other IOMMUs. >>> >>> Ok. pci_iommu_init() is appropriate place to add this hook? >> >> I looked at various IOMMU init places today, and it's far more >> complicated and varied than I had hoped. >> >> This reset scheme depends on enumerating PCI devices before we >> initialize the IOMMU used by those devices. x86 works that way today, >> but not all architectures do (see the sparc pci_fire_pbm_init(), for > > Sorry, could you tell me which part depends on architecture? Your patch works if PCIe devices are reset before the kdump kernel enables the IOMMU. On x86, this is possible because PCI enumeration happens before the IOMMU initialization. On sparc, the IOMMU is initialized before PCI devices are enumerated, so there would still be a window where ongoing DMA could cause an IOMMU error. Of course, it might be possible to reorganize the sparc code to do the IOMMU init *after* it enumerates PCI devices. But I think that change would be hard to justify.
And I think even on x86, it would be better if we did the IOMMU init before PCI enumeration -- the PCI devices depend on the IOMMU, so logically the IOMMU should be initialized first so the PCI devices can be associated with it as they are enumerated. >> example). And I think conceptually, the IOMMU should be enumerated >> and initialized *before* the devices that use it. >> >> So I'm uncomfortable with that aspect of this scheme. >> >> It would be at least conceivable to reset the devices in the system >> kernel, before the kexec. I know we want to do as little as possible >> in the crashing kernel, but it's at least a possibility, and it might >> be cleaner. > > I bet this will not be accepted by kdump maintainer. Everything in panic > kernel is unreliable. kdump is inherently unreliable. The kdump kernel doesn't start from an arbitrary machine state. We don't expect it to tolerate all CPUs running, for example. Maybe it shouldn't be expected to tolerate PCI devices running, either. Bjorn
(2013/06/13 12:41), Bjorn Helgaas wrote: > On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >> (2013/06/12 13:45), Bjorn Helgaas wrote: >>> [+cc Vivek, Haren; sorry I didn't think to add you earlier] >>> >>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh >>> <indou.takao@jp.fujitsu.com> wrote: >>>> (2013/06/11 11:20), Bjorn Helgaas wrote: >>> >>>>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>>>> yet, but the current hook isn't anchored anywhere -- it's just an >>>>> fs_initcall() that doesn't give the reader any clue about the >>>>> connection between the reset and the problem it's solving. >>>>> >>>>> If we do something like this patch, I think it needs to be done at the >>>>> point where we enable or disable the IOMMU. That way, it's connected >>>>> to the important event, and there's a clue about how to make >>>>> corresponding fixes for other IOMMUs. >>>> >>>> Ok. pci_iommu_init() is appropriate place to add this hook? >>> >>> I looked at various IOMMU init places today, and it's far more >>> complicated and varied than I had hoped. >>> >>> This reset scheme depends on enumerating PCI devices before we >>> initialize the IOMMU used by those devices. x86 works that way today, >>> but not all architectures do (see the sparc pci_fire_pbm_init(), for >> >> Sorry, could you tell me which part depends on architecture? > > Your patch works if PCIe devices are reset before the kdump kernel > enables the IOMMU. On x86, this is possible because PCI enumeration > happens before the IOMMU initialization. On sparc, the IOMMU is > initialized before PCI devices are enumerated, so there would still be > a window where ongoing DMA could cause an IOMMU error. Ok, understood, thanks. Hmmm, it seems to be difficult to find a method which is common to all architectures. So, what I can do for now is introducing a reset scheme which is only for x86. 1) Change this patch so that it works only on x86 platform.
For example call this reset code from x86_init.iommu.iommu_init() instead of fs_initcall. Or another idea is: 2) Enumerate PCI devices in IOMMU layer. That is: PCI layer Just provide interface to reset given struct pci_dev. Maybe pci_reset_function() looks good for this purpose. IOMMU layer Determine which devices should be reset. On kernel boot, check if IOMMU is already active or not, and if active, check IOMMU page table and reset devices whose entry exists there. > Of course, it might be possible to reorganize the sparc code to do the > IOMMU init *after* it enumerates PCI devices. But I think that change > would be hard to justify. > > And I think even on x86, it would be better if we did the IOMMU init > before PCI enumeration -- the PCI devices depend on the IOMMU, so > logically the IOMMU should be initialized first so the PCI devices can > be associated with it as they are enumerated. So third idea is: 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We need to implement new code to enumerate PCI devices and reset them for this purpose. Idea 2 is not difficult to implement, but one problem is that this method may be dangerous. We need to scan IOMMU page table which is used in previous kernel, but it may be broken. Idea 3 seems to be difficult to implement... > >>> example). And I think conceptually, the IOMMU should be enumerated >>> and initialized *before* the devices that use it. >>> >>> So I'm uncomfortable with that aspect of this scheme. >>> >>> It would be at least conceivable to reset the devices in the system >>> kernel, before the kexec. I know we want to do as little as possible >>> in the crashing kernel, but it's at least a possibility, and it might >>> be cleaner. >> >> I bet this will not be accepted by kdump maintainer. Everything in panic >> kernel is unreliable. > > kdump is inherently unreliable. The kdump kernel doesn't start from > an arbitrary machine state.
We don't expect it to tolerate all CPUs > running, for example. Maybe it shouldn't be expected to tolerate PCI > devices running, either. What I wanted to say is that any resources of first kernel are unreliable. Under panic situation, struct pci_dev tree may be broken, or pci_lock may already be held by someone, etc. So, if we do this in first kernel, maybe kdump needs its own code to enumerate PCI devices and reset them. Vivek? Thanks, Takao Indoh
Sorry for letting this discussion slide, I was busy with other work:-( Anyway, the summary of previous discussion is: - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on boot. This expects PCI enumeration is done before IOMMU initialization as follows. (1) PCI enumeration (2) fs_initcall ---> device reset (3) IOMMU initialization - This works on x86, but does not work on other architecture because IOMMU is initialized before PCI enumeration on some architectures. So, device reset should be done where IOMMU is initialized instead of initcall. - Or, as another idea, we can reset devices in first kernel(panic kernel) Resetting devices in panic kernel is against kdump policy and seems not to be good idea. So I think adding reset code into iommu initialization is better. I'll post patches for that. Another discussion point is how to handle buggy devices. Resetting buggy devices makes system more unstable. One of ideas is using boot parameter so that user can choose to reset devices or not. An existing parameter "reset_devices" is not helpful for this purpose because it is always necessary for kdump and user cannot get rid of it. Introducing new boot parameter seems to be unpopular with maintainers. Any ideas? Thanks, Takao Indoh (2013/06/14 11:11), Takao Indoh wrote: > (2013/06/13 12:41), Bjorn Helgaas wrote: >> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >>> (2013/06/12 13:45), Bjorn Helgaas wrote: >>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier] >>>> >>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh >>>> <indou.takao@jp.fujitsu.com> wrote: >>>>> (2013/06/11 11:20), Bjorn Helgaas wrote: >>>> >>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>>>>> yet, but the current hook isn't anchored anywhere -- it's just an >>>>>> fs_initcall() that doesn't give the reader any clue about the >>>>>> connection between the reset and the problem it's solving.
>>>>>> >>>>>> If we do something like this patch, I think it needs to be done at the >>>>>> point where we enable or disable the IOMMU. That way, it's connected >>>>>> to the important event, and there's a clue about how to make >>>>>> corresponding fixes for other IOMMUs. >>>>> >>>>> Ok. pci_iommu_init() is appropriate place to add this hook? >>>> >>>> I looked at various IOMMU init places today, and it's far more >>>> complicated and varied than I had hoped. >>>> >>>> This reset scheme depends on enumerating PCI devices before we >>>> initialize the IOMMU used by those devices. x86 works that way today, >>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for >>> >>> Sorry, could you tell me which part depends on architecture? >> >> Your patch works if PCIe devices are reset before the kdump kernel >> enables the IOMMU. On x86, this is possible because PCI enumeration >> happens before the IOMMU initialization. On sparc, the IOMMU is >> initialized before PCI devices are enumerated, so there would still be >> a window where ongoing DMA could cause an IOMMU error. > > Ok, understood, thanks. > > Hmmm, it seems to be difficult to find a method which is common to > all architectures. So, what I can do for now is introducing a reset scheme > which is only for x86. > > 1) Change this patch so that it works only on x86 platform. For example > call this reset code from x86_init.iommu.iommu_init() instead of > fs_initcall. > > Or another idea is: > > 2) Enumerate PCI devices in IOMMU layer. That is: > PCI layer > Just provide interface to reset given struct pci_dev. Maybe > pci_reset_function() looks good for this purpose. > IOMMU layer > Determine which devices should be reset. On kernel boot, check if > IOMMU is already active or not, and if active, check IOMMU page > table and reset devices whose entry exists there. > >> Of course, it might be possible to reorganize the sparc code to do the >> IOMMU init *after* it enumerates PCI devices.
But I think that change >> would be hard to justify. >> >> And I think even on x86, it would be better if we did the IOMMU init >> before PCI enumeration -- the PCI devices depend on the IOMMU, so >> logically the IOMMU should be initialized first so the PCI devices can >> be associated with it as they are enumerated. > > So third idea is: > > 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We > need to implement new code to enumerate PCI devices and reset them > for this purpose. > > Idea 2 is not difficult to implement, but one problem is that this > method may be dangerous. We need to scan IOMMU page table which is used > in previous kernel, but it may be broken. Idea 3 seems to be difficult > to implement... > > >> >>>> example). And I think conceptually, the IOMMU should be enumerated >>>> and initialized *before* the devices that use it. >>>> >>>> So I'm uncomfortable with that aspect of this scheme. >>>> >>>> It would be at least conceivable to reset the devices in the system >>>> kernel, before the kexec. I know we want to do as little as possible >>>> in the crashing kernel, but it's at least a possibility, and it might >>>> be cleaner. >>> >>> I bet this will be not accepted by kdump maintainer. Everything in panic >>> kernel is unreliable. >> >> kdump is inherently unreliable. The kdump kernel doesn't start from >> an arbitrary machine state. We don't expect it to tolerate all CPUs >> running, for example. Maybe it should be expected to tolerate PCI >> devices running, either. > > What I wanted to say is that any resources of first kernel are > unreliable. Under panic situation, struct pci_dev tree may be broken, or > pci_lock may be already hold by someone, etc. So, if we do this in first > kernel, maybe kdump needs its own code to enumerate PCI devices and > reset them. > > Vivek? 
>
> Thanks,
> Takao Indoh
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
>   (1) PCI enumeration
>   (2) fs_initcall ---> device reset
>   (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

I don't understand all the details, but I agree that the idea of trying to
reset the IOMMU in the crashed kernel might not fly.

> Another discussion point is how to handle buggy devices. Resetting buggy
> devices makes system more unstable. One of ideas is using boot parameter
> so that user can choose to reset devices or not.

So who would decide which device is buggy and should not be reset? Give
some details here.

Can't we simply blacklist the associated module, so that it never loads and
then never tries to reset the devices?

Thanks
Vivek
On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
<indou.takao@jp.fujitsu.com> wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
>   (1) PCI enumeration
>   (2) fs_initcall ---> device reset
>   (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

Of course nobody *wants* to do anything in the panic kernel. But simply
saying "it's against kdump policy and seems not to be a good idea" is not a
technical argument. There are things that are impractical to do in the
kdump kernel, so they have to be done in the panic kernel even though we
know the kernel is unreliable and the attempt may fail.

My point about IOMMU and PCI initialization order doesn't go away just
because it doesn't fit "kdump policy." Having system initialization occur
in a logical order is far more important than making kdump work.

Bjorn
(2013/07/25 23:24), Vivek Goyal wrote: > On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote: >> Sorry for letting this discussion slide, I was busy on other works:-( >> Anyway, the summary of previous discussion is: >> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >> boot. This expects PCI enumeration is done before IOMMU >> initialization as follows. >> (1) PCI enumeration >> (2) fs_initcall ---> device reset >> (3) IOMMU initialization >> - This works on x86, but does not work on other architecture because >> IOMMU is initialized before PCI enumeration on some architectures. So, >> device reset should be done where IOMMU is initialized instead of >> initcall. >> - Or, as another idea, we can reset devices in first kernel(panic kernel) >> >> Resetting devices in panic kernel is against kdump policy and seems not to >> be good idea. So I think adding reset code into iommu initialization is >> better. I'll post patches for that. > > I don't understand all the details but I agree that idea of trying to > reset IOMMU in crashed kernel might not fly. > >> >> Another discussion point is how to handle buggy devices. Resetting buggy >> devices makes system more unstable. One of ideas is using boot parameter >> so that user can choose to reset devices or not. > > So who would decide which device is buggy and don't reset it. Give > some details here. I found the case that kdump does not work after resetting devices and it works when removing reset patch. The cause of problem is a bug of PCIe switch chip. If there is boot parameter not to reset devices, user can use it as workaround. I think in this case we should add PCI quirk to avoid this buggy hardware, but we need to wait errata from vendor and it basically takes long time. > > Can't we simply blacklist associated module, so that it never loads > and then it never tries to reset the devices? > So you mean that device reset should be done on its driver loading? 
Thanks,
Takao Indoh
(2013/07/26 2:00), Bjorn Helgaas wrote: > On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh > <indou.takao@jp.fujitsu.com> wrote: >> Sorry for letting this discussion slide, I was busy on other works:-( >> Anyway, the summary of previous discussion is: >> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >> boot. This expects PCI enumeration is done before IOMMU >> initialization as follows. >> (1) PCI enumeration >> (2) fs_initcall ---> device reset >> (3) IOMMU initialization >> - This works on x86, but does not work on other architecture because >> IOMMU is initialized before PCI enumeration on some architectures. So, >> device reset should be done where IOMMU is initialized instead of >> initcall. >> - Or, as another idea, we can reset devices in first kernel(panic kernel) >> >> Resetting devices in panic kernel is against kdump policy and seems not to >> be good idea. So I think adding reset code into iommu initialization is >> better. I'll post patches for that. > > Of course nobody *wants* to do anything in the panic kernel. But > simply saying "it's against kdump policy and seems not to be a good > idea" is not a technical argument. There are things that are > impractical to do in the kdump kernel, so they have to be done in the > panic kernel even though we know the kernel is unreliable and the > attempt may fail. Accessing kernel data in panic kernel causes panic again, so - Don't touch kernel data in panic situation - Jump to kdump kernel as quickly as possible, and do things in safe kernel These are basic "kdump policy". Of course if there are any works which we cannot do in kdump kernel and can do only in panic kernel, for example saving registers or stopping cpus, we should do them in panic kernel. Resetting devices in panic kernel is worth considering if we can safely find pci_dev and reset it, but I have no idea how to do that because for example struct pci_dev may be borken. 
>
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy." Having system initialization
> occur in a logical order is far more important than making kdump work.

My next plan is as follows. I think this matches the logical order on boot.

drivers/pci/pci.c
- Add a function to reset a bus, for example,
  pci_reset_bus(struct pci_bus *bus)

drivers/iommu/intel-iommu.c
- On initialization, if the IOMMU is already enabled, call this bus reset
  function before disabling and re-enabling the IOMMU.

Thanks,
Takao Indoh
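The ordering in this plan can be sketched as a tiny userspace model (all names here are invented for illustration; the real hook would live in the intel-iommu init path, not in these functions):

```c
#include <assert.h>

/* Toy model of the proposed kdump boot path: if the IOMMU was left
 * enabled by the crashed kernel, reset the buses first, so no in-flight
 * DMA hits the IOMMU while it is being re-initialized. */

enum op { BUS_RESET, IOMMU_DISABLE, IOMMU_ENABLE };

static enum op oplog[8];
static int nops;
static int iommu_left_enabled = 1;      /* the kdump case */

static void pci_reset_bus_model(void)  { oplog[nops++] = BUS_RESET; }
static void iommu_disable_model(void)  { oplog[nops++] = IOMMU_DISABLE; }
static void iommu_enable_model(void)   { oplog[nops++] = IOMMU_ENABLE; }

/* Stands in for intel-iommu initialization in the second kernel. */
static void intel_iommu_init_model(void)
{
    if (iommu_left_enabled)
        pci_reset_bus_model();  /* quiesce DMA before touching the IOMMU */
    iommu_disable_model();
    iommu_enable_model();
}
```

The point of the model is only the sequence: reset happens while the old mappings are still live, before translation is torn down and re-enabled.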
On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > (2013/07/26 2:00), Bjorn Helgaas wrote: >> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh >> <indou.takao@jp.fujitsu.com> wrote: >>> Sorry for letting this discussion slide, I was busy on other works:-( >>> Anyway, the summary of previous discussion is: >>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >>> boot. This expects PCI enumeration is done before IOMMU >>> initialization as follows. >>> (1) PCI enumeration >>> (2) fs_initcall ---> device reset >>> (3) IOMMU initialization >>> - This works on x86, but does not work on other architecture because >>> IOMMU is initialized before PCI enumeration on some architectures. So, >>> device reset should be done where IOMMU is initialized instead of >>> initcall. >>> - Or, as another idea, we can reset devices in first kernel(panic kernel) >>> >>> Resetting devices in panic kernel is against kdump policy and seems not to >>> be good idea. So I think adding reset code into iommu initialization is >>> better. I'll post patches for that. >> >> Of course nobody *wants* to do anything in the panic kernel. But >> simply saying "it's against kdump policy and seems not to be a good >> idea" is not a technical argument. There are things that are >> impractical to do in the kdump kernel, so they have to be done in the >> panic kernel even though we know the kernel is unreliable and the >> attempt may fail. > > Accessing kernel data in panic kernel causes panic again, so > - Don't touch kernel data in panic situation > - Jump to kdump kernel as quickly as possible, and do things in safe > kernel > These are basic "kdump policy". Of course if there are any works which > we cannot do in kdump kernel and can do only in panic kernel, for > example saving registers or stopping cpus, we should do them in panic > kernel. 
>
> Resetting devices in panic kernel is worth considering if we can safely
> find pci_dev and reset it, but I have no idea how to do that because
> for example struct pci_dev may be borken.

Nobody can guarantee that the panic kernel can do *anything* safely because
any arbitrary kernel data or text may be corrupted. But if you consider any
specific data structure, e.g., CPU or PCI device lists, it's not very
likely that it will be corrupted.

>> My point about IOMMU and PCI initialization order doesn't go away just
>> because it doesn't fit "kdump policy." Having system initialization
>> occur in a logical order is far more important than making kdump work.
>
> My next plan is as follows. I think this is matched to logical order
> on boot.
>
> drivers/pci/pci.c
> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>
> drivers/iommu/intel-iommu.c
> - On initialization, if IOMMU is already enabled, call this bus reset
>   function before disabling and re-enabling IOMMU.

I raised this issue because of arches like sparc that enumerate the IOMMU
before the PCI devices that use it. In that situation, I think you're
proposing this:

  panic kernel
    enable IOMMU
    panic
  kdump kernel
    initialize IOMMU (already enabled)
    pci_reset_bus
    disable IOMMU
    enable IOMMU
    enumerate PCI devices

But the problem is that when you call pci_reset_bus(), you haven't
enumerated the PCI devices, so you don't know what to reset.

Bjorn
(2013/07/29 23:17), Bjorn Helgaas wrote: > On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >> (2013/07/26 2:00), Bjorn Helgaas wrote: >>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh >>> <indou.takao@jp.fujitsu.com> wrote: >>>> Sorry for letting this discussion slide, I was busy on other works:-( >>>> Anyway, the summary of previous discussion is: >>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >>>> boot. This expects PCI enumeration is done before IOMMU >>>> initialization as follows. >>>> (1) PCI enumeration >>>> (2) fs_initcall ---> device reset >>>> (3) IOMMU initialization >>>> - This works on x86, but does not work on other architecture because >>>> IOMMU is initialized before PCI enumeration on some architectures. So, >>>> device reset should be done where IOMMU is initialized instead of >>>> initcall. >>>> - Or, as another idea, we can reset devices in first kernel(panic kernel) >>>> >>>> Resetting devices in panic kernel is against kdump policy and seems not to >>>> be good idea. So I think adding reset code into iommu initialization is >>>> better. I'll post patches for that. >>> >>> Of course nobody *wants* to do anything in the panic kernel. But >>> simply saying "it's against kdump policy and seems not to be a good >>> idea" is not a technical argument. There are things that are >>> impractical to do in the kdump kernel, so they have to be done in the >>> panic kernel even though we know the kernel is unreliable and the >>> attempt may fail. >> >> Accessing kernel data in panic kernel causes panic again, so >> - Don't touch kernel data in panic situation >> - Jump to kdump kernel as quickly as possible, and do things in safe >> kernel >> These are basic "kdump policy". Of course if there are any works which >> we cannot do in kdump kernel and can do only in panic kernel, for >> example saving registers or stopping cpus, we should do them in panic >> kernel. 
>> >> Resetting devices in panic kernel is worth considering if we can safely >> find pci_dev and reset it, but I have no idea how to do that because >> for example struct pci_dev may be borken. > > Nobody can guarantee that the panic kernel can do *anything* safely > because any arbitrary kernel data or text may be corrupted. But if > you consider any specific data structure, e.g., CPU or PCI device > lists, it's not very likely that it will be corrupted. To reset device we need to scan pci device tree using for_each_pci_dev. Something like bust_spinlocks() to clear pci_lock forcibly is needed. Vivek, adding these into kdump is acceptable for you? Or any other ideas? I think iterating over a list like for_each_pci_dev is dangerous. > >>> My point about IOMMU and PCI initialization order doesn't go away just >>> because it doesn't fit "kdump policy." Having system initialization >>> occur in a logical order is far more important than making kdump work. >> >> My next plan is as follows. I think this is matched to logical order >> on boot. >> >> drivers/pci/pci.c >> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) >> >> drivers/iommu/intel-iommu.c >> - On initialization, if IOMMU is already enabled, call this bus reset >> function before disabling and re-enabling IOMMU. > > I raised this issue because of arches like sparc that enumerate the > IOMMU before the PCI devices that use it. In that situation, I think > you're proposing this: > > panic kernel > enable IOMMU > panic > kdump kernel > initialize IOMMU (already enabled) > pci_reset_bus > disable IOMMU > enable IOMMU > enumerate PCI devices > > But the problem is that when you call pci_reset_bus(), you haven't > enumerated the PCI devices, so you don't know what to reset. Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu initialization is based on the assumption that enumeration of PCI devices is already done. 
We can find the target devices from the IOMMU page tables instead of
scanning all devices in the pci tree.

Therefore, this idea is only for intel-iommu. Other architectures need to
implement their own reset code.

Thanks,
Takao Indoh
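As a rough illustration of finding targets from the remapping tables rather than from struct pci_dev: the VT-d root table has one entry per bus, and each context table one entry per devfn, so present context entries identify devices that still have live mappings. A simplified userspace model (hardware entries reduced to a present flag; these are not the real 128-bit entry formats or the intel-iommu API):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: walk the crashed kernel's (still-live) DMA-remapping
 * tables instead of the pci device tree.  A present flag per (bus, devfn)
 * stands in for "this device has a mapping and may still be doing DMA". */

#define NBUS   256
#define NDEVFN 256   /* 32 devices x 8 functions per bus */

struct ctx_table  { uint8_t present[NDEVFN]; };
struct root_table { struct ctx_table *ctx[NBUS]; };

/* Collect every (bus << 8 | devfn) that has a present context entry. */
static int collect_reset_targets(const struct root_table *rt,
                                 uint16_t *out, int max)
{
    int n = 0;
    for (int bus = 0; bus < NBUS; bus++) {
        const struct ctx_table *ct = rt->ctx[bus];
        if (!ct)
            continue;                 /* no context table for this bus */
        for (int devfn = 0; devfn < NDEVFN; devfn++)
            if (ct->present[devfn] && n < max)
                out[n++] = (uint16_t)(bus << 8 | devfn);
    }
    return n;
}
```

The resulting (bus, devfn) list is what the reset code would then act on, without ever trusting the old kernel's struct pci_dev tree.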
On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > (2013/07/29 23:17), Bjorn Helgaas wrote: >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: >>> (2013/07/26 2:00), Bjorn Helgaas wrote: >>>> My point about IOMMU and PCI initialization order doesn't go away just >>>> because it doesn't fit "kdump policy." Having system initialization >>>> occur in a logical order is far more important than making kdump work. >>> >>> My next plan is as follows. I think this is matched to logical order >>> on boot. >>> >>> drivers/pci/pci.c >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) >>> >>> drivers/iommu/intel-iommu.c >>> - On initialization, if IOMMU is already enabled, call this bus reset >>> function before disabling and re-enabling IOMMU. >> >> I raised this issue because of arches like sparc that enumerate the >> IOMMU before the PCI devices that use it. In that situation, I think >> you're proposing this: >> >> panic kernel >> enable IOMMU >> panic >> kdump kernel >> initialize IOMMU (already enabled) >> pci_reset_bus >> disable IOMMU >> enable IOMMU >> enumerate PCI devices >> >> But the problem is that when you call pci_reset_bus(), you haven't >> enumerated the PCI devices, so you don't know what to reset. > > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu > initialization is based on the assumption that enumeration of PCI devices > is already done. We can find target device from IOMMU page table instead > of scanning all devices in pci tree. > > Therefore, this idea is only for intel-iommu. Other architectures need > to implement their own reset code. That's my point. I'm opposed to adding code to PCI when it only benefits x86 and we know other arches will need a fundamentally different design. I would rather have a design that can work for all arches. 
If your implementation lives entirely under arch/x86 (or in intel-iommu.c,
I guess), I can't object as much.

However, I think that eventually even x86 should enumerate the IOMMUs via
ACPI before we enumerate PCI devices. It's pretty clear that's how BIOS
designers expect the OS to work. For example, sec 8.7.3 of the Intel
Virtualization Technology for Directed I/O spec, rev 1.3, shows the
expectation that remapping hardware (IOMMU) is initialized before
discovering the I/O hierarchy below a hot-added host bridge. Obviously
you're not talking about a hot-add scenario, but I think the same sequence
should apply at boot-time as well.

Bjorn
On Thu, Jul 25, 2013 at 11:00:46AM -0600, Bjorn Helgaas wrote: > On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh > <indou.takao@jp.fujitsu.com> wrote: > > Sorry for letting this discussion slide, I was busy on other works:-( > > Anyway, the summary of previous discussion is: > > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on > > boot. This expects PCI enumeration is done before IOMMU > > initialization as follows. > > (1) PCI enumeration > > (2) fs_initcall ---> device reset > > (3) IOMMU initialization > > - This works on x86, but does not work on other architecture because > > IOMMU is initialized before PCI enumeration on some architectures. So, > > device reset should be done where IOMMU is initialized instead of > > initcall. > > - Or, as another idea, we can reset devices in first kernel(panic kernel) > > > > Resetting devices in panic kernel is against kdump policy and seems not to > > be good idea. So I think adding reset code into iommu initialization is > > better. I'll post patches for that. > > Of course nobody *wants* to do anything in the panic kernel. But > simply saying "it's against kdump policy and seems not to be a good > idea" is not a technical argument. There are things that are > impractical to do in the kdump kernel, so they have to be done in the > panic kernel even though we know the kernel is unreliable and the > attempt may fail. I think resetting all devices in crashed kernel is really a lot of code. If there is a small piece of code, it can still be considered. I don't know much about IOMMU or PCI or PCIE. But I am taking one step back and discuss again the idea of not resetting the IOMMU in second kernel. I think resetting the bus is a good idea but just resetting PCIE will solve only part of the problem and we will same issues with devices on other buses. So what sounds more appealing if we could fix this particular problem at IOMMU level first (and continue to develp patches for resetting various buses). 
In the past also these ideas have been proposed that continue to use translation table from first kernel. Retain those mappings and don't reset IOMMU. Reserve some space for kdump mappings in first kernel and use that reserved mapping space in second kernel. It never got implemented though. Bjorn, so what's the fundamental problem with this idea? Also, what's wrong with DMAR error. If some device tried to do DMA, and DMA was blocked because IOMMU got reset and mappings are no more there, why does it lead to failure. Shouldn't we just reate limit error messages in such case and if device is needed, anyway driver will reset it. Other problem mentioned in this thread is PCI SERR. What is it? Is it some kind of error device reports if it can't do DMA successfully. Can these errors be simply ignored kdump kernel? This problem sounds similar to a device keeping interrupt asserted in second kernel and kernel simply disables the interrupt line if nobody claims the interrupt. IOW, it feels to me that we should handle the issue (DMAR error) at IOMMU level first (instead of trying to make sure that by the time we get to initialize IOMMU(), all devices in system have been quiesced and nobody is doing DMA). Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 30, 2013 at 09:59:16AM -0600, Bjorn Helgaas wrote: > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh > <indou.takao@jp.fujitsu.com> wrote: > > (2013/07/29 23:17), Bjorn Helgaas wrote: > >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote: > >>> (2013/07/26 2:00), Bjorn Helgaas wrote: > > >>>> My point about IOMMU and PCI initialization order doesn't go away just > >>>> because it doesn't fit "kdump policy." Having system initialization > >>>> occur in a logical order is far more important than making kdump work. > >>> > >>> My next plan is as follows. I think this is matched to logical order > >>> on boot. > >>> > >>> drivers/pci/pci.c > >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) > >>> > >>> drivers/iommu/intel-iommu.c > >>> - On initialization, if IOMMU is already enabled, call this bus reset > >>> function before disabling and re-enabling IOMMU. > >> > >> I raised this issue because of arches like sparc that enumerate the > >> IOMMU before the PCI devices that use it. In that situation, I think > >> you're proposing this: > >> > >> panic kernel > >> enable IOMMU > >> panic > >> kdump kernel > >> initialize IOMMU (already enabled) > >> pci_reset_bus > >> disable IOMMU > >> enable IOMMU > >> enumerate PCI devices > >> > >> But the problem is that when you call pci_reset_bus(), you haven't > >> enumerated the PCI devices, so you don't know what to reset. > > > > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu > > initialization is based on the assumption that enumeration of PCI devices > > is already done. We can find target device from IOMMU page table instead > > of scanning all devices in pci tree. > > > > Therefore, this idea is only for intel-iommu. Other architectures need > > to implement their own reset code. > > That's my point. 
> I'm opposed to adding code to PCI when it only benefits x86 and we know
> other arches will need a fundamentally different design. I would rather
> have a design that can work for all arches.

I agree. It makes sense to work on a design which works for all arches. And
that's where I feel that handling this problem at the IOMMU level makes
more sense, if we can.

Thanks
Vivek
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c3bfacb..8c9e8e4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			any pair of devices, possibly at the cost of
 			reduced performance.  This also guarantees
 			that hot-added devices will work.
+		pcie_reset_endpoint_devices	Reset PCIe endpoint on boot by
+				hot reset
 		cbiosize=nn[KMG]	The fixed amount of bus space which is
 			reserved for the CardBus bridge's IO window.
 			The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a899d8b..70c1205 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_fixup_cardbus);
 
+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+	struct pci_bus *subordinate;
+	struct pci_dev *child;
+
+	if (!pci_is_pcie(dev) || !dev->subordinate ||
+	    list_empty(&dev->subordinate->devices) ||
+	    ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	     (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+		return 0;
+
+	subordinate = dev->subordinate;
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+		    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+			/* Don't reset switch, bridge, VGA device */
+			return 0;
+	}
+
+	return 1;
+}
+
+static void __init save_downstream_configs(struct pci_dev *dev)
+{
+	struct pci_bus *subordinate;
+	struct pci_dev *child;
+
+	subordinate = dev->subordinate;
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		dev_info(&child->dev, "save state\n");
+		pci_save_state(child);
+	}
+}
+
+static void __init restore_downstream_configs(struct pci_dev *dev)
+{
+	struct pci_bus *subordinate;
+	struct pci_dev *child;
+
+	subordinate = dev->subordinate;
+	list_for_each_entry(child, &subordinate->devices, bus_list) {
+		dev_info(&child->dev, "restore state\n");
+		pci_restore_state(child);
+	}
+}
+
+static void __init do_downstream_device_reset(struct pci_dev *dev)
+{
+	u16 ctrl;
+
+	dev_info(&dev->dev, "Reset Secondary bus\n");
+
+	/* Assert Secondary Bus Reset */
+	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+	/* Read config again to flush previous write */
+	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+
+	msleep(2);
+
+	/* De-assert Secondary Bus Reset */
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+}
+
+struct pci_dev_entry {
+	struct list_head list;
+	struct pci_dev *dev;
+};
+static int __initdata pcie_reset_endpoint_devices;
+static int __init reset_pcie_endpoints(void)
+{
+	struct pci_dev *dev = NULL;
+	struct pci_dev_entry *pdev_entry, *tmp;
+	LIST_HEAD(pdev_list);
+
+	if (!pcie_reset_endpoint_devices)
+		return 0;
+
+	for_each_pci_dev(dev)
+		if (need_reset(dev)) {
+			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
+			pdev_entry->dev = dev;
+			list_add(&pdev_entry->list, &pdev_list);
+		}
+
+	list_for_each_entry(pdev_entry, &pdev_list, list)
+		save_downstream_configs(pdev_entry->dev);
+
+	list_for_each_entry(pdev_entry, &pdev_list, list)
+		do_downstream_device_reset(pdev_entry->dev);
+
+	msleep(1000);
+
+	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
+		restore_downstream_configs(pdev_entry->dev);
+		kfree(pdev_entry);
+	}
+
+	return 0;
+}
+fs_initcall_sync(reset_pcie_endpoints);
+
 static int __init pci_setup(char *str)
 {
 	while (str) {
@@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
 				pcie_bus_config = PCIE_BUS_PEER2PEER;
 			} else if (!strncmp(str, "pcie_scan_all", 13)) {
 				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
+			} else if (!strncmp(str, "pcie_reset_endpoint_devices",
+					27)) {
+				pcie_reset_endpoint_devices = 1;
 			} else {
 				printk(KERN_ERR "PCI: Unknown option `%s'\n",
 						str);
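The v2 "read config again to flush previous write" step in do_downstream_device_reset() can be modeled in userspace: treat the config write as posted (buffered) and let a config read drain the buffer, so the read-back guarantees the bridge has actually seen the reset assertion before the 2 ms delay starts. A sketch against a fake single-register device (not the kernel API; only the bit value matches the real bridge control register):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_BRIDGE_CTL_BUS_RESET 0x40    /* Secondary Bus Reset bit */

static uint16_t bridge_control;          /* the bridge's view */
static uint16_t posted;                  /* buffered write in flight */
static int have_posted;

static void cfg_write(uint16_t v) { posted = v; have_posted = 1; }

static uint16_t cfg_read(void)           /* reads are non-posted: flush */
{
    if (have_posted) { bridge_control = posted; have_posted = 0; }
    return bridge_control;
}

static void do_secondary_bus_reset_model(void)
{
    uint16_t ctrl;

    /* Assert Secondary Bus Reset */
    ctrl = cfg_read();
    cfg_write(ctrl | PCI_BRIDGE_CTL_BUS_RESET);

    /* v2 change: read back so the write actually reaches the bridge */
    ctrl = cfg_read();

    /* ... msleep(2) would go here, holding the bus in reset ... */

    /* De-assert Secondary Bus Reset */
    cfg_write(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET);
    cfg_read();                          /* flush the de-assert as well */
}
```

Without the read-back after the assert, the model's `have_posted` flag would still be set when the delay starts, i.e., the bridge might not yet be in reset while we are timing the reset pulse.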