From patchwork Wed Oct 17 06:37:39 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shivasharan Srikanteshwara X-Patchwork-Id: 10644639 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C37AC1508 for ; Wed, 17 Oct 2018 06:38:14 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B2E5A2A925 for ; Wed, 17 Oct 2018 06:38:14 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A71E52A942; Wed, 17 Oct 2018 06:38:14 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BC5E52A925 for ; Wed, 17 Oct 2018 06:38:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727208AbeJQOcV (ORCPT ); Wed, 17 Oct 2018 10:32:21 -0400 Received: from mail-pg1-f196.google.com ([209.85.215.196]:42952 "EHLO mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727099AbeJQOcU (ORCPT ); Wed, 17 Oct 2018 10:32:20 -0400 Received: by mail-pg1-f196.google.com with SMTP id i4-v6so12018800pgq.9 for ; Tue, 16 Oct 2018 23:38:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=broadcom.com; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=xxt+HooRpYQeCP5Lgv68V/2Q3KGaeEneYH2h97rNvmo=; b=acx4PARCPUpDXimSG08CrcSFSmRp8ZC3AYPaX3ywVz6gslhl3LzYTpmYxmlxdWJNo6 /Hs6BqTPf4R0HzH32eyfC4tiDAuMpZ5StwFSxHnxtCrT0Onxp8upOZRI8LWxwZ3H455g 0mNAU0DFxT/e1Xw/2NIps228ELRdSh6M5iSmk= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=xxt+HooRpYQeCP5Lgv68V/2Q3KGaeEneYH2h97rNvmo=; b=Hm8LyxODuRuqJajWJs7Xvql5guHs8FO1YGvvwtm+6mFL9018v03Z1w+4bG34b5UWVT 90pi57444pjMzhtbm8M/Bacc3wWwEStDiMa08Tv6TDoLzXaZSzXDb1oBFEaUQTAa9cvq W7XeS4F0h0OGHQw+NjbUS0B80+f2I78QmICsXQJEoRHBes703FZIhWMha5hfVSlhcDvM AUVo7AbVJptCCP4quc1U+ZtrI5gZnP+2XtQy3uf2R5AVg9X5gdYv1GwFtpqt9+e0e2lE 6I0a0SsSokGeu06+n/gC5Bdd0JHYvXSIY3fpdMQN8AiXM3cCV37u+2CdWFPQonVmtTPD Bisg== X-Gm-Message-State: ABuFfogn9fD7NAsZrgWvtAY59CtyhVdKmjSLr1ckwUzELUuXHptZapHV 17Ux7SEmerJ/yGlcfA4c4J/AyfmHvXUpTA== X-Google-Smtp-Source: ACcGV61JAAPwgv1JI2XGDfZZThA0SMQgWurVaLZ/42v0d+zjOkGPquJ16Xi4+QtB+MKw/hHKtK1myA== X-Received: by 2002:a63:5726:: with SMTP id l38-v6mr23423820pgb.118.1539758291410; Tue, 16 Oct 2018 23:38:11 -0700 (PDT) Received: from dhcp-135-24-192-142.dhcp.broadcom.net ([192.19.234.250]) by smtp.gmail.com with ESMTPSA id y131-v6sm19375425pfg.164.2018.10.16.23.38.08 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 16 Oct 2018 23:38:09 -0700 (PDT) From: Shivasharan S To: linux-scsi@vger.kernel.org Cc: kashyap.desai@broadcom.com, sumit.saxena@broadcom.com, Shivasharan S Subject: [PATCH V2 01/19] megaraid_sas: Add watchdog thread to detect Firmware fault Date: Tue, 16 Oct 2018 23:37:39 -0700 Message-Id: <1539758277-1882-2-git-send-email-shivasharan.srikanteshwara@broadcom.com> X-Mailer: git-send-email 2.4.3 In-Reply-To: <1539758277-1882-1-git-send-email-shivasharan.srikanteshwara@broadcom.com> References: <1539758277-1882-1-git-send-email-shivasharan.srikanteshwara@broadcom.com> Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Issue - Currently driver checks for Firmware state change from ISR context, and only when there are interrupts tied with no I/O completions. We have seen multiple cases where doorbell interrupts sent by firmware, to indicate FW state change are not processed by driver and it takes long time for driver to trigger OCR. And if there are no IOs running, since we only check the FW state as part of ISR code, fault goes undetected by driver and OCR will not be triggered. Fix - This patch introduces a separate workqueue that runs every one second to detect Firmware FAULT state and trigger reset immediately. As an additional gain, removing PCI reads from ISR to check FW state results in improved performance as well. Signed-off-by: Sumit Saxena Signed-off-by: Shivasharan S --- drivers/scsi/megaraid/megaraid_sas.h | 12 +- drivers/scsi/megaraid/megaraid_sas_base.c | 34 +++++- drivers/scsi/megaraid/megaraid_sas_fusion.c | 175 ++++++++++++++++++++-------- 3 files changed, 167 insertions(+), 54 deletions(-) diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h index 67d356d84717..8c0f74a2740a 100644 --- a/drivers/scsi/megaraid/megaraid_sas.h +++ b/drivers/scsi/megaraid/megaraid_sas.h @@ -1544,6 +1544,10 @@ enum FW_BOOT_CONTEXT { #define MR_CAN_HANDLE_64_BIT_DMA_OFFSET (1 << 25) +#define MEGASAS_WATCHDOG_THREAD_INTERVAL 1000 +#define MEGASAS_WAIT_FOR_NEXT_DMA_MSECS 20 +#define MEGASAS_WATCHDOG_WAIT_COUNT 50 + enum MR_ADAPTER_TYPE { MFI_SERIES = 1, THUNDERBOLT_SERIES = 2, @@ -2250,7 +2254,9 @@ struct megasas_instance { struct megasas_instance_template *instancet; struct tasklet_struct isr_tasklet; struct work_struct work_init; - struct work_struct crash_init; + struct delayed_work fw_fault_work; + struct workqueue_struct *fw_fault_work_q; + char fault_handler_work_q_name[48]; u8 flag; u8 unload; @@ -2539,7 +2545,6 @@ int megasas_get_target_prop(struct megasas_instance *instance, int megasas_set_crash_dump_params(struct megasas_instance *instance, u8 crash_buf_state); void megasas_free_host_crash_buffer(struct megasas_instance *instance); -void megasas_fusion_crash_dump_wq(struct work_struct *work); void megasas_return_cmd_fusion(struct megasas_instance *instance, struct megasas_cmd_fusion *cmd); @@ -2560,6 +2565,9 @@ int megasas_reset_target_fusion(struct scsi_cmnd *scmd); u32 mega_mod64(u64 dividend, u32 divisor); int megasas_alloc_fusion_context(struct megasas_instance *instance); void megasas_free_fusion_context(struct megasas_instance *instance); +int megasas_fusion_start_watchdog(struct megasas_instance *instance); +void megasas_fusion_stop_watchdog(struct megasas_instance *instance); + void megasas_set_dma_settings(struct megasas_instance *instance, struct megasas_dcmd_frame *dcmd, dma_addr_t dma_addr, u32 dma_len); diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c index 9b90c716f06d..4dc29e055461 100644 --- a/drivers/scsi/megaraid/megaraid_sas_base.c +++ b/drivers/scsi/megaraid/megaraid_sas_base.c @@ -5582,8 +5582,20 @@ static int megasas_init_fw(struct megasas_instance *instance) instance->skip_heartbeat_timer_del = 1; } + /* + * Create and start watchdog thread which will monitor + * controller state every 1 sec and trigger OCR when + * it enters fault state + */ + if (instance->adapter_type != MFI_SERIES) + if (megasas_fusion_start_watchdog(instance) != SUCCESS) + goto fail_start_watchdog; + return 0; +fail_start_watchdog: + if (instance->requestorId && !instance->skip_heartbeat_timer_del) + del_timer_sync(&instance->sriov_heartbeat_timer); fail_get_ld_pd_list: instance->instancet->disable_intr(instance); fail_init_adapter: @@ -6434,12 +6446,10 @@ static inline void megasas_init_ctrl_params(struct megasas_instance *instance) instance->disableOnlineCtrlReset = 1; instance->UnevenSpanSupport = 0; - if (instance->adapter_type != MFI_SERIES) { + if (instance->adapter_type != MFI_SERIES) INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq); - INIT_WORK(&instance->crash_init, megasas_fusion_crash_dump_wq); - } else { + else INIT_WORK(&instance->work_init, process_fw_state_change_wq); - } } /** @@ -6708,6 +6718,10 @@ megasas_suspend(struct pci_dev *pdev, pm_message_t state) if (instance->requestorId && !instance->skip_heartbeat_timer_del) del_timer_sync(&instance->sriov_heartbeat_timer); + /* Stop the FW fault detection watchdog */ + if (instance->adapter_type != MFI_SERIES) + megasas_fusion_stop_watchdog(instance); + megasas_flush_cache(instance); megasas_shutdown_controller(instance, MR_DCMD_HIBERNATE_SHUTDOWN); @@ -6843,8 +6857,16 @@ megasas_resume(struct pci_dev *pdev) if (megasas_start_aen(instance)) dev_err(&instance->pdev->dev, "Start AEN failed\n"); + /* Re-launch FW fault watchdog */ + if (instance->adapter_type != MFI_SERIES) + if (megasas_fusion_start_watchdog(instance) != SUCCESS) + goto fail_start_watchdog; + return 0; +fail_start_watchdog: + if (instance->requestorId && !instance->skip_heartbeat_timer_del) + del_timer_sync(&instance->sriov_heartbeat_timer); fail_init_mfi: megasas_free_ctrl_dma_buffers(instance); megasas_free_ctrl_mem(instance); @@ -6912,6 +6934,10 @@ static void megasas_detach_one(struct pci_dev *pdev) if (instance->requestorId && !instance->skip_heartbeat_timer_del) del_timer_sync(&instance->sriov_heartbeat_timer); + /* Stop the FW fault detection watchdog */ + if (instance->adapter_type != MFI_SERIES) + megasas_fusion_stop_watchdog(instance); + if (instance->fw_crash_state != UNAVAILABLE) megasas_free_host_crash_buffer(instance); scsi_remove_host(instance->host); diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c index f74b5ea24f0f..9ca4a52164bd 100644 --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -95,6 +96,7 @@ static void megasas_free_rdpq_fusion(struct megasas_instance *instance); static void megasas_free_reply_fusion(struct megasas_instance *instance); static inline void megasas_configure_queue_sizes(struct megasas_instance *instance); +static void megasas_fusion_crash_dump(struct megasas_instance *instance); /** * megasas_check_same_4gb_region - check if allocation @@ -1759,6 +1761,90 @@ megasas_init_adapter_fusion(struct megasas_instance *instance) return 1; } +/** + * megasas_fault_detect_work - Worker function of + * FW fault handling workqueue. + */ +static void +megasas_fault_detect_work(struct work_struct *work) +{ + struct megasas_instance *instance = + container_of(work, struct megasas_instance, + fw_fault_work.work); + u32 fw_state, dma_state, status; + + /* Check the fw state */ + fw_state = instance->instancet->read_fw_status_reg(instance->reg_set) & + MFI_STATE_MASK; + + if (fw_state == MFI_STATE_FAULT) { + dma_state = instance->instancet->read_fw_status_reg( + instance->reg_set) & MFI_STATE_DMADONE; + /* Start collecting crash, if DMA bit is done */ + if (instance->crash_dump_drv_support && + instance->crash_dump_app_support && dma_state) { + megasas_fusion_crash_dump(instance); + } else { + if (instance->unload == 0) { + status = megasas_reset_fusion(instance->host, 0); + if (status != SUCCESS) { + dev_err(&instance->pdev->dev, + "Failed from %s %d, do not re-arm timer\n", + __func__, __LINE__); + return; + } + } + } + } + + if (instance->fw_fault_work_q) + queue_delayed_work(instance->fw_fault_work_q, + &instance->fw_fault_work, + msecs_to_jiffies(MEGASAS_WATCHDOG_THREAD_INTERVAL)); +} + +int +megasas_fusion_start_watchdog(struct megasas_instance *instance) +{ + /* Check if the Fault WQ is already started */ + if (instance->fw_fault_work_q) + return SUCCESS; + + INIT_DELAYED_WORK(&instance->fw_fault_work, megasas_fault_detect_work); + + snprintf(instance->fault_handler_work_q_name, + sizeof(instance->fault_handler_work_q_name), + "poll_megasas%d_status", instance->host->host_no); + + instance->fw_fault_work_q = + create_singlethread_workqueue(instance->fault_handler_work_q_name); + if (!instance->fw_fault_work_q) { + dev_err(&instance->pdev->dev, "Failed from %s %d\n", + __func__, __LINE__); + return FAILED; + } + + queue_delayed_work(instance->fw_fault_work_q, + &instance->fw_fault_work, + msecs_to_jiffies(MEGASAS_WATCHDOG_THREAD_INTERVAL)); + + return SUCCESS; +} + +void +megasas_fusion_stop_watchdog(struct megasas_instance *instance) +{ + struct workqueue_struct *wq; + + if (instance->fw_fault_work_q) { + wq = instance->fw_fault_work_q; + instance->fw_fault_work_q = NULL; + if (!cancel_delayed_work_sync(&instance->fw_fault_work)) + flush_workqueue(wq); + destroy_workqueue(wq); + } +} + /** * map_cmd_status - Maps FW cmd status to OS cmd status * @cmd : Pointer to cmd @@ -3525,7 +3611,7 @@ irqreturn_t megasas_isr_fusion(int irq, void *devp) { struct megasas_irq_context *irq_context = devp; struct megasas_instance *instance = irq_context->instance; - u32 mfiStatus, fw_state, dma_state; + u32 mfiStatus; if (instance->mask_interrupts) return IRQ_NONE; @@ -3542,31 +3628,7 @@ irqreturn_t megasas_isr_fusion(int irq, void *devp) return IRQ_HANDLED; } - if (!complete_cmd_fusion(instance, irq_context->MSIxIndex)) { - instance->instancet->clear_intr(instance->reg_set); - /* If we didn't complete any commands, check for FW fault */ - fw_state = instance->instancet->read_fw_status_reg( - instance->reg_set) & MFI_STATE_MASK; - dma_state = instance->instancet->read_fw_status_reg - (instance->reg_set) & MFI_STATE_DMADONE; - if (instance->crash_dump_drv_support && - instance->crash_dump_app_support) { - /* Start collecting crash, if DMA bit is done */ - if ((fw_state == MFI_STATE_FAULT) && dma_state) - schedule_work(&instance->crash_init); - else if (fw_state == MFI_STATE_FAULT) { - if (instance->unload == 0) - schedule_work(&instance->work_init); - } - } else if (fw_state == MFI_STATE_FAULT) { - dev_warn(&instance->pdev->dev, "Iop2SysDoorbellInt" - "for scsi%d\n", instance->host->host_no); - if (instance->unload == 0) - schedule_work(&instance->work_init); - } - } - - return IRQ_HANDLED; + return complete_cmd_fusion(instance, irq_context->MSIxIndex); } /** @@ -4752,13 +4814,12 @@ int megasas_reset_fusion(struct Scsi_Host *shost, int reason) return retval; } -/* Fusion Crash dump collection work queue */ -void megasas_fusion_crash_dump_wq(struct work_struct *work) +/* Fusion Crash dump collection */ +void megasas_fusion_crash_dump(struct megasas_instance *instance) { - struct megasas_instance *instance = - container_of(work, struct megasas_instance, crash_init); u32 status_reg; u8 partial_copy = 0; + int wait = 0; status_reg = instance->instancet->read_fw_status_reg(instance->reg_set); @@ -4786,21 +4847,42 @@ void megasas_fusion_crash_dump_wq(struct work_struct *work) "allocated: %d\n", instance->drv_buf_alloc); } - /* - * Driver has allocated max buffers, which can be allocated - * and FW has more crash dump data, then driver will - * ignore the data. - */ - if (instance->drv_buf_index >= (instance->drv_buf_alloc)) { - dev_info(&instance->pdev->dev, "Driver is done copying " - "the buffer: %d\n", instance->drv_buf_alloc); - status_reg |= MFI_STATE_CRASH_DUMP_DONE; - partial_copy = 1; - } else { - memcpy(instance->crash_buf[instance->drv_buf_index], - instance->crash_dump_buf, CRASH_DMA_BUF_SIZE); - instance->drv_buf_index++; - status_reg &= ~MFI_STATE_DMADONE; + while (!(status_reg & MFI_STATE_CRASH_DUMP_DONE) && + (wait < MEGASAS_WATCHDOG_WAIT_COUNT)) { + if (!(status_reg & MFI_STATE_DMADONE)) { + /* + * Next crash dump buffer is not yet DMA'd by FW + * Check after 10ms. Wait for 1 second for FW to + * post the next buffer. If not bail out. + */ + wait++; + msleep(MEGASAS_WAIT_FOR_NEXT_DMA_MSECS); + status_reg = instance->instancet->read_fw_status_reg( + instance->reg_set); + continue; + } + + wait = 0; + if (instance->drv_buf_index >= instance->drv_buf_alloc) { + dev_info(&instance->pdev->dev, + "Driver is done copying the buffer: %d\n", + instance->drv_buf_alloc); + status_reg |= MFI_STATE_CRASH_DUMP_DONE; + partial_copy = 1; + break; + } else { + memcpy(instance->crash_buf[instance->drv_buf_index], + instance->crash_dump_buf, CRASH_DMA_BUF_SIZE); + instance->drv_buf_index++; + status_reg &= ~MFI_STATE_DMADONE; + } + + writel(status_reg, &instance->reg_set->outbound_scratch_pad); + readl(&instance->reg_set->outbound_scratch_pad); + + msleep(MEGASAS_WAIT_FOR_NEXT_DMA_MSECS); + status_reg = instance->instancet->read_fw_status_reg( + instance->reg_set); } if (status_reg & MFI_STATE_CRASH_DUMP_DONE) { @@ -4813,9 +4895,6 @@ void megasas_fusion_crash_dump_wq(struct work_struct *work) readl(&instance->reg_set->outbound_scratch_pad); if (!partial_copy) megasas_reset_fusion(instance->host, 0); - } else { - writel(status_reg, &instance->reg_set->outbound_scratch_pad); - readl(&instance->reg_set->outbound_scratch_pad); } }