Message ID | 1553104085-32312-3-git-send-email-al.kochet@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix eMMC hang on rk3188 and earlier |
+ Caesar Wang

On 2019/3/21 1:48, Alexander Kochetkov wrote:
> I've found that dw_mmc on my rk3188-based board sometimes stops
> transferring any data, with the error:
>
> kernel: dwmmc_rockchip 1021c000.dwmmc: Unexpected command timeout, state 3
>
> Further digging into the problem showed that sometimes one of the
> EDMA-based transfers hangs and aborts with an HTO error. I've made a
> test that 100%

I'm not sure what 100% means, but Caesar ran a QA test for RK3036 with EDMA-based dwmmc on the vendor 4.4 kernel, and it seems to be no big deal. The vendor 4.4 kernel didn't patch anything else in the EDMA code, but we did enhance the PL330 code and fix some bugs there, so you may want to give it a try.

> reproduces the error. I found that setting the max_segs parameter to 1
> fixes the problem.
>
> I guess the problem is hardware related and is related to the DMA
> controller implementation on rk3188. It is probably related to the
> missing FLUSHP, see commit 271e1b86e691 ("dmaengine: pl330: add quirk
> for broken no flushp"). It is possible that pl330 and dw_mmc get out of
> sync when the pl330 driver switches from one scatterlist to another. If
> we limit the scatterlist size to 1, we avoid switching scatterlists and
> avoid the hardware problem. Setting max_segs to 1 tells the mmc core to
> use at most one scatterlist per transfer.
>
> I guess that all other rk3xxx chips that lack FLUSHP are also affected
> by the problem, so I made the fix for all rk3xxx chips from rk2928 to
> rk3188.

It is hard to find these ancient platforms to test on, especially since some of them are EOL...
>
> Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
> ---
>  drivers/mmc/host/dw_mmc-rockchip.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/mmc/host/dw_mmc-rockchip.c b/drivers/mmc/host/dw_mmc-rockchip.c
> index 8c86a80..2eed922 100644
> --- a/drivers/mmc/host/dw_mmc-rockchip.c
> +++ b/drivers/mmc/host/dw_mmc-rockchip.c
> @@ -292,6 +292,24 @@ static int dw_mci_rk3288_parse_dt(struct dw_mci *host)
>  	return 0;
>  }
>
> +static void dw_mci_rk2928_init_slot(struct dw_mci *host)
> +{
> +	struct mmc_host *mmc = host->slot->mmc;
> +
> +	if (host->use_dma == TRANS_MODE_EDMAC) {
> +		/*
> +		 * Using max_segs > 1 leads to rare EDMA transfer hangs
> +		 * resulting in HTO errors.
> +		 */
> +		mmc->max_segs = 1;
> +		mmc->max_blk_size = 65535;
> +		mmc->max_blk_count = 64 * 512;
> +		mmc->max_req_size =
> +				mmc->max_blk_size * mmc->max_blk_count;
> +		mmc->max_seg_size = mmc->max_req_size;
> +	}
> +}
> +
>  static int dw_mci_rockchip_init(struct dw_mci *host)
>  {
>  	/* It is slot 8 on Rockchip SoCs */
> @@ -314,6 +332,7 @@ static int dw_mci_rockchip_init(struct dw_mci *host)
>
>  static const struct dw_mci_drv_data rk2928_drv_data = {
>  	.init = dw_mci_rockchip_init,
> +	.init_slot = dw_mci_rk2928_init_slot,
>  };
>
>  static const struct dw_mci_drv_data rk3288_drv_data = {
>
Hello!

I forgot to mention that the transfer hangs happen only on mem-to-dev transfers (DMA writes to the device) and never on dev-to-mem transfers.

Yes, I know that rk3188 and earlier chips are quite ancient, but we made custom hardware based on rk3188, and some of our customers report problems.

For testing I use a custom rk3188-based board with eMMC (the rk3188-based Radxa Rock with an SD card could probably also be used for testing) with cpufreq enabled. I wrote a simple script that does the following in a loop:

1. Creates 6 new empty partitions using mkfs.ext3, about 1 GB in total.
2. Extracts a 100 MB archive of a Linux image to a 512 MB partition (about 400 MB extracted size).
3. Sleeps for a random time between 60 and 120 seconds.

The CPU load looks like this:

cpufreq stats: 312 MHz:32.63%, 504 MHz:0.00%, 600 MHz:0.00%, 816 MHz:0.38%, 1.01 GHz:29.83%, 1.20 GHz:0.38%, 1.42 GHz:0.00%, 1.61 GHz:36.79% (494481)

The test can run for 6 hours before a transfer hangs. I used 5 devices for testing. Some devices can run the test for a long time, but some fail within an hour.

I played with the CPU clock settings in u-boot and the mmc bus clock settings in the dts file. I tried lowering the eMMC bus clock frequency to rule out PCB errors. I found that some combinations of settings make the test run longer, but the test fails anyway.
I also found that making the following change to dw_mmc results in a high error count:

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 9c54d60..dcf7d36e 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2905,10 +2905,9 @@ static int dw_mci_init_slot(struct dw_mci *host)
 	} else if (host->use_dma == TRANS_MODE_EDMAC) {
 		mmc->max_segs = 64;
 		mmc->max_blk_size = 65535;
-		mmc->max_blk_count = 65535;
-		mmc->max_req_size =
-				mmc->max_blk_size * mmc->max_blk_count;
-		mmc->max_seg_size = mmc->max_req_size;
+		mmc->max_seg_size = 0x1000;
+		mmc->max_req_size = mmc->max_seg_size * mmc->max_segs;
+		mmc->max_blk_count = mmc->max_req_size / 512;
 	} else {
 		/* TRANS_MODE_PIO */
 		mmc->max_segs = 64;

With these settings the mmc core splits a large transfer into multi-item scatterlists, which increases the rate of scatterlist switching inside pl330. So I assumed that the root of the problem is that the DMA goes out of sync with the device. For example, there is a patch in mainline Linux:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/dma/pl330.c?h=v5.0.3&id=1d48745b192a7a45bbdd3557b4c039609569ca41

It fixes a problem where EDMA can get out of sync with the device. But that patch doesn't work for rk3188, because rk3188 has the PL330_QUIRK_BROKEN_NO_FLUSHP quirk.

I'll try to backport the EDMA driver from the vendor 4.4 kernel and report the test result.

It is safer to fix the problem by patching the dw_mmc code than the pl330 code.
That is because the patch changes the transfer parameters from values known to work:

	mmc->max_segs = 64;
	mmc->max_blk_size = 65535;
	mmc->max_blk_count = 65535;
	mmc->max_req_size = mmc->max_blk_size * mmc->max_blk_count;
	mmc->max_seg_size = mmc->max_req_size;

to

	mmc->max_segs = 1;
	mmc->max_blk_size = 65535;
	mmc->max_blk_count = 64 * 512;
	mmc->max_req_size = mmc->max_blk_size * mmc->max_blk_count;
	mmc->max_seg_size = mmc->max_req_size;

> On 21 March 2019, at 5:31, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>
> + Caesar Wang
>
> On 2019/3/21 1:48, Alexander Kochetkov wrote:
>> I've found that dw_mmc on my rk3188-based board sometimes stops
>> transferring any data, with the error:
>> kernel: dwmmc_rockchip 1021c000.dwmmc: Unexpected command timeout, state 3
>> Further digging into the problem showed that sometimes one of the
>> EDMA-based transfers hangs and aborts with an HTO error. I've made a
>> test that 100%
>
> I'm not sure what 100% means, but Caesar ran a QA test for RK3036 with
> EDMA-based dwmmc on the vendor 4.4 kernel, and it seems to be no big
> deal. The vendor 4.4 kernel didn't patch anything else in the EDMA code,
> but we did enhance the PL330 code and fix some bugs there, so you may
> want to give it a try.
>
>> reproduces the error. I found that setting the max_segs parameter to 1
>> fixes the problem.
>> I guess the problem is hardware related and is related to the DMA
>> controller implementation on rk3188. It is probably related to the
>> missing FLUSHP, see commit 271e1b86e691 ("dmaengine: pl330: add quirk
>> for broken no flushp"). It is possible that pl330 and dw_mmc get out of
>> sync when the pl330 driver switches from one scatterlist to another. If
>> we limit the scatterlist size to 1, we avoid switching scatterlists and
>> avoid the hardware problem. Setting max_segs to 1 tells the mmc core to
>> use at most one scatterlist per transfer.
>> I guess that all other rk3xxx chips that lack FLUSHP are also affected
>> by the problem, so I made the fix for all rk3xxx chips from rk2928 to
>> rk3188.
diff --git a/drivers/mmc/host/dw_mmc-rockchip.c b/drivers/mmc/host/dw_mmc-rockchip.c
index 8c86a80..2eed922 100644
--- a/drivers/mmc/host/dw_mmc-rockchip.c
+++ b/drivers/mmc/host/dw_mmc-rockchip.c
@@ -292,6 +292,24 @@ static int dw_mci_rk3288_parse_dt(struct dw_mci *host)
 	return 0;
 }

+static void dw_mci_rk2928_init_slot(struct dw_mci *host)
+{
+	struct mmc_host *mmc = host->slot->mmc;
+
+	if (host->use_dma == TRANS_MODE_EDMAC) {
+		/*
+		 * Using max_segs > 1 leads to rare EDMA transfer hangs
+		 * resulting in HTO errors.
+		 */
+		mmc->max_segs = 1;
+		mmc->max_blk_size = 65535;
+		mmc->max_blk_count = 64 * 512;
+		mmc->max_req_size =
+				mmc->max_blk_size * mmc->max_blk_count;
+		mmc->max_seg_size = mmc->max_req_size;
+	}
+}
+
 static int dw_mci_rockchip_init(struct dw_mci *host)
 {
 	/* It is slot 8 on Rockchip SoCs */
@@ -314,6 +332,7 @@ static int dw_mci_rockchip_init(struct dw_mci *host)

 static const struct dw_mci_drv_data rk2928_drv_data = {
 	.init = dw_mci_rockchip_init,
+	.init_slot = dw_mci_rk2928_init_slot,
 };

 static const struct dw_mci_drv_data rk3288_drv_data = {
I've found that dw_mmc on my rk3188-based board sometimes stops transferring any data, with the error:

kernel: dwmmc_rockchip 1021c000.dwmmc: Unexpected command timeout, state 3

Further digging into the problem showed that sometimes one of the EDMA-based transfers hangs and aborts with an HTO error. I've made a test that 100% reproduces the error. I found that setting the max_segs parameter to 1 fixes the problem.

I guess the problem is hardware related and is related to the DMA controller implementation on rk3188. It is probably related to the missing FLUSHP, see commit 271e1b86e691 ("dmaengine: pl330: add quirk for broken no flushp"). It is possible that pl330 and dw_mmc get out of sync when the pl330 driver switches from one scatterlist to another. If we limit the scatterlist size to 1, we avoid switching scatterlists and avoid the hardware problem. Setting max_segs to 1 tells the mmc core to use at most one scatterlist per transfer.

I guess that all other rk3xxx chips that lack FLUSHP are also affected by the problem, so I made the fix for all rk3xxx chips from rk2928 to rk3188.

Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
---
 drivers/mmc/host/dw_mmc-rockchip.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)